Published on Dec. 1, 2017
Electronic medical records document health information both in structured format and as unstructured free text. Structured data include laboratory results, vital signs, and patient demographics. Unstructured free text is the prime source of healthcare information: it documents providers’ interpretations of health conditions, diagnoses, medical interventions, and impressions. To uncover unknown information and search for patterns in health data with computational methods, the free text must first be structured. For that, we use information extraction, a computational technique that analyzes free text and derives structured information from it. Information extracted from free text can be represented as relational triples, which are statements of a single fact in subject-relation-object form. Triples support the development of knowledge bases and knowledge graphs and the application of inference rules. In our research, we employ Stanford’s CoreNLP engine to extract information in triple format. This format lets us build Resource Description Framework (RDF) networks in which each subject and object becomes a node and the edges represent the relations between nodes. However, most of the triples produced by CoreNLP convey multiple facts (compound triples) rather than a single fact (atomic triples). Compound triples yield networks whose nodes represent multiple entities instead of a single entity, which causes problems for the network representation of our data. Here, we extend the use of CoreNLP to atomize compound triples. Our approach combines triple decomposition with ontological modeling and is based on the N-ary relational schema, which links an individual to multiple individuals or values.
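To make the distinction between compound and atomic triples concrete, the following is a minimal Python sketch, not the method used in this work. It does not call CoreNLP; it assumes triples have already been extracted as plain (subject, relation, object) tuples. The splitting rule (commas and the word "and"), the function names `split_entities`, `atomize`, and `to_nary`, the property name `hasValue`, and the example sentence are all hypothetical and chosen only for illustration.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Assumption: entities inside a compound subject or object are separated
# by commas and/or the word "and". Real clinical text needs a proper parser.
_SEPARATOR = re.compile(r"\s*,\s*(?:and\s+)?|\s+and\s+")


def split_entities(text: str) -> List[str]:
    """Split a compound phrase into its individual entity strings."""
    return [part.strip() for part in _SEPARATOR.split(text) if part and part.strip()]


def atomize(triple: Triple) -> List[Triple]:
    """Decompose a compound triple into atomic triples, one fact each."""
    subject, relation, obj = triple
    return [
        (s, relation, o)
        for s in split_entities(subject)
        for o in split_entities(obj)
    ]


def to_nary(triple: Triple, node_id: str) -> List[Triple]:
    """Model a compound object with an intermediate node (n-ary style).

    The subject is linked to one intermediate node, and that node is
    linked to each individual value. "hasValue" is a hypothetical
    property name, not one taken from any published ontology.
    """
    subject, relation, obj = triple
    triples = [(subject, relation, node_id)]
    triples.extend((node_id, "hasValue", o) for o in split_entities(obj))
    return triples


compound = ("patient", "denies", "fever, cough and chest pain")
atomic_triples = atomize(compound)
nary_triples = to_nary(compound, "_:n1")
```

In this sketch, `atomize` turns the single compound triple into three atomic triples, so each node in the resulting network stands for one entity. `to_nary` keeps the grouping by routing the relation through an intermediate node. A naive separator rule like this one would wrongly split entities that legitimately contain "and", which is one reason a real pipeline would rely on linguistic analysis rather than string splitting.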