CommonDataModel
New NLP related tables (proposal)
Hospitals generate huge amounts of data every day. Up to 80% of this data is collected in an unstructured format, a large portion of it as free text. To extract value from this data, it must be turned into structured data. This is especially relevant for clinical studies, which benefit from all the data captured during clinical practice. Natural language processing (NLP) solutions can structure clinical texts written by physicians, extracting and encoding relevant medical concepts while taking into account complex context such as negations, family/personal background, and past events, among others.
While OMOP CDM is a great schema for storing structured data, NLP results can get messy and complex. Although OMOP CDM v5.4 provides a note_nlp table to store these results, queries against this table can become clumsy and slow, so we extended the OMOP CDM with our own NLP schema to store the results generated by the NLP annotation process.
We propose this extension of the OMOP CDM to store the output of NLP solutions while integrating with the vocabulary normalization process of the OMOP CDM.
We presented this in the OHDSI Collaborator Showcase and were awarded a Best Community Contribution Award. You can see more details and a presentation here:
https://www.ohdsi.org/2021-global-symposium-showcase-28/
This looks like it will be very beneficial. A couple of questions:
- Is the table called NLP_SPAN or NOTE_SPAN? The documentation uses both.
- There is frequent use of a 'note_entity_id' column, but it is not clear what table (if any) this is a foreign key reference to.
- There is frequent use of a 'shard_id' column, but it is not clear what table (if any) this is a foreign key reference to. Or is this something to support parallel processing?
- The NLP_RELATIONSHIP table has a 'label_id' column in the documentation and a 'lable_concept_id' in the ERD. Will this be a reference to a concept entry in the OMOP concept table that will define the relationship of two found NLP_SPAN/NOTE_SPAN entries? (Aside: there are other mismatches between the documentation and the ERD.) But shouldn't the NLP_RELATIONSHIP be between NLP_CONCEPT entries? If the same span of text has multiple concepts attached to it, would it not be the case that you might want to relate one labeled concept to another labeled concept in another span but not all? For example, a span like 'T1' could map to an anatomic site and a staging variable and one might want to relate one of those labeled spans to another labeled span but not all?
- Is there a seed list of NLP relationship concepts that will be used? Would be nice for it to be constrained to an OMOP concept domain to enforce consistency?
- How is it envisioned that this model would be ETL'd or transformed into different entries in the clinical data tables, like CONDITION_OCCURRENCE, PROCEDURE_OCCURRENCE, MEASUREMENT, VISIT_OCCURRENCE, and so on? If it is desired that an NLP output placed in this model is promoted to the actual clinical data tables, is there any connection or provenance between this model and the standard clinical data model? Ultimately, all the OHDSI analytic/method tools will reference the standard clinical data tables, so to fully unleash the data extracted from text by NLP, the concepts need to be normalized to the same standard clinical events found in the discrete realm.
Thanks, Michael, for the thorough review. As you have noticed, we are renaming some of the elements of the schema, and we will be updating the ERD accordingly. Some answers to the documentation issues:
- it is NOTE_SPAN, I have corrected the docs
- it is the primary key of the NOTE_SPAN table, I renamed it to note_span_id
- our system integrates data from different hospitals, and we use partitions to integrate all the data; shard_id is our partition key, but this is specific to our deployment, so I removed it from the docs
- yes, we can rename it to relationship_concept_id or something like that
> But shouldn't the NLP_RELATIONSHIP be between NLP_CONCEPT entries? If the same span of text has multiple concepts attached to it, would it not be the case that you might want to relate one labeled concept to another labeled concept in another span but not all? For example, a span like 'T1' could map to an anatomic site and a staging variable and one might want to relate one of those labeled spans to another labeled span but not all?
In principle, our NLP system assigns the relationship between the text entities (NOTE_SPAN), but you are right that there could be different scenarios. This is the kind of discussion we would like to foster here.
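To make the two scenarios concrete, here is a minimal sketch of the 'T1' example. It is not the proposed schema: the class names, the `nlp_concept_id` and `label` fields, and all ID values are assumptions made up for illustration. It shows that a relationship stored between spans implicitly relates every concept on one span to every concept on the other, whereas a relationship stored between NLP_CONCEPT entries can pick out a single pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NlpConcept:
    """One concept attached to a span (hypothetical, simplified NLP_CONCEPT row)."""
    nlp_concept_id: int
    note_span_id: int
    label: str  # illustrative label, not an OMOP concept name


def concept_pairs_implied_by_span_links(span_links, concepts):
    """Expand span-to-span links into every concept pair they imply."""
    by_span = {}
    for concept in concepts:
        by_span.setdefault(concept.note_span_id, []).append(concept.nlp_concept_id)
    pairs = set()
    for span_1, span_2 in span_links:
        for left in by_span.get(span_1, []):
            for right in by_span.get(span_2, []):
                pairs.add((left, right))
    return pairs


# Span 1 is the text 'T1' with two concepts; span 2 is another span with one concept.
concepts = [
    NlpConcept(nlp_concept_id=10, note_span_id=1, label="anatomic site"),
    NlpConcept(nlp_concept_id=11, note_span_id=1, label="staging variable"),
    NlpConcept(nlp_concept_id=20, note_span_id=2, label="tumor"),
]

# Span-level relationship: one row linking span 1 to span 2.
span_level = concept_pairs_implied_by_span_links([(1, 2)], concepts)

# Concept-level relationship: only the staging concept is related to the tumor.
concept_level = {(11, 20)}

print(sorted(span_level))     # both concepts on 'T1' end up related
print(sorted(concept_level))  # only the intended pair
```

With span-level links, a consumer cannot tell which of the two 'T1' concepts the relationship was meant for; the concept-level form keeps that information at the cost of more rows.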
> Is there a seed list of NLP relationship concepts that will be used? Would be nice for it to be constrained to an OMOP concept domain to enforce consistency?
We can share the concepts we are using in our current deployments, and we agree that we could constrain them to a specific domain. We can discuss which domain that could be and how to include the necessary concept IDs.
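As a rough illustration of what constraining relationship concepts to a domain could look like, the sketch below checks each relationship row against a lookup of concept domains. Everything specific here is an assumption: the domain name "NLP Relationship", the concept IDs, and the `relationship_concept_id` column name (the rename suggested above) are placeholders, not agreed values.

```python
def find_out_of_domain_relationships(relationship_rows, concept_domains, allowed_domain):
    """Return the rows whose relationship concept is missing or not in the allowed domain."""
    offending = []
    for row in relationship_rows:
        domain = concept_domains.get(row["relationship_concept_id"])
        if domain != allowed_domain:
            offending.append(row)
    return offending


# Placeholder concept_id -> domain_id lookup (would come from the OMOP concept table).
concept_domains = {
    9001: "NLP Relationship",  # hypothetical domain name
    9002: "NLP Relationship",
    4000: "Condition",
}

relationship_rows = [
    {"note_span_id_1": 1, "note_span_id_2": 2, "relationship_concept_id": 9001},
    {"note_span_id_1": 2, "note_span_id_2": 3, "relationship_concept_id": 4000},
    {"note_span_id_1": 3, "note_span_id_2": 4, "relationship_concept_id": 7777},
]

bad_rows = find_out_of_domain_relationships(
    relationship_rows, concept_domains, allowed_domain="NLP Relationship"
)
print(bad_rows)
```

A check like this could run as part of data quality validation once the seed list and the target domain are agreed.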
> How is it envisioned that this model would be ETL'd or transformed into different entries in the clinical data tables? Like CONDITION_OCCURRENCE, PROCEDURE_OCCURRENCE, MEASUREMENT, VISIT_OCCURRENCE, and so on?
Currently, we integrate the found concepts that have a specific context (Affirmative, Present, Certain, etc.) into the corresponding table, chosen by the domain_id of the assigned concept, setting the *_concept_type_id to
OMOP4976931 | NLP | Type Concept
We envision using the FACT_RELATIONSHIP to link those entries to the corresponding NOTE_SPAN.
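The promotion step described above can be sketched roughly as follows. This is an assumption-laden illustration, not our actual ETL: the context keys (`polarity`, `temporality`, `certainty`), the input row layout, and the numeric IDs are hypothetical, and the NLP type concept ID is passed in as a parameter because it has to be looked up from the vocabulary entry quoted above. The target column names follow the standard OMOP CDM tables (for example `condition_type_concept_id`).

```python
# domain_id -> (target table, concept column, type concept column)
DOMAIN_TO_TARGET = {
    "Condition": ("condition_occurrence", "condition_concept_id", "condition_type_concept_id"),
    "Procedure": ("procedure_occurrence", "procedure_concept_id", "procedure_type_concept_id"),
    "Measurement": ("measurement", "measurement_concept_id", "measurement_type_concept_id"),
}

# Only concepts found in this context are promoted (hypothetical key names).
REQUIRED_CONTEXT = {"polarity": "Affirmative", "temporality": "Present", "certainty": "Certain"}


def promote_nlp_concepts(nlp_rows, nlp_type_concept_id):
    """Route qualifying NLP concepts to clinical tables and record span provenance."""
    clinical_rows = []
    provenance = []  # candidates for FACT_RELATIONSHIP rows once real IDs are assigned
    for row in nlp_rows:
        if any(row.get(key) != value for key, value in REQUIRED_CONTEXT.items()):
            continue  # negated, historical, or uncertain mentions are not promoted
        target = DOMAIN_TO_TARGET.get(row["domain_id"])
        if target is None:
            continue  # domain not handled in this sketch
        table, concept_col, type_col = target
        clinical_rows.append({
            "table": table,
            "person_id": row["person_id"],
            concept_col: row["concept_id"],
            type_col: nlp_type_concept_id,
        })
        provenance.append({
            "clinical_row_index": len(clinical_rows) - 1,
            "note_span_id": row["note_span_id"],
        })
    return clinical_rows, provenance


# Placeholder input; all IDs are made up.
nlp_rows = [
    {"person_id": 1, "note_span_id": 100, "concept_id": 5001, "domain_id": "Condition",
     "polarity": "Affirmative", "temporality": "Present", "certainty": "Certain"},
    {"person_id": 1, "note_span_id": 101, "concept_id": 5002, "domain_id": "Condition",
     "polarity": "Negated", "temporality": "Present", "certainty": "Certain"},
    {"person_id": 2, "note_span_id": 102, "concept_id": 5003, "domain_id": "Procedure",
     "polarity": "Affirmative", "temporality": "Present", "certainty": "Certain"},
]

clinical_rows, provenance = promote_nlp_concepts(nlp_rows, nlp_type_concept_id=-1)
for clinical_row in clinical_rows:
    print(clinical_row)
```

Each provenance entry pairs a promoted clinical row with its source `note_span_id`; after the clinical rows are inserted and receive their primary keys, those pairs are what would be written to FACT_RELATIONSHIP.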
Hi @clairblacketer, we would love to see this go through. How can we push this extension forward? Should we present it in a WG? Thanks for your time and guidance.