OMOP2OBO
Improve string delimiter detection in mapping pipeline
Describe the Bug
The pipeline assumes that all concept synonyms and ancestor information are input in an aggregated format, with each aggregated concept separated by a `|` delimiter. That assumption is brittle and should be relaxed. Examples of specs for input data can be found here: resources/clinical_data/README.md
EXAMPLE:
Input Data
The `CONCEPT_SYNONYM` column below displays data in the expected input format
CONCEPT_ID | CONCEPT_SOURCE_CODE | CONCEPT_LABEL | CONCEPT_SOURCE_LABEL | CONCEPT_SYNONYM |
---|---|---|---|---|
37018594 | snomed:80251000119104 | Complement level below reference range | Complement level below reference range | Complement level below reference range \| Complement level below reference range (finding) |
Example of Data that Breaks Assumptions:
The `CONCEPT_SYNONYM` column below displays data in an unexpected input format (i.e. two types of delimiters, `|` and `;`)
CONCEPT_ID | CONCEPT_SOURCE_CODE | CONCEPT_LABEL | CONCEPT_SYNONYM |
---|---|---|---|
40771573 | loinc:69052-9 | Flow cytometry specialist review of results | Flow cytometry specialist review of results \| Flow cytometry specialist review \| Dynamic; Impression; Impression/interpretation of study; Impressions; Interp; Interpretation; Misc; Miscellaneous; Narrative; Other; Point in time; Random; Report; To be specified in another part of the message; Unspecified |
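
To make the failure concrete, here is a minimal sketch of how a split that assumes only `|` treats a synonym string like the one above (abbreviated here), next to a split that accepts several delimiters. The function name `split_synonyms` and its `delimiters` parameter are hypothetical and are not taken from `omop2obo/clinical_concept_annotator.py`.

```python
import re

# Abbreviated version of the LOINC synonym string from the table above.
synonyms = ("Flow cytometry specialist review of results | "
            "Flow cytometry specialist review | Dynamic; Impression; Interp")

# Splitting on "|" only leaves the ";"-separated synonyms fused together.
naive = [s.strip() for s in synonyms.split("|")]


def split_synonyms(value, delimiters=("|", ";")):
    """Split an aggregated string on any of the given delimiters (hypothetical helper)."""
    # Escape each delimiter so characters like "|" are matched literally.
    pattern = "|".join(re.escape(d) for d in delimiters)
    return [part.strip() for part in re.split(pattern, value) if part.strip()]


print(naive[-1])                 # the last item still contains ";"
print(split_synonyms(synonyms))  # five separate synonyms
```

Note that splitting on `;` unconditionally assumes no synonym legitimately contains a semicolon, which may not hold for every vocabulary.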
Impact Level
LOW - the string similarity mapping pipeline correctly handles all delimiter types, which allows it to recover mappings missed in the exact match part of the pipeline.
Impacted Scripts
omop2obo/clinical_concept_annotator.py
Solution
- [ ] Add a parameter to pass delimiter type
- [ ] Improve tests to better vet delimiter handling
- [x] Temporary workaround provided for release `v1.0`, which handles weird LOINC synonym strings in the SQL query
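
One possible shape for the first two checklist items is sketched below. It assumes a hypothetical helper, `parse_aggregated`, that takes an optional `delimiter` argument and falls back to every candidate delimiter found in the string when none is passed. The helper name, the `candidates` default, and the fallback behaviour are all illustrative assumptions, not the existing API.

```python
import re
from typing import List, Optional, Sequence


def parse_aggregated(value: str,
                     delimiter: Optional[str] = None,
                     candidates: Sequence[str] = ("|", ";")) -> List[str]:
    """Split an aggregated string into items (hypothetical helper).

    If `delimiter` is given, only that delimiter is used. Otherwise every
    candidate delimiter that occurs in `value` is used.
    """
    if delimiter is not None:
        found = [delimiter]
    else:
        found = [d for d in candidates if d in value]
    if not found:
        # No delimiter present: treat the whole string as a single item.
        return [value.strip()] if value.strip() else []
    # Escape delimiters so they are matched literally by the regex.
    pattern = "|".join(re.escape(d) for d in found)
    return [p.strip() for p in re.split(pattern, value) if p.strip()]


# Checks of the kind the improved tests could include.
assert parse_aggregated("a | b") == ["a", "b"]
assert parse_aggregated("a | b; c") == ["a", "b", "c"]
assert parse_aggregated("a | b; c", delimiter="|") == ["a", "b; c"]
assert parse_aggregated("") == []
```

Passing the delimiter explicitly keeps the current behaviour available for inputs that follow the documented spec, while the fallback covers mixed-delimiter strings like the LOINC example.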