Improve string delimiter detection in mapping pipeline

Open callahantiff opened this issue 3 years ago • 1 comment

Describe the Bug

The mapping pipeline assumes that all concept synonyms and ancestor information are input in an aggregated format, with each aggregated concept separated by a | delimiter. This assumption is brittle and should be relaxed. The specs for the input data can be found here: resources/clinical_data/README.md

EXAMPLE:
Input Data
The CONCEPT_SYNONYM column below displays data in the expected input format

CONCEPT_ID: 37018594
CONCEPT_SOURCE_CODE: snomed:80251000119104
CONCEPT_LABEL: Complement level below reference range
CONCEPT_SOURCE_LABEL: Complement level below reference range
CONCEPT_SYNONYM: Complement level below reference range | Complement level below reference range (finding)

Example of Data that Breaks Assumptions:
The CONCEPT_SYNONYM column below displays data in an unexpected input format (i.e., it mixes two delimiters, | and ;)

CONCEPT_ID: 40771573
CONCEPT_SOURCE_CODE: loinc:69052-9
CONCEPT_LABEL: Flow cytometry specialist review of results
CONCEPT_SYNONYM: Flow cytometry specialist review of results | Flow cytometry specialist review | Dynamic; Impression; Impression/interpretation of study; Impressions; Interp; Interpretation; Misc; Miscellaneous; Narrative; Other; Point in time; Random; Report; To be specified in another part of the message; Unspecified
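
To make the failure concrete, here is a minimal sketch of what happens when a mixed-delimiter string is split on | alone. It is an illustration only, not the code in omop2obo/clinical_concept_annotator.py, and the synonym string is a shortened version of the LOINC example above.

```python
# Shortened version of the LOINC CONCEPT_SYNONYM value shown above.
synonyms = (
    "Flow cytometry specialist review of results | "
    "Flow cytometry specialist review | "
    "Dynamic; Impression; Interp"
)

# Splitting on the pipe only, as the single-delimiter assumption implies.
pipe_only = [s.strip() for s in synonyms.split("|")]

# The third entry still bundles several ';'-separated synonyms together,
# so none of them can be found by an exact-match lookup.
for entry in pipe_only:
    print(repr(entry))
```

The last entry comes back as one long string rather than three separate synonyms, which is why these mappings are missed in the exact-match step.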

Impact Level

LOW - the string similarity mapping pipeline correctly handles all delimiter types, which allows it to recover the mappings missed in the exact match part of the pipeline.

Impacted Scripts

omop2obo/clinical_concept_annotator.py

Solution

  • [ ] Add a parameter to pass delimiter type
  • [ ] Improve tests to better vet delimiter handling
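
One possible shape for the delimiter parameter is sketched below. The function name split_aggregated and its delimiters argument are hypothetical and do not exist in OMOP2OBO; the sketch only assumes that aggregated values are plain strings.

```python
import re
from typing import Iterable, List


def split_aggregated(value: str, delimiters: Iterable[str] = ("|",)) -> List[str]:
    """Split an aggregated string on any of the given delimiters.

    Hypothetical helper: the default keeps the current single-pipe behaviour,
    while callers with mixed input can pass e.g. ("|", ";").
    """
    delims = [d for d in delimiters if d]
    if not delims:
        # No delimiter supplied: treat the whole value as a single entry.
        stripped = value.strip()
        return [stripped] if stripped else []
    # Escape each delimiter so characters like '|' are matched literally.
    pattern = "|".join(re.escape(d) for d in delims)
    # Trim whitespace around each piece and drop empty pieces.
    return [part.strip() for part in re.split(pattern, value) if part.strip()]


mixed = "Flow cytometry specialist review | Dynamic; Impression; Interp"
print(split_aggregated(mixed))               # pipe only (default)
print(split_aggregated(mixed, ("|", ";")))   # pipe and semicolon
```

Keeping | as the default would leave existing inputs unaffected, and the two calls above double as a starting point for tests that cover both the expected and the mixed-delimiter format.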

callahantiff, Oct 09 '20 16:10

  • [x] Temporary workaround provided for release v1.0, which handles the irregular LOINC synonym strings in the SQL query

callahantiff, Oct 22 '20 19:10