Use this ETL as a way to provide MIMIC in OMOP directly on the Physionet website

Open vojtechhuser opened this issue 5 years ago • 34 comments

This ETL lets each local user download MIMIC and run the conversion themselves, so the same conversion is repeated at many sites.

How about converting once and letting sites download the converted dataset?

This would save MIMIC users some effort and make MIMIC more widely used (and published about, so it gets credit).

vojtechhuser avatar Jul 25 '18 16:07 vojtechhuser

Great idea!

dsontag avatar Jul 25 '18 21:07 dsontag

Thank you for this great work!

chandryou avatar Jul 26 '18 00:07 chandryou

Thanks for the suggestion @vojtechhuser. The mapping needs some work, but sharing the transformed dataset is something that we'd like to do once we're happy with it. We haven't been able to give this project the time it needs just because of competing priorities (research tasks, rebuilding PhysioNet, preparing the next release of MIMIC, etc), but it's on our to-do list.

tompollard avatar Jul 26 '18 14:07 tompollard

Any updates on this? We would like to use MIMIC-III in a Data Quality tutorial at the OHDSI symposium and desperately need someone who has run the code from this repo and can collaborate with us.

vojtechhuser avatar Sep 05 '18 13:09 vojtechhuser

As soon as we publish something on PhysioNet we have to be able to support it and the ETL isn't ready. We are currently building ETLs for other ICU datasets so that our model doesn't overfit to MIMIC.

If by data quality you mean running Achilles, then I have done that, but the results aren't that useful on MIMIC because of the unique data structure and deidentification approach (e.g. deidentified ages ~ 300).
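To illustrate the age artifact mentioned above: in MIMIC-III, dates of birth for patients older than 89 are shifted so that their computed age comes out at roughly 300 years. Here is a minimal Python sketch of how such ages could be flagged, using the same 150-year cut-off that Achilles reports as an error. The dates are made up and the helper names (`age_in_years`, `is_deidentified_age`) are hypothetical, not part of this repo.

```python
from datetime import date

# Threshold mirrors the Achilles check "should not have age > 150".
MAX_PLAUSIBLE_AGE = 150


def age_in_years(birth: date, reference: date) -> int:
    """Whole years elapsed between birth and reference (hypothetical helper)."""
    years = reference.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (birth.month, birth.day):
        years -= 1
    return years


def is_deidentified_age(birth: date, first_observation: date) -> bool:
    """True when the computed age is implausible and likely a deidentification artifact."""
    return age_in_years(birth, first_observation) > MAX_PLAUSIBLE_AGE


# Made-up rows: patient id -> (date of birth, start of first observation period).
patients = {
    "a": (date(2100, 5, 1), date(2165, 3, 10)),  # ordinary adult patient
    "b": (date(1850, 1, 1), date(2150, 6, 1)),   # shifted date of birth
}

flags = {pid: is_deidentified_age(b, o) for pid, (b, o) in patients.items()}
print(flags)
```

A tutorial could use a flag like this to separate expected deidentification artifacts from genuine data quality problems.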

alistairewj avatar Sep 05 '18 15:09 alistairewj

@alistairewj the use @vojtechhuser is referring to is for a tutorial on how to use Achilles and two other data quality tool sets designed for use with OMOP data sources. The version of MIMIC we use doesn't need to be free of defects. It just needs to be usable - i.e. it won't break the tools because there are empty or missing tables or missing required variables. To the extent that it will resemble a real world data set with typical data quality issues that the tools can identify, it will meet our needs. Before I spend the effort to get this to run, can you give your sense of how likely it is to meet those needs?

AEW0330 avatar Sep 07 '18 17:09 AEW0330

MIMIC is a real world dataset, from a real hospital, but I don't know if I can fully answer your question without knowing the ins and outs of the tools you'll use. The ETL is incomplete; there are still a lot of unmapped concepts. I ran Achilles a few months ago and the output is hopefully informative for you (see below). You'll notice that there are a lot of reported "errors" around times/dates due to our deidentification approach (we randomly shift patient data into the future, therefore doing any analysis which aggregates distinct patients over time is flawed).

Type Message
ERROR 3-Number of persons by year of birth; should not have year of birth in the future, (n=44,374)
ERROR 101-Number of persons by age, with age at first observation period; should not have age > 150, (n=1,991)
ERROR 400-Number of persons with at least one condition occurrence, by condition_concept_id; 2 concepts in data are not in vocabulary
ERROR 400-Number of persons with at least one condition occurrence, by condition_concept_id; 228 concepts in data are not in correct vocabulary
ERROR Death event outside observation period, 510-Number of death records outside valid observation period; count (n=8,980) should not be > 0
ERROR 600-Number of persons with at least one procedure occurrence, by procedure_concept_id; 39 concepts in data are not in correct vocabulary
ERROR 610-Number of procedure occurrence records outside valid observation period; count (n=883) should not be > 0
ERROR 700-Number of persons with at least one drug exposure, by drug_concept_id; 4 concepts in data are not in correct vocabulary
ERROR 706 - Distribution of age by drug_concept_id (count = 1); min value should not be negative
ERROR 710-Number of drug exposure records outside valid observation period; count (n=12,437,292) should not be > 0
ERROR 711-Number of drug exposure records with end date < start date; count (n=15,922) should not be > 0
ERROR 717 - Distribution of quantity by drug_concept_id (count = 7); min value should not be negative
ERROR 806 - Distribution of age by observation_concept_id (count = 2); min value should not be negative
ERROR 810-Number of observation records outside valid observation period; count (n=85,787) should not be > 0
ERROR 814-Number of observation records with no value (numeric, string, or concept); count (n=99,839) should not be > 0
NOTIFICATION Unmapped data over percentage threshold in:Measurement
NOTIFICATION Count of unmapped source values exceeds threshold in: drug_exposure
NOTIFICATION [GeneralPopulationOnly] Count of distinct specialties of providers in the PROVIDER table is below threshold
NOTIFICATION No body weight data in MEASUREMENT table (under concept_id 3025315 (LOINC code 29463-7))
NOTIFICATION Unmapped data over percentage threshold in:Condition
NOTIFICATION Unmapped data over percentage threshold in:Procedure
NOTIFICATION Unmapped data over percentage threshold in:DrugExposure
NOTIFICATION Unmapped data over percentage threshold in:Observation
WARNING 5-Number of persons by ethnicity; data with unmapped concepts
WARNING 101-Number of persons by age, with age at first observation period; should not have age > 125, (n=1,991)
WARNING 400-Number of persons with at least one condition occurrence, by condition_concept_id; data with unmapped concepts
WARNING 402-Number of persons by condition occurrence start month, by condition_concept_id; 2 concepts have a 100% change in monthly count of events
WARNING 420-Number of condition occurrence records by condition occurrence start month; theres a 100% change in monthly count of events
WARNING 512-Distribution of time from death to last drug (count = 1); max value should not be positive, otherwise its a zombie with data >1mo after death
WARNING 514-Distribution of time from death to last procedure (count = 1); max value should not be positive, otherwise its a zombie with data >1mo after death
WARNING 515-Distribution of time from death to last observation (count = 1); max value should not be positive, otherwise its a zombie with data >1mo after death
WARNING 600-Number of persons with at least one procedure occurrence, by procedure_concept_id; data with unmapped concepts
WARNING 602-Number of persons by procedure occurrence start month, by procedure_concept_id; 6 concepts have a 100% change in monthly count of events
WARNING 620-Number of procedure occurrence records by procedure occurrence start month; theres a 100% change in monthly count of events
WARNING 700-Number of persons with at least one drug exposure, by drug_concept_id; data with unmapped concepts
WARNING 702-Number of persons by drug exposure start month, by drug_concept_id; 22 concepts have a 100% change in monthly count of events
WARNING 717-Distribution of quantity by drug_concept_id (count = 83); max value should not be > 600
WARNING 720-Number of drug exposure records by drug exposure start month; theres a 100% change in monthly count of events
WARNING 800-Number of persons with at least one observation occurrence, by observation_concept_id; data with unmapped concepts
WARNING 802-Number of persons by observation occurrence start month, by observation_concept_id; 7 concepts have a 100% change in monthly count of events
WARNING 820-Number of observation records by observation start month; theres a 100% change in monthly count of events
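
The effect of the date shifting described above can be sketched in Python. This is an illustration under assumptions, not the actual MIMIC deidentification code: the offset range, the seed, and the function name `shift_patient` are made up. Each patient gets one random offset applied to all of their dates, so intervals within a patient are preserved while calendar dates across patients no longer line up.

```python
import random
from datetime import date, timedelta


def shift_patient(events, rng):
    """Shift all of one patient's dates by a single random offset.

    The 100-200 year range is an assumption chosen for illustration.
    """
    offset = timedelta(days=rng.randint(100 * 365, 200 * 365))
    return [d + offset for d in events]


rng = random.Random(0)

# Two made-up patients, each admitted in 2010 and discharged 5 days later.
real = {
    "p1": [date(2010, 1, 1), date(2010, 1, 6)],
    "p2": [date(2010, 7, 1), date(2010, 7, 6)],
}
shifted = {pid: shift_patient(ev, rng) for pid, ev in real.items()}

# Within-patient intervals survive the shift unchanged.
stays = {pid: (ev[1] - ev[0]).days for pid, ev in shifted.items()}

# Admission years after shifting: all in the future, and in general no longer
# comparable across patients, so per-year or per-month counts are not meaningful.
admission_years = {pid: ev[0].year for pid, ev in shifted.items()}

print(stays)
print(admission_years)
```

This is why checks such as "year of birth in the future" or "100% change in monthly count of events" fire on MIMIC, while per-patient durations remain usable.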

alistairewj avatar Sep 07 '18 18:09 alistairewj

@alistairewj this is helpful. Thanks.

AEW0330 avatar Sep 08 '18 11:09 AEW0330

Any updates on sharing a complete version of MIMIC in OMOP on PhysioNet?

Especially now in COVID-19 times, I would very much like to work with a proper CDM at home, as I can't access my organisation's CDM. Alternative databases, like SynPUF, are too limited for the analyses I want to test.

Thank you, Tom

tomseinen avatar Apr 16 '20 11:04 tomseinen

We would be happy to share an OMOP version of MIMIC-III on PhysioNet. See also https://github.com/MIT-LCP/mimic-code/issues/725.

I suggest that someone from the OMOP community takes responsibility for putting together a submission to PhysioNet. The person should:

  • make efforts to describe the dataset clearly.
  • include a snapshot of the code used to generate the dataset.
  • ensure that people who have been involved in the work are included as contributors.

Once we receive a well described version of the dataset, we can move forward with publication. For instructions on submitting the project, see: https://physionet.org/about/publish/#sharing

tompollard avatar Apr 17 '20 19:04 tompollard

That is great. I will work on a revised proposal that I am happy to revise multiple times until I hit all your requirements to the satisfaction of the PhysioNet reviewing team. (tagging @parisni )

vojtechhuser avatar Apr 27 '20 23:04 vojtechhuser

Hi all. Good news. I would be pleased to give some help to make this possible.

parisni avatar May 01 '20 22:05 parisni

I started a draft today.

I will add @parisni and other important people.

image

vojtechhuser avatar May 02 '20 15:05 vojtechhuser

I plan to use (let me know if that is wrong) image

vojtechhuser avatar May 02 '20 15:05 vojtechhuser

@vojtechhuser those access settings are correct. Not sure about "OMOP shaped data" as the title of the dataset, but presumably this is a placeholder!

tompollard avatar May 04 '20 17:05 tompollard

The title has been changed now. Please let me know who else wants to be invited (or does not want to be). So far, I have

image

vojtechhuser avatar May 04 '20 17:05 vojtechhuser

What do people think about the number of projects? One project will be for full data. Should we create another project that converts Demo data? (I am happy to do what MIT tells me.)

image

vojtechhuser avatar May 04 '20 17:05 vojtechhuser

I would like an invite! I would love to be able to skip ETLing the data and get it in the OMOP format from the source.

jmbanda avatar May 04 '20 17:05 jmbanda

If published as a credentialed project then it would be accessible to MIMIC users. The invite mentioned is for the authors of the project, i.e. those who helped create the ETL.

alistairewj avatar May 04 '20 18:05 alistairewj

"One project will be for full data. Should we create another project that converts Demo data?"

Yes, I think separate projects for each dataset is best. One of the benefits is that the MIMIC demo is open access (https://physionet.org/content/mimiciii-demo/1.4/), so the same permissions could be applied to the OMOP version.

tompollard avatar May 04 '20 18:05 tompollard

Excellent point Tom.

AEW0330 avatar May 04 '20 19:05 AEW0330

Based on the guidance, I have now created a sister "demo" project and invited folks there too.

image

vojtechhuser avatar May 04 '20 20:05 vojtechhuser

I'm seeing whether the N3C project can support some of this work, i.e. pay for some of people's time and get more hands on deck. Who has a guess at the amount of work involved?

AEW0330 avatar May 04 '20 21:05 AEW0330

The folks leading the National Covid Cohort Collaborative (N3C) seem to have some leeway with the unspecified cash allocations that fund it, and they indicate potential interest in supporting this. So I'm eager to respond to their question about the amount of work. I'd take a guess myself, but I'm the least fit amongst this group to do so.

AEW0330 avatar May 04 '20 22:05 AEW0330

Interesting, thanks Andrew. @parisni @alistairewj @aparrot89 any thoughts on whether we should be putting in additional work to improve the mapping before the dataset is shared?

tompollard avatar May 04 '20 22:05 tompollard

Hi, I am interested to be part of this project and am already a registered user of Physionet.

SSMK-wq avatar May 05 '20 17:05 SSMK-wq

Formal funding would be great.

See notes in this shared folder: https://drive.google.com/open?id=1j-x-rwuYJr2nIs5zxCW6ST_Q-vPc1tfN

For folks willing to help, please put your name next to a table that you volunteer to tackle (port to Google BigQuery (GBQ) or improve).

image

vojtechhuser avatar May 05 '20 19:05 vojtechhuser

I propose a plan where multiple versions are released. We need initial versions to make people aware of it, e.g., v0.1 with some tables. After that, a later version (e.g., v1.0) can use the existing mapping, and v2.0 can come with improved mapping. Perfect should not be the enemy of good enough.

vojtechhuser avatar May 07 '20 03:05 vojtechhuser

I can't say I agree with releasing an incomplete dataset on PhysioNet and justifying the lack of completeness with a "v0.1" tag.

alistairewj avatar May 07 '20 15:05 alistairewj

The Google link permission has been fixed. You can sign up for individual tables again here: https://drive.google.com/open?id=1j-x-rwuYJr2nIs5zxCW6ST_Q-vPc1tfN (file "central notes")

vojtechhuser avatar May 14 '20 22:05 vojtechhuser