Are there duplicate patients between the CareVue part of MIMIC III and MIMIC IV?

Open ningyile opened this issue 2 years ago • 5 comments

Prerequisites

  • [x] Put an X between the brackets on this line if you have done all of the following:
    • Checked the online documentation: https://mimic.mit.edu/
    • Checked that your issue isn't already addressed: https://github.com/MIT-LCP/mimic-code/issues?utf8=%E2%9C%93&q=

Description

Description of the issue, including:

  • Recently I conducted a study to predict sepsis mortality. To include as many patients as possible, I attempted to merge patients from MIMIC III (2001-2012), MIMIC IV (2008-2019), and the eICU database into one cohort. I noticed that MIMIC III and MIMIC IV contain some of the same patients (namely those from 2008-2012), but because 'ICUSTAY_ID' (MIMIC III) and 'stay_id' (MIMIC IV) have been de-identified, there is currently no way to resolve the partially overlapping patients except by going back to the raw source data.

  • When I reviewed the MIMIC III documentation, however, I noticed that its data come from two systems, following the order in which the Beth Israel Deaconess Medical Center (BIDMC) used them: CareVue (which archived data from 2001 - 2008) and Metavision (which replaced CareVue in 2008 and continued to 2012). The two systems store data quite differently in MIMIC III, as seen in the 'inputevents_cv' and 'inputevents_mv' tables. In addition, we found that the storage format of 'inputevents_mv' in MIMIC III is similar to that of 'inputevents' in MIMIC IV, and both tables come from the Metavision ICU database. This indicates that the patients overlapping between MIMIC III (2001-2012) and MIMIC IV (2008-2019) fall in 2008 to 2012 (Metavision replaced CareVue in MIMIC III in 2008).

  • Therefore, I supposed that there is no duplication between patients in CareVue part of MIMIC III and MIMIC IV. Can the target cohort in our study be drawn from the CareVue part of MIMIC III, MIMIC IV, and the eICU database, so that duplications between the MIMIC III and MIMIC IV databases are excluded?

  • References: similar issue 1, "Linking between databases (MIMIC 3 and MIMIC 4)", and similar issue 2, "subject_ids between mimic iii and mimic iv".

ningyile avatar Jul 04 '22 17:07 ningyile

  • Therefore, I supposed that there is no duplication between patients in CareVue part of MIMIC III and MIMIC IV

This is where the logic breaks down :) There are people who were admitted in 2001-2008, and also later admitted from 2008 - 2012. For now, there's no way for you to guarantee that an extraction from MIMIC-III has entirely independent patients compared to MIMIC-IV.

I made a new PhysioNet project called MIMIC-III CareVue which explicitly removes the MIMIC-IV patients. This would allow merging the datasets - albeit, it creates a slightly acausal dataset as you remove patients in 2001 - 2008 using "future" knowledge that they are admitted from 2008 - 2019. It's currently unsubmitted at PhysioNet (nearly ready!), but once reviewed and published I'll reference it here.
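
To make the point concrete, here is a minimal sketch of how one might look for subjects who span both systems inside MIMIC-III alone. It assumes an `icustays` table with `subject_id`, `icustay_id`, and `dbsource` columns, and it assumes `dbsource` takes the values 'carevue', 'metavision', or 'both'. The rows are toy values made up for illustration, not real MIMIC data, and the function name is hypothetical.

```python
import sqlite3

# Toy rows, NOT real MIMIC data: (subject_id, icustay_id, dbsource).
TOY_ICUSTAYS = [
    (1, 101, "carevue"),
    (1, 102, "metavision"),  # same subject returns after the system switch
    (2, 201, "carevue"),
    (3, 301, "metavision"),
    (4, 401, "both"),        # assumed label for a stay charted in both systems
]


def subjects_spanning_systems(rows):
    """Return subject_ids that have CareVue data and MetaVision data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE icustays "
        "(subject_id INTEGER, icustay_id INTEGER, dbsource TEXT)"
    )
    conn.executemany("INSERT INTO icustays VALUES (?, ?, ?)", rows)
    # In SQLite a comparison evaluates to 0 or 1, so SUM counts matching stays.
    query = """
        SELECT subject_id
        FROM icustays
        GROUP BY subject_id
        HAVING SUM(dbsource IN ('carevue', 'both')) > 0
           AND SUM(dbsource IN ('metavision', 'both')) > 0
        ORDER BY subject_id
    """
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


print(subjects_spanning_systems(TOY_ICUSTAYS))
```

Note the limit of this check: it only sees MetaVision stays that fall within MIMIC-III (2008 - 2012). A CareVue patient who returned later and appears only in MIMIC-IV cannot be found this way, which is why filtering on `dbsource` alone does not guarantee independent patients.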

alistairewj avatar Jul 13 '22 13:07 alistairewj

This was unexpected to me. I noticed that the MIMIC III v1.4 dataset combines the CV (2001 - 2008) and MV (2008 - 2012) information systems. But if MIMIC III can combine CV and MV, why can't MIMIC IV (which uses the MV information system entirely) be merged with the CV part of MIMIC III? Anyway, your new MIMIC-III CareVue project is exciting work, and I look forward to its release on PhysioNet.

ningyile avatar Jul 20 '22 06:07 ningyile

Is it possible that some patients admitted in 2008 had data in both the CV and MV systems while the hospital was switching between them? I remember that when I was writing my SQL code last year, I found some patient data in both the CV and MV related tables. If that's the case, it makes sense.

ningyile avatar Jul 20 '22 06:07 ningyile

We could have merged CareVue with MetaVision for MIMIC-IV, but it was becoming increasingly more work for increasingly less value. Keeping CareVue around basically doubles the effort of all data extraction work, not to mention the build code.

In MIMIC-IV it's only MetaVision. Good question about the overlap - there shouldn't be an appreciable number as we only include subjects who have an ICU stay in the MetaVision system or an ED stay, but as always I'm sure there are edge cases here and there.

alistairewj avatar Jul 20 '22 14:07 alistairewj

I fully understand the tremendous amount of work involved in maintaining such a large database. I went through the previous issues on the overlap between MIMIC III and MIMIC IV and confirmed that nobody had asked about overlapping patients between CV and MIMIC IV before I opened this one. I thought I could merge CV and MIMIC IV by myself, but now I realize that I have to wait for your new project on PhysioNet. Thank you for your detailed answers and excellent work! I'm sure many critical care researchers are looking forward to the release of the MIMIC-III CareVue project, which removes the MIMIC-IV patients.

ningyile avatar Jul 20 '22 14:07 ningyile

Today I saw that the CareVue (CV) subset of the MIMIC III database you mentioned above was released on PhysioNet a few days ago: MIMIC-III Clinical Database CareVue subset. I read the methods of the newly released database, and it seems that you identified the subject_id, hadm_id, and icustay_id of the targeted (namely non-overlapping) patients from the CV system (dbsource = 'carevue') and then generated a separate sub-database. It doesn't seem to conflict with what I mentioned in my question about extracting data from CV alone (namely that there is no duplication between patients in CareVue part of MIMIC III and MIMIC IV). Anyway, many thanks and respect to Alistair Johnson for the hard work on the new subset database. There is no doubt that a separate CV subset database simplifies our workflow compared with extracting CV data from the full MIMIC III database ourselves.

ningyile avatar Oct 02 '22 09:10 ningyile

@alistairewj

ningyile avatar Oct 02 '22 09:10 ningyile

I specifically noted the methods of this new released database, it seems that you identified the targeted patients' (namely non overlapping patient) subject_id, hadm_id, and icustay_id from the CV system (dbsource = 'carevue') and then generate a separate sub-database.

Yes, that's correct, but we used non-public information to map MIMIC-IV subject_id to MIMIC-III subject_id in order to do so. As a result, the published CareVue subset is a straightforward extraction of MIMIC-III data for a subset of subject_id. Internally, our build process for MIMIC-IV is also much simpler having removed CV specific code.

Hopefully the new project enables the type of cross-database analysis you are interested in while minimizing data leakage. Good luck!
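
For readers assembling such a cross-database cohort, a minimal sketch of one bookkeeping step follows. It assumes that patient identifiers from different databases are unrelated (the reply above notes that linking MIMIC-IV and MIMIC-III subject_id required non-public information), so an equal number in two sources does not mean the same person. The function name, source labels, and IDs are hypothetical toy values, not real data.

```python
def build_combined_cohort(cohorts):
    """Merge per-database cohorts, prefixing each ID with its source label.

    `cohorts` maps a source label to a list of patient identifiers
    taken from that database.
    """
    combined = []
    for source, patient_ids in cohorts.items():
        for patient_id in patient_ids:
            combined.append({
                "source": source,
                # Namespaced key: identical raw IDs from different
                # databases stay distinct after the merge.
                "patient_key": f"{source}:{patient_id}",
            })
    return combined


# Toy IDs, NOT real data. 10001 appears in two sources on purpose.
TOY_COHORTS = {
    "mimic3_carevue": [10001, 10002],
    "mimic4": [10001, 20003],
    "eicu": [30004],
}

cohort = build_combined_cohort(TOY_COHORTS)
keys = [row["patient_key"] for row in cohort]
print(keys)
```

Keeping the `source` field also makes it easy to stratify or hold out a whole database later. This step only prevents ID collisions; it does not detect the same person appearing in two databases, which is what the CareVue subset addresses.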

alistairewj avatar Oct 03 '22 15:10 alistairewj