How much of MIMIC-IV is MIMIC-III?

Open tsteffek opened this issue 2 years ago • 2 comments

Prerequisites

  • [X] Put an X between the brackets on this line if you have done all of the following:
    • Checked the online documentation: https://mimic.mit.edu/
    • Checked that your issue isn't already addressed: https://github.com/MIT-LCP/mimic-code/issues?utf8=%E2%9C%93&q=

Description

There are multiple issues regarding how much MIMIC-III and MIMIC-IV overlap. Most of them aimed to merge the two datasets, and that problem is now solved by the CareVue dataset. However, I had trouble finding a good solution for essentially the opposite operation: instead of merging III and IV, I'm looking for a way to remove data from IV that was already in III. The background is that our team trained ML models on III and would now like to verify our results on the unseen IV data. I'm sure you're all aware of the problem that results from evaluating on data that has been used for training.

Since there does not seem to be a good way of linking the two datasets to filter for data in III, the question arises: how much of MIMIC-IV consists of MIMIC-III data? Does the 2008-2012 period in MIMIC-IV purely or mainly consist of MIMIC-III data, or was additional data added in that time period?

Also, to verify: is my assumption correct, that after adjusting for the anchor year shift, all data up to and including 2014 is potentially contaminated due to the anchor year group size?

Similar Issues

  • https://github.com/MIT-LCP/mimic-code/issues/1331
  • https://github.com/MIT-LCP/mimic-code/issues/815
  • https://github.com/MIT-LCP/mimic-code/issues/994

tsteffek, Apr 23 '23 21:04

Good questions.

> Since there does not seem to be a good way of linking the two datasets to filter for data in III, the question arises: how much of MIMIC-IV consists of MIMIC-III data? Does the 2008-2012 period in MIMIC-IV purely or mainly consist of MIMIC-III data, or was additional data added in that time period?

Most of the 2008-2012 data in MIMIC-IV is also in MIMIC-III. Not all of it, though, since MIMIC-III ceased data collection a bit before the end of 2012. I can't remember exactly when, but it was definitely sometime in the winter.

> Also, to verify: is my assumption correct, that after adjusting for the anchor year shift, all data up to and including 2014 is potentially contaminated due to the anchor year group size?

Yes, I would say so. Further, there are probably patients in MIMIC-III who also appear in later years of MIMIC-IV, since a patient can be readmitted. I'll raise it with the lab to discuss. I don't think there is any reason why we can't publish a list of all the subject_id values from MIMIC-IV which are also in MIMIC-III, but I haven't thought it through completely yet.
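
Until such a list exists, the conservative filter discussed above could be sketched roughly as follows. This is only an illustration, not an official method. It assumes that `anchor_year` and `anchor_year_group` come from the MIMIC-IV `patients` table and that the group is a string of the form `'2008 - 2010'`. The function names, the input layout, and the `known_overlap_ids` parameter (a placeholder for a future subject_id list) are all hypothetical. The lower bound of the group gives the earliest real year an admission could have occurred in, and a patient is dropped entirely if any admission could fall in or before the last MIMIC-III year. With three-year groups, requiring the earliest possible year to be after 2012 is the same as requiring the latest possible year to be after 2014.

```python
import re


def parse_year_group(group):
    """Turn an anchor_year_group string such as '2008 - 2010' into (2008, 2010)."""
    start, end = (int(year) for year in re.findall(r"\d{4}", group))
    return start, end


def earliest_real_year(shifted_year, anchor_year, anchor_year_group):
    """Lower bound on the real calendar year of an event dated in shifted time."""
    group_start, _ = parse_year_group(anchor_year_group)
    return shifted_year - anchor_year + group_start


def filter_unseen_admissions(patients, admissions, last_mimic3_year=2012,
                             known_overlap_ids=frozenset()):
    """Return hadm_ids of patients with no admission that could predate the cutoff.

    patients: {subject_id: (anchor_year, anchor_year_group)}
    admissions: iterable of (subject_id, hadm_id, shifted_admit_year)
    known_overlap_ids: hypothetical set of subject_ids known to be in MIMIC-III
    """
    admissions = list(admissions)
    excluded = set(known_overlap_ids)
    for subject_id, _, admit_year in admissions:
        anchor_year, group = patients[subject_id]
        # Exclude the whole patient if any stay might fall in the MIMIC-III era.
        if earliest_real_year(admit_year, anchor_year, group) <= last_mimic3_year:
            excluded.add(subject_id)
    return [hadm_id for subject_id, hadm_id, _ in admissions
            if subject_id not in excluded]
```

Excluding at the patient level rather than per admission addresses the readmission concern only partially: it cannot catch a patient whose MIMIC-III stays are absent from MIMIC-IV, which is where a published subject_id list would be needed.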

alistairewj, Apr 24 '23 18:04

That would be great!

In the meantime, your clarifications are already helping me, thank you for that.

tsteffek, Apr 28 '23 11:04