
SQLite import for mimic3 gives mixed column type warning

Open armando-fandango opened this issue 2 years ago • 4 comments

Prerequisites

  • [X] Put an X between the brackets on this line if you have done all of the following:
    • Checked the online documentation: https://mimic.mit.edu/
    • Checked that your issue isn't already addressed: https://github.com/MIT-LCP/mimic-code/issues?utf8=%E2%9C%93&q=

Description

While trying to import mimic3 into SQLite with import.py, I get the following error:

Starting processing DATETIMEEVENTS.csv.gz
mimic-code/mimic-iii/buildmimic/sqlite/import.py:25: DtypeWarning: Columns (13) have mixed types. Specify dtype option on import or set low_memory=False.
  for chunk in pd.read_csv(f, index_col="ROW_ID", chunksize=CHUNKSIZE):
...
Starting processing INPUTEVENTS_CV.csv.gz
/home/armando/projects/mimic-code/mimic-iii/buildmimic/sqlite/import.py:25: DtypeWarning: Columns (20,21) have mixed types. Specify dtype option on import or set low_memory=False.
  for chunk in pd.read_csv(f, index_col="ROW_ID", chunksize=CHUNKSIZE):
...
Starting processing NOTEEVENTS.csv.gz
/home/armando/projects/mimic-code/mimic-iii/buildmimic/sqlite/import.py:25: DtypeWarning: Columns (4,5) have mixed types. Specify dtype option on import or set low_memory=False.
  for chunk in pd.read_csv(f, index_col="ROW_ID", chunksize=CHUNKSIZE):
...
Starting processing CHARTEVENTS.csv.gz
/home/armando/projects/mimic-code/mimic-iii/buildmimic/sqlite/import.py:25: DtypeWarning: Columns (13) have mixed types. Specify dtype option on import or set low_memory=False.
  for chunk in pd.read_csv(f, index_col="ROW_ID", chunksize=CHUNKSIZE):
...

armando-fandango avatar Jan 25 '22 23:01 armando-fandango

Hi, I'm also running the import.py code and I ran into the same problem...

Did you manage to figure it out or find an alternative solution?

pshuwei avatar Jun 09 '23 14:06 pshuwei

It's not strictly an error, but it may result in an inconsistent data load (I haven't checked). Essentially, the load uses pandas as a convenience: pandas tries a low-memory load, fails, and falls back to a high-memory load. It can be fixed by specifying the known data types for each table in the read_csv call.
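
As a rough sketch for one table (NOTEEVENTS), something like the following — the dtype map and database filename here are illustrative and would need to be checked against the actual MIMIC-III schema, but it shows the shape of the fix:

```python
import sqlite3

import pandas as pd

CHUNKSIZE = 10_000

# Illustrative dtype map for NOTEEVENTS (verify against the MIMIC-III schema).
# Forcing explicit types on the sparse, nullable columns prevents the per-chunk
# mixed-type inference that triggers the DtypeWarning.
NOTEEVENTS_DTYPES = {
    "SUBJECT_ID": int,
    "HADM_ID": float,   # nullable, so float rather than int
    "CHARTDATE": str,
    "CHARTTIME": str,
    "STORETIME": str,
    "CATEGORY": str,
    "DESCRIPTION": str,
    "CGID": float,      # nullable
    "ISERROR": float,   # nullable
    "TEXT": str,
}

conn = sqlite3.connect("mimic3.db")  # placeholder database name
for chunk in pd.read_csv("NOTEEVENTS.csv.gz", index_col="ROW_ID",
                         chunksize=CHUNKSIZE, dtype=NOTEEVENTS_DTYPES):
    chunk.to_sql("NOTEEVENTS", conn, if_exists="append")
conn.close()
```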

alistairewj avatar Jun 09 '23 18:06 alistairewj

Since the column types are known in advance and are not going to change (it's a frozen/snapshot dataset), would it be good to add the column types to the import script? I can send a pull request if this solution is acceptable — a sketch of what I have in mind is below.
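
The filenames below come from the logs above; the per-table dtype maps are placeholders to be filled in from the schema (I'm not asserting which columns the warnings refer to):

```python
import pandas as pd

# Placeholder dtype maps keyed by CSV filename; the actual change would
# enumerate every column of every table from the MIMIC-III schema.
TABLE_DTYPES = {
    "DATETIMEEVENTS.csv.gz": {"STOPPED": str},   # guess at the flagged column
    "NOTEEVENTS.csv.gz": {"CHARTTIME": str, "STORETIME": str},
    "CHARTEVENTS.csv.gz": {"RESULTSTATUS": str}, # guess at the flagged column
}

def load_table(fname, conn, chunksize):
    """Load one CSV into SQLite, passing an explicit dtype map when we have one."""
    table = fname.split(".")[0]
    dtypes = TABLE_DTYPES.get(fname)  # None -> pandas infers, as it does today
    for chunk in pd.read_csv(fname, index_col="ROW_ID",
                             chunksize=chunksize, dtype=dtypes):
        chunk.to_sql(table, conn, if_exists="append")
```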

armando-fandango avatar Jun 21 '23 13:06 armando-fandango

Yes it would for sure, and yes we would love a PR!

alistairewj avatar Jun 21 '23 13:06 alistairewj