ChestX-ray14 Dataset and Classification Tasks
Author: Eric Schrock ([email protected])
Dataset: https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345
Dataset paper: https://arxiv.org/abs/1705.02315
Overview
This PR adds dataset, binary classification task, and multilabel classification task classes for the ChestX-ray14 dataset.
Testing
python -m pyhealth.unittests.test_datasets.test_chestxray14
python -m pyhealth.datasets.chestxray14 --config "pyhealth/datasets/configs/chestxray14.yaml" --download --partial
Dataset (chestxray14.py and chestxray14.yaml)
- Optionally downloads the dataset from Box, either in full or as a partial download.
- Indexes the image paths to their metadata.
- Initializes the BaseDataset class to better integrate with the rest of PyHealth.
- Provides a basic __main__ block for manual testing.
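The indexing step can be pictured with a minimal sketch. It is not the code in chestxray14.py: the function name index_images is hypothetical, and it assumes the metadata CSV has an "Image Index" column holding each image's file name and that the images are PNG files somewhere under the download root.

```python
import csv
from pathlib import Path


def index_images(root, metadata_csv):
    """Map each image file name found under root to its path and metadata row."""
    # Recursively collect PNG files, keyed by file name.
    paths = {p.name: p for p in Path(root).rglob("*.png")}
    index = {}
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            name = row["Image Index"]  # assumed column name
            # Skip rows whose image is absent, e.g. after a partial download.
            if name in paths:
                index[name] = {"path": str(paths[name]), **row}
    return index
```

Skipping metadata rows without a matching file is one way to keep a partial download usable; the actual class may handle that case differently.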
Binary Classification Task (chestxray14_binary_classification.py)
- Creates labeled samples for any of the 14 diseases in the dataset.
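As a rough illustration of the labeling (not the task class itself), the sketch below assumes each record carries a "Finding Labels" string with findings separated by "|", as in the dataset's metadata CSV. The function name make_binary_samples and the record keys "path" and "label" are hypothetical.

```python
def make_binary_samples(records, disease):
    """Label each record 1 if `disease` is among its findings, else 0."""
    samples = []
    for record in records:
        findings = record["Finding Labels"].split("|")
        samples.append({"path": record["path"], "label": int(disease in findings)})
    return samples


records = [
    {"path": "a.png", "Finding Labels": "Cardiomegaly|Effusion"},
    {"path": "b.png", "Finding Labels": "No Finding"},
]
labels = [s["label"] for s in make_binary_samples(records, "Cardiomegaly")]
print(labels)  # [1, 0]
```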
Multilabel Classification Task (chestxray14_multilabel_classification.py)
- Creates multilabel samples from the dataset.
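One common way to represent a multilabel target is a multi-hot vector over the 14 disease classes. The sketch below shows that encoding under the assumption that findings arrive as a "|"-separated string. The names CLASSES and multi_hot are hypothetical, and the task class may encode labels differently.

```python
# The 14 disease labels in ChestX-ray14, in alphabetical order.
CLASSES = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Effusion",
    "Emphysema", "Fibrosis", "Hernia", "Infiltration", "Mass", "Nodule",
    "Pleural_Thickening", "Pneumonia", "Pneumothorax",
]


def multi_hot(finding_labels):
    """Encode a '|'-separated findings string as a 14-element 0/1 list."""
    findings = set(finding_labels.split("|"))
    # "No Finding" matches no class, so it yields an all-zero vector.
    return [int(name in findings) for name in CLASSES]
```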
Unit Tests (test_chestxray14.py)
- Fabricates test data to avoid image downloads.
- Tests the public ChestXray14Dataset methods.
- Tests two sample datasets generated with ChestXray14BinaryClassification.
  - One for Hernia
  - One for Cardiomegaly
- Tests a sample dataset generated with ChestXray14MultilabelClassification.
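Fabricating test data could look roughly like the sketch below, which writes tiny valid PNG files and a matching metadata CSV using only the standard library. The helper names (tiny_png, fabricate_dataset), the "images" directory, and the "metadata.csv" file name are all hypothetical. The fixtures in test_chestxray14.py may be built differently.

```python
import csv
import struct
import zlib
from pathlib import Path


def _chunk(tag, data):
    """Build one PNG chunk: length, tag, data, CRC over tag + data."""
    body = tag + data
    return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))


def tiny_png(size=2):
    """Return the bytes of a black size x size 8-bit grayscale PNG."""
    ihdr = struct.pack(">IIBBBBB", size, size, 8, 0, 0, 0, 0)
    # Each scanline is a filter byte (0 = none) followed by `size` pixels.
    raw = (b"\x00" + b"\x00" * size) * size
    return (b"\x89PNG\r\n\x1a\n" + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(raw)) + _chunk(b"IEND", b""))


def fabricate_dataset(root, rows):
    """Write one tiny PNG per row plus a metadata CSV; return the CSV path."""
    image_dir = Path(root) / "images"
    image_dir.mkdir(parents=True, exist_ok=True)
    for row in rows:
        (image_dir / row["Image Index"]).write_bytes(tiny_png())
    csv_path = Path(root) / "metadata.csv"
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return csv_path
```

Calling fabricate_dataset inside a tempfile.TemporaryDirectory() keeps the fixtures out of the repository and cleans them up after each test.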
Review Note (source of data download)
The ChestX-ray14 dataset is available from at least two sources, Box and Kaggle. However, the two sources are not identical: the Box version has corrections that have not been copied back to Kaggle (for details see here).
I chose Box as the download source because the dataset currently includes "patient age", which is one of the corrected fields. However, I couldn't find a way to download the image metadata CSV file from Box (the link downloads a file preview instead), so I created a publicly available mirror of that file on Google Drive.
If downloading the Data_Entry_2017_v2020.csv file from the Google Drive mirror is not acceptable, I can remove "patient age" from the dataset and use the Kaggle dataset instead. #343 shows an example of downloading ChestX-ray14 from Kaggle.