ChestX-ray14 Dataset and Classification Tasks

Open EricSchrock opened this issue 8 months ago • 0 comments

Author: Eric Schrock ([email protected])
Dataset: https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345
Dataset paper: https://arxiv.org/abs/1705.02315

Overview

This PR adds dataset, binary classification task, and multilabel classification task classes for the ChestX-ray14 dataset.

Testing

python -m pyhealth.unittests.test_datasets.test_chestxray14
python -m pyhealth.datasets.chestxray14 --config "pyhealth/datasets/configs/chestxray14.yaml" --download --partial

Dataset (chestxray14.py and chestxray14.yaml)

  1. Optionally downloads the dataset from Box, either in full or as a partial download.
  2. Indexes the image paths to their metadata.
  3. Initializes the BaseDataset class to better integrate with the rest of PyHealth.
  4. Provides a basic __main__ block for manual testing.
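
The indexing step (item 2) can be pictured with a minimal sketch. This is not the code in chestxray14.py: the helper name `index_images`, the output field names, and the sample rows are hypothetical, and the column names are assumed to match the headers of the metadata CSV.

```python
import csv
import io
from pathlib import Path


def index_images(metadata_csv, image_dir):
    """Map each image path to its metadata row (hypothetical helper)."""
    index = {}
    for row in csv.DictReader(metadata_csv):
        # "Image Index" is assumed to hold the image file name.
        path = Path(image_dir) / row["Image Index"]
        index[str(path)] = {
            "patient_id": row["Patient ID"],
            "patient_age": int(row["Patient Age"]),
            # Findings are assumed to be pipe-separated in one column.
            "labels": row["Finding Labels"].split("|"),
        }
    return index


# Two fabricated rows in the assumed CSV layout (values are made up).
SAMPLE_CSV = io.StringIO(
    "Image Index,Finding Labels,Patient ID,Patient Age\n"
    "00000001_000.png,Cardiomegaly|Emphysema,1,58\n"
    "00000002_000.png,No Finding,2,81\n"
)
INDEX = index_images(SAMPLE_CSV, "images")
```

Keying the index by path keeps the lookup from an image on disk to its labels and patient fields a single dictionary access.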

Binary Classification Task (chestxray14_binary_classification.py)

  1. Creates labeled samples for any of the 14 diseases in the dataset.
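
As a rough illustration of what a labeled sample for one disease looks like, the sketch below derives a 0/1 label from a pipe-separated findings string. The function name `binary_label` and the sample dictionaries are hypothetical and do not reflect the ChestXray14BinaryClassification interface.

```python
def binary_label(finding_labels: str, disease: str) -> int:
    """Return 1 if `disease` is among the pipe-separated findings, else 0."""
    # Split first so that a disease name never matches a substring
    # of a different finding.
    return int(disease in finding_labels.split("|"))


# Fabricated records: (image file name, findings string).
RECORDS = [
    ("00000001_000.png", "Cardiomegaly|Emphysema"),
    ("00000002_000.png", "No Finding"),
    ("00000003_000.png", "Hernia"),
]

# One labeled sample per image for the chosen disease.
SAMPLES = [
    {"path": name, "label": binary_label(findings, "Cardiomegaly")}
    for name, findings in RECORDS
]
```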

Multilabel Classification Task (chestxray14_multilabel_classification.py)

  1. Creates multilabel samples from the dataset.
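
A multilabel sample can be thought of as a multi-hot vector over the 14 disease classes. The sketch below is an assumption about the encoding, not the output format of ChestXray14MultilabelClassification; the `multi_hot` helper and the alphabetical class order are illustrative choices.

```python
# The 14 disease classes of ChestX-ray14, in alphabetical order.
# "No Finding" is not a class; it maps to an all-zero vector.
DISEASES = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Effusion",
    "Emphysema", "Fibrosis", "Hernia", "Infiltration", "Mass", "Nodule",
    "Pleural_Thickening", "Pneumonia", "Pneumothorax",
]


def multi_hot(finding_labels: str) -> list[int]:
    """Encode a pipe-separated findings string as a 14-element 0/1 vector."""
    present = set(finding_labels.split("|"))
    return [int(disease in present) for disease in DISEASES]


VECTOR = multi_hot("Cardiomegaly|Hernia")
EMPTY = multi_hot("No Finding")
```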

Unit Tests (test_chestxray14.py)

  1. Fabricates test data to avoid image downloads.
  2. Tests the public ChestXray14Dataset methods.
  3. Tests two sample datasets generated with ChestXray14BinaryClassification.
    1. One for Hernia
    2. One for Cardiomegaly
  4. Tests a sample dataset generated with ChestXray14MultilabelClassification.
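
The idea of fabricating test data instead of downloading images can be sketched as follows. This is not the content of test_chestxray14.py: the `fabricate_dataset` helper, the file layout, and the empty placeholder files are assumptions, and a test that actually decodes images would need valid image bytes rather than empty files.

```python
import csv
import tempfile
import unittest
from pathlib import Path


def fabricate_dataset(root: Path, n_images: int = 3) -> Path:
    """Write a tiny metadata CSV plus placeholder image files (hypothetical)."""
    images = root / "images"
    images.mkdir()
    csv_path = root / "metadata.csv"
    with csv_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Image Index", "Finding Labels"])
        for i in range(n_images):
            name = f"{i:08d}_000.png"
            (images / name).write_bytes(b"")  # empty placeholder, not a real PNG
            writer.writerow([name, "Hernia" if i == 0 else "No Finding"])
    return csv_path


class FabricatedDataTest(unittest.TestCase):
    def setUp(self):
        # A fresh temporary directory per test keeps tests independent.
        self.tmp = tempfile.TemporaryDirectory()
        self.csv_path = fabricate_dataset(Path(self.tmp.name))

    def tearDown(self):
        self.tmp.cleanup()

    def test_every_row_has_an_image(self):
        with self.csv_path.open(newline="") as f:
            rows = list(csv.DictReader(f))
        self.assertEqual(len(rows), 3)
        for row in rows:
            image = self.csv_path.parent / "images" / row["Image Index"]
            self.assertTrue(image.exists())
```

Because everything lives in a temporary directory, such a test needs no network access and leaves nothing behind after it runs.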

Review Note (source of data download)

The ChestX-ray14 dataset is available from at least two sources, Box and Kaggle, but the two are not identical: the Box version has corrections that have not been copied back to Kaggle (for details see here).

I chose Box as the download source because the dataset currently includes "patient age", which is one of the corrected fields. However, I couldn't find a way to download the image metadata CSV file from Box (the request returns a file preview instead of the file), so I created a publicly available mirror of that file on Google Drive.

If downloading Data_Entry_2017_v2020.csv from the Google Drive mirror is not acceptable, I can remove "patient age" from the dataset and use the Kaggle dataset instead. #343 shows an example of downloading ChestX-ray14 from Kaggle.

EricSchrock, May 03 '25 18:05