Add MIMICCXRReportDataset and Radiology Report Example Script for Text-Based Clinical Prediction Tasks

Open TahyunM opened this issue 1 month ago • 0 comments

This pull request introduces a new dataset loader, MIMICCXRReportDataset, which provides structured access to radiology reports derived from the MIMIC-CXR dataset. This contribution expands PyHealth’s support for text-based clinical tasks and enables researchers to easily incorporate free-text radiology reports into prediction pipelines using the existing PyHealth ecosystem.

The primary motivation for this addition is that many modern clinical NLP tasks involve analyzing radiology reports rather than structured codes or tabular EHR data. PyHealth already provides strong support for multimodal modeling and time-series tasks, but until now it did not include a dataset class specifically designed for radiology free-text analysis. The new MIMICCXRReportDataset addresses this gap. It loads a user-provided CSV file containing report IDs, full report text, and document-level abnormality labels, and it maps each report into the standard PyHealth patient–visit abstraction. This makes it possible to use PyHealth’s task utilities, models, batching strategies, and trainer without requiring users to manually write preprocessing or dataset glue code.

The dataset class is intentionally lightweight and flexible. Rather than bundling the raw MIMIC-CXR data (which cannot be redistributed due to licensing restrictions), the class reads a user-supplied CSV file that has already been preprocessed locally. This design follows PyHealth’s conventions for external datasets such as the Synthetic MIMIC-III dataset. The column names for the report ID, the text field, and the label field are configurable, so the loader can adapt to a range of user workflows.
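To make the CSV contract concrete, here is a minimal standalone sketch of reading a report CSV with configurable column names and wrapping each row in a patient/visit-style record. It is not the code in this PR and does not use PyHealth: `ReportVisit`, `ReportPatient`, `load_reports`, the default column names, and the sample rows are all hypothetical, and it assumes each report becomes its own single-visit patient because the CSV described above carries only a report ID.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class ReportVisit:
    """One radiology report stored as a single visit (hypothetical structure)."""
    visit_id: str
    text: str
    label: int


@dataclass
class ReportPatient:
    """A patient holding one or more report visits (hypothetical structure)."""
    patient_id: str
    visits: list = field(default_factory=list)


def load_reports(csv_file, id_col="report_id", text_col="report_text", label_col="label"):
    """Read report rows and wrap each one in a patient -> visit record.

    Assumption: the CSV has no patient identifier, so each report is keyed
    by its report ID and treated as a single-visit patient.
    """
    reader = csv.DictReader(csv_file)
    missing = {id_col, text_col, label_col} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV is missing columns: {sorted(missing)}")

    patients = {}
    for row in reader:
        report_id = row[id_col]
        visit = ReportVisit(
            visit_id=report_id,
            text=row[text_col].strip(),
            label=int(row[label_col]),
        )
        patient = patients.setdefault(report_id, ReportPatient(patient_id=report_id))
        patient.visits.append(visit)
    return patients


# Made-up rows for illustration only; these are not real MIMIC-CXR reports.
sample_csv = io.StringIO(
    "study,findings,abnormal\n"
    "r1,No acute cardiopulmonary process.,0\n"
    "r2,Right lower lobe opacity.,1\n"
)
patients = load_reports(sample_csv, id_col="study", text_col="findings", label_col="abnormal")
```

Passing the column names as arguments is what lets the same loader accept CSVs produced by different local preprocessing scripts; a missing column fails early with a clear error instead of surfacing later as a `KeyError`.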

Alongside the dataset loader, this pull request includes a runnable example script, example_mimic_cxr_report.py. This example demonstrates an end-to-end workflow in PyHealth using free-text radiology reports. It shows how to:

1. Load the dataset through the new dataset loader.
2. Add a text-based abnormality prediction task using add_prediction_task.
3. Split the dataset into patient-level training, validation, and test sets.
4. Initialize a BERT-based model from pyhealth.models.
5. Train and evaluate the model using PyHealth’s Trainer class.
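The patient-level split in the steps above is the part most easily gotten wrong, so here is a minimal standalone sketch of the idea. It is an illustration only, not PyHealth's own splitting utility and not the code in the example script: `split_patient_ids`, the 80/10/10 ratios, the seed, and the patient IDs are all assumptions.

```python
import random


def split_patient_ids(patient_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle unique patient IDs and cut them into train/val/test lists.

    Splitting on patient IDs (rather than on individual reports) keeps all
    of a patient's reports in the same split, which avoids leakage between
    training and evaluation data.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")

    # Sort before shuffling so the result depends only on the seed,
    # not on the order in which IDs were passed in.
    unique_ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique_ids)

    n_train = int(len(unique_ids) * ratios[0])
    n_val = int(len(unique_ids) * ratios[1])
    train = unique_ids[:n_train]
    val = unique_ids[n_train:n_train + n_val]
    test = unique_ids[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Hypothetical patient IDs; duplicates stand in for patients with several reports.
ids = [f"p{i % 20}" for i in range(50)]
train_ids, val_ids, test_ids = split_patient_ids(ids)
```

Reports are then assigned to a split by looking up their patient ID, so no patient contributes text to both the training and the test set.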

This example shows that the new dataset plugs into PyHealth’s existing supervised-learning workflow (task definition, patient-level splitting, model, and Trainer), so researchers can train NLP models for radiology report classification with minimal boilerplate.

The implementation has been verified locally using a subsampled version of the MIMIC-CXR report dataset. The dataset loader correctly generates PyHealth patient–visit structures, the task creation mechanism operates as expected, and the example script runs successfully end-to-end on CPU.

This contribution benefits the PyHealth community by bringing a widely used type of clinical text into its research pipelines. Radiology reports underpin many downstream clinical prediction tasks, including abnormality classification, impression summarization, and device or pathology detection. By providing a standardized dataset loader and example, this PR lowers the barrier for students and researchers to experiment with radiology-based NLP tasks in PyHealth.

Dec 05 '25 21:12 TahyunM