Add MedNLI Natural Language Inference Dataset and Task to PyHealth

Open AbrahamArellano opened this issue 8 months ago • 0 comments

MedNLI Dataset and Task Implementation for PyHealth

Contributor Information

Names: Abraham Arellano, Umesh Kumar
NetIDs: aa107, umesh2

Contribution Type

New Dataset: Medical Natural Language Inference (MedNLI)
New Task: Natural Language Inference classification

Description

This PR adds the MedNLI dataset and a corresponding classification task to PyHealth. MedNLI consists of 14,049 sentence pairs, each a clinical premise paired with a hypothesis and manually annotated with one of three textual entailment labels: entailment, contradiction, or neutral.
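To make the data shape concrete, here is a minimal sketch of how one sentence pair could be parsed into a labelled sample. It is not the code in this PR: the `NLIPair` class, the `parse_record` helper, the integer label ids, and the JSON field names (`sentence1`, `sentence2`, `gold_label`) are assumptions for illustration, and the example record is made up rather than taken from MedNLI.

```python
import json
from dataclasses import dataclass

# Assumed mapping; the three class names come from the description above,
# the integer ids are arbitrary.
LABEL_TO_ID = {"entailment": 0, "contradiction": 1, "neutral": 2}


@dataclass(frozen=True)
class NLIPair:
    """One premise/hypothesis pair with an integer label (hypothetical type)."""
    premise: str
    hypothesis: str
    label: int


def parse_record(line: str) -> NLIPair:
    """Parse one JSON line into an NLIPair. Field names are assumptions."""
    record = json.loads(line)
    return NLIPair(
        premise=record["sentence1"],
        hypothesis=record["sentence2"],
        label=LABEL_TO_ID[record["gold_label"]],
    )


# Made-up record for illustration only.
example_line = json.dumps({
    "sentence1": "The patient was afebrile on admission.",
    "sentence2": "The patient had a fever on admission.",
    "gold_label": "contradiction",
})
pair = parse_record(example_line)
print(pair)
```

An unknown label raises `KeyError` here, which surfaces malformed records early instead of silently assigning them a class.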

The implementation follows PyHealth's architectural patterns. It supports loading a configurable fraction of the data, reporting detailed dataset statistics, and integration with existing PyHealth models.
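One way a configurable data fraction could work is deterministic subsampling with a fixed seed. The sketch below is a standalone illustration, not the PR's implementation: `take_fraction`, its parameters, and the default seed are hypothetical names chosen for this example.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def take_fraction(items: Sequence[T], fraction: float, seed: int = 42) -> List[T]:
    """Return a reproducible subset holding roughly `fraction` of `items`.

    Original order is preserved so the subset stays easy to inspect.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")
    if not items:
        return []
    # Keep at least one item so tiny fractions never yield an empty subset.
    count = max(1, int(len(items) * fraction))
    rng = random.Random(seed)  # local RNG, leaves global random state alone
    indices = sorted(rng.sample(range(len(items)), count))
    return [items[i] for i in indices]


subset = take_fraction(list(range(100)), fraction=0.25)
print(len(subset))
```

Using a local `random.Random` instance keeps the selection repeatable across runs without affecting other code that relies on the global random state.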

Files to Review

pyhealth/datasets/mednli_dataset.py: Dataset implementation
pyhealth/datasets/configs/mednli.yaml: Dataset configuration
pyhealth/tasks/mednli_task.py: Task implementation
examples/mednli_example.py: Usage example
README.rst: Documentation (added section)

Implementation Details

The dataset requires credentialed MIMIC-III access through PhysioNet
Compatible with Python 3.9; compatibility issues occur with Python 3.12 and later
Dependency constraints: numpy==1.24.3, pandas<2.0, polars, pydantic>=2.0.0
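The Python version note above could be turned into an early check in a setup or example script. This is a sketch only: `classify_python` is a hypothetical helper, and because the issue states results for 3.9 and 3.12+ only, every other version is reported as untested rather than as working or broken.

```python
import sys
from typing import Tuple


def classify_python(version: Tuple[int, int]) -> str:
    """Classify a (major, minor) Python version against the notes above."""
    if version >= (3, 12):
        return "unsupported"  # compatibility issues reported for 3.12+
    if version == (3, 9):
        return "ok"           # the version stated as compatible
    return "untested"         # the issue says nothing about these versions


status = classify_python((sys.version_info.major, sys.version_info.minor))
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```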

Testing

The implementation has been tested with full dataset loading, statistics generation, and sample extraction for all 14,049 examples. It properly integrates with PyHealth's task framework.
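A check of this kind could be sketched as follows. The helpers `label_distribution` and `check_size` are hypothetical and independent of PyHealth; the only figure taken from this issue is the expected total of 14,049 examples, and the labels in the demo are toy values.

```python
from collections import Counter
from typing import Dict, Iterable

EXPECTED_TOTAL = 14_049  # total number of pairs stated in the description


def label_distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Return the share of each label, or an empty dict for no labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def check_size(num_samples: int, expected: int = EXPECTED_TOTAL) -> bool:
    """True when the number of loaded samples matches the expected total."""
    return num_samples == expected


# Toy labels for illustration; a real run would pass the loaded samples' labels.
toy_labels = ["entailment", "neutral", "contradiction", "entailment"]
distribution = label_distribution(toy_labels)
print(distribution)
print(check_size(len(toy_labels)))
```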

Research Relevance

This implementation supports reproducing experiments from the paper "Do We Still Need Clinical Language Models?", enhancing PyHealth's capabilities for evaluating clinical language understanding.

Note: Submitted to the "develop" branch as requested in the guidelines.

May 03 '25 21:05 AbrahamArellano