Add MedNLI Natural Language Inference Dataset and Task to PyHealth
MedNLI Dataset and Task Implementation for PyHealth
Contributor Information
- Names: Abraham Arellano, Umesh Kumar
- NetIDs: aa107, umesh2
Contribution Type
- New Dataset: Medical Natural Language Inference (MedNLI)
- New Task: Natural Language Inference classification
Description
This PR adds the MedNLI (Medical Natural Language Inference) dataset and a corresponding classification task to PyHealth. MedNLI consists of 14,049 sentence pairs, each a clinical premise paired with a hypothesis, manually annotated with one of three textual entailment labels: entailment, contradiction, or neutral.
The implementation follows PyHealth's architectural patterns. It supports loading a configurable fraction of the data, reports detailed dataset statistics, and integrates with existing PyHealth models.
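The data shape and the configurable-fraction idea described above can be sketched roughly as follows. This is an illustration, not the code in `mednli_dataset.py` or `mednli_task.py`: it assumes each record is a JSON object with `sentence1`, `sentence2`, and `gold_label` fields, and the helper names `to_sample` and `take_fraction`, the label order, and the example sentences are all hypothetical.

```python
import json
import random

# Assumed label order; the PR may index the three classes differently.
LABEL_TO_ID = {"entailment": 0, "contradiction": 1, "neutral": 2}


def to_sample(line):
    """Parse one JSON line into a premise/hypothesis/label sample (hypothetical helper)."""
    record = json.loads(line)
    return {
        "premise": record["sentence1"],
        "hypothesis": record["sentence2"],
        "label": LABEL_TO_ID[record["gold_label"]],
    }


def take_fraction(samples, fraction, seed=0):
    """Return a reproducible random subset holding `fraction` of the samples."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    k = max(1, int(len(samples) * fraction))
    return rng.sample(samples, k)


# Made-up lines for illustration only; these are not taken from MedNLI.
lines = [
    '{"sentence1": "The patient denies chest pain.", "sentence2": "The patient has chest pain.", "gold_label": "contradiction"}',
    '{"sentence1": "Blood pressure was 180/110.", "sentence2": "The patient is hypertensive.", "gold_label": "entailment"}',
    '{"sentence1": "The patient was given insulin.", "sentence2": "The patient has a history of smoking.", "gold_label": "neutral"}',
    '{"sentence1": "The patient is intubated.", "sentence2": "The patient needs respiratory support.", "gold_label": "entailment"}',
]

samples = [to_sample(line) for line in lines]
subset = take_fraction(samples, 0.5)
```

Fixing the seed keeps the subset stable across runs, which matters when comparing models on a reduced dataset.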
Files to Review
- pyhealth/datasets/mednli_dataset.py: Dataset implementation
- pyhealth/datasets/configs/mednli.yaml: Dataset configuration
- pyhealth/tasks/mednli_task.py: Task implementation
- examples/mednli_example.py: Usage example
- README.rst: Documentation (added section)
Implementation Details
- Access to the dataset requires MIMIC-III credentials from PhysioNet
- Compatible with Python 3.9; compatibility issues occur with Python 3.12+
- Dependency constraints: numpy==1.24.3, pandas<2.0, polars, pydantic>=2.0.0
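The Python version note above could be turned into an early runtime check along these lines. This is a hedged sketch: the function name and the status strings are hypothetical, and versions between 3.9 and 3.12 are reported as unverified because this PR does not say how they behave.

```python
import sys


def python_version_status(version_info=sys.version_info):
    """Classify a Python version against the compatibility notes in this PR."""
    major_minor = tuple(version_info[:2])
    if major_minor >= (3, 12):
        return "known-issues"   # the PR reports compatibility issues with 3.12+
    if major_minor == (3, 9):
        return "compatible"     # the version the PR states as compatible
    return "unverified"         # not covered by the PR description


status = python_version_status()
if status != "compatible":
    print(f"Python {sys.version_info[0]}.{sys.version_info[1]}: {status}")
```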
Testing
The implementation has been tested with full dataset loading, statistics generation, and sample extraction for all 14,049 examples. It properly integrates with PyHealth's task framework.
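A statistics check of the kind described above might look like the following sketch. It does not use the PR's actual statistics method; `label_statistics` is a hypothetical helper, and the toy samples stand in for the real data.

```python
from collections import Counter


def label_statistics(samples):
    """Count samples per label and compute each label's share of the total."""
    counts = Counter(sample["label"] for sample in samples)
    total = sum(counts.values())
    return {
        label: {"count": n, "share": n / total}
        for label, n in counts.items()
    }


# Toy samples for illustration; only the label field matters here.
toy_samples = [
    {"label": "entailment"},
    {"label": "contradiction"},
    {"label": "neutral"},
    {"label": "entailment"},
]

stats = label_statistics(toy_samples)
total_count = sum(entry["count"] for entry in stats.values())
```

On the full dataset, the per-label counts would be expected to sum to the 14,049 examples reported in this PR.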
Research Relevance
This implementation supports reproducing experiments from the paper "Do We Still Need Clinical Language Models?", extending PyHealth's capabilities for evaluating clinical language understanding.
Note: Submitted to the "develop" branch as requested in the guidelines.