Add MedNLI Natural Language Inference Dataset and Task to PyHealth

Open AbrahamArellano opened this issue 8 months ago • 0 comments

MedNLI Dataset and Task Implementation for PyHealth

Contributor Information

  • Name: Abraham Arellano, Umesh Kumar
  • NetID: aa107, umesh2

Contribution Type

  • New Dataset: Medical Natural Language Inference (MedNLI)
  • New Task: Natural Language Inference classification

Description

This PR implements the MedNLI dataset and corresponding classification task for PyHealth. The Medical NLI dataset consists of 14,049 sentence pairs with clinical premises and hypotheses, manually annotated for textual entailment (entailment, contradiction, or neutral).

The implementation follows PyHealth's architectural patterns. It supports loading a configurable fraction of the data, reporting detailed dataset statistics, and integrating with existing PyHealth models.
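To illustrate the sample structure and the data-fraction option described above, here is a minimal sketch. It does not use the classes from this PR. The field names `sentence1`, `sentence2`, and `gold_label` assume an SNLI-style JSONL layout, the helpers `to_sample` and `take_fraction` and the label ordering are hypothetical, and the example records are invented placeholders rather than MedNLI data.

```python
import random
from typing import Dict, List

# Assumed label-to-index mapping; the actual implementation may order classes differently.
LABEL_TO_ID = {"entailment": 0, "contradiction": 1, "neutral": 2}


def to_sample(record: Dict[str, str]) -> Dict[str, object]:
    """Convert one raw record into a premise/hypothesis/label sample."""
    return {
        "premise": record["sentence1"],
        "hypothesis": record["sentence2"],
        "label": LABEL_TO_ID[record["gold_label"]],
    }


def take_fraction(records: List[dict], fraction: float, seed: int = 0) -> List[dict]:
    """Return a reproducible random subset holding roughly `fraction` of the records."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not records:
        return []
    k = max(1, int(len(records) * fraction))  # keep at least one record
    return random.Random(seed).sample(records, k)


# Invented placeholder records (not taken from MedNLI).
records = [
    {"sentence1": "Patient reports a mild cough.", "sentence2": "Patient has a cough.", "gold_label": "entailment"},
    {"sentence1": "Patient reports a mild cough.", "sentence2": "Patient has no symptoms.", "gold_label": "contradiction"},
    {"sentence1": "Patient reports a mild cough.", "sentence2": "Patient has a fever.", "gold_label": "neutral"},
    {"sentence1": "No fever was recorded.", "sentence2": "Patient is afebrile.", "gold_label": "entailment"},
]

subset = take_fraction(records, fraction=0.5)
samples = [to_sample(r) for r in subset]
```

Fixing the seed keeps the subset stable across runs, which helps when comparing models on a reduced dataset.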

Files to Review

  • pyhealth/datasets/mednli_dataset.py: Dataset implementation
  • pyhealth/datasets/configs/mednli.yaml: Dataset configuration
  • pyhealth/tasks/mednli_task.py: Task implementation
  • examples/mednli_example.py: Usage example
  • README.rst: Documentation (new section added)

Implementation Details

  • Dataset requires MIMIC-III credentials from PhysioNet
  • Compatible with Python 3.9; compatibility issues occur with Python 3.12+
  • Dependency constraints: numpy==1.24.3, pandas<2.0, polars, pydantic>=2.0.0
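The constraints above could be checked before loading the dataset with a short guard like the one below. This is a sketch, not part of the PR: the helper names `python_supported` and `installed_version` are hypothetical, and because the list does not mention Python 3.10 or 3.11, the check flags only 3.12 and later.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def python_supported(py_version: Tuple[int, int]) -> bool:
    """Return False for Python 3.12+, where compatibility issues are reported."""
    return py_version < (3, 12)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if not python_supported(sys.version_info[:2]):
    print("Warning: compatibility issues are reported for Python 3.12+.")

# Report what is installed for each package named in the constraints.
for name in ("numpy", "pandas", "polars", "pydantic"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```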

Testing

The implementation has been tested with full dataset loading, statistics generation, and sample extraction for all 14,049 examples. It properly integrates with PyHealth's task framework.
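As a rough illustration of what statistics generation involves, the sketch below computes the sample count, label distribution, and mean premise length. The function `summarize`, the sample keys, and the example sentences are assumptions made for illustration. They are not the PR's API or MedNLI data.

```python
from collections import Counter
from typing import Dict, List


def summarize(samples: List[Dict[str, str]]) -> Dict[str, object]:
    """Compute simple statistics over premise/hypothesis/label samples."""
    n = len(samples)
    label_counts = Counter(s["label"] for s in samples)
    # Whitespace tokenization is a simplification; a real tokenizer may differ.
    total_tokens = sum(len(s["premise"].split()) for s in samples)
    return {
        "num_samples": n,
        "label_counts": dict(label_counts),
        "avg_premise_tokens": total_tokens / n if n else 0.0,
    }


# Invented placeholder samples (not taken from MedNLI).
example_samples = [
    {"premise": "Patient denies chest pain.", "hypothesis": "Patient has no chest pain.", "label": "entailment"},
    {"premise": "No history of diabetes.", "hypothesis": "Patient is diabetic.", "label": "contradiction"},
    {"premise": "Blood pressure was stable overnight.", "hypothesis": "Patient takes medication.", "label": "neutral"},
]

stats = summarize(example_samples)
print(stats)
```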

Research Relevance

This implementation supports reproducing experiments from the "Do We Still Need Clinical LLMs?" paper, enhancing PyHealth's capabilities for clinical language understanding evaluation.

Note: Submitted to the "develop" branch as requested in the guidelines.

AbrahamArellano · May 03 '25 21:05