Add TextExtractionMIMIC4 task for EHR text extraction and embedding generation

Open jimenezzz opened this issue 1 month ago • 0 comments

Contributors

  • mjoan2
  • lu94
  • jesusaj2

Type of Contribution

Task + Example - New task implementation for text extraction from MIMIC-IV EHR data, along with a complete use case example

High-Level Description

We've implemented a new task, TextExtractionMIMIC4, that extracts structured text from MIMIC-IV EHR tables (such as labevents and prescriptions) and formats it for use with embedding models. The task plugs into PyHealth's dataset pipeline and supports configurable field extraction and event filtering.

Context: This task implements one component of the EHRXDiff paper, which focuses on predicting temporal changes in chest X-ray images from Electronic Health Records. It serves as a preprocessing step: it extracts and structures text from EHR tabular data, and the resulting text is then embedded and used as part of the multimodal input to the EHRXDiff framework.

Key features:

  • Configurable field extraction: You can specify which fields to extract from each EHR table (e.g., label, value, category for lab events)

  • Flexible event filtering: Supports both include and exclude filters based on field values (e.g., only keep lab events with category "Blood Gas", or exclude prescriptions containing certain drugs)

  • Visit-aware processing: Extracts text samples within each admission/visit time window, maintaining proper patient and visit associations

  • Default configurations: Comes with sensible defaults for labevents and prescriptions tables, but can be fully customized
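
To make the field extraction and filtering behaviour concrete, here is a minimal, self-contained sketch. It is not the code in text_extraction_mimic4.py: the config shape, the function names (keep_event, event_to_text), and the matching rules (exact match for include filters, case-insensitive substring match for exclude filters) are assumptions chosen for illustration.

```python
from typing import Any, Dict, List, Optional

# Hypothetical config shape; the real TextExtractionMIMIC4 options may differ.
CONFIG: Dict[str, Dict[str, Any]] = {
    "labevents": {
        "fields": ["label", "value", "category"],
        "include": {"category": ["Blood Gas"]},
        "exclude": {},
    },
    "prescriptions": {
        "fields": ["drug", "dose_val_rx", "dose_unit_rx"],
        "include": {},
        "exclude": {"drug": ["heparin"]},
    },
}


def keep_event(
    event: Dict[str, Any],
    include: Dict[str, List[str]],
    exclude: Dict[str, List[str]],
) -> bool:
    """Return True if the event passes every include filter and no exclude filter."""
    # Include filters: the field value must exactly match one allowed value (assumed rule).
    for field, allowed in include.items():
        if str(event.get(field, "")) not in allowed:
            return False
    # Exclude filters: drop the event if the field contains any blocked term (assumed rule).
    for field, blocked in exclude.items():
        value = str(event.get(field, "")).lower()
        if any(term.lower() in value for term in blocked):
            return False
    return True


def event_to_text(
    event_type: str,
    event: Dict[str, Any],
    config: Dict[str, Dict[str, Any]] = CONFIG,
) -> Optional[str]:
    """Format the configured fields as 'name: value' pairs, or return None if filtered out."""
    table = config.get(event_type)
    if table is None or not keep_event(event, table["include"], table["exclude"]):
        return None
    # Skip fields that are missing or empty rather than emitting blank values.
    parts = [f"{name}: {event[name]}" for name in table["fields"] if event.get(name) not in (None, "")]
    return ", ".join(parts) if parts else None


lab_event = {"label": "pH", "value": "7.35", "category": "Blood Gas"}
print(event_to_text("labevents", lab_event))
```

With this sketch, a lab event in the "Chemistry" category or a prescription whose drug name contains "heparin" returns None and so produces no text sample.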

The task outputs text samples that can be directly fed into language models (like BioBERT or ClinicalBERT) for generating embeddings. These embeddings are then used as part of the multimodal conditioning in the EHRXDiff framework to predict future chest X-ray images based on previous images and subsequent medical events.

We've also included a complete example/use case (mimic4_embeddings_biobert.py) that demonstrates the full workflow: loading MIMIC-IV data, extracting text using the task, and generating embeddings with BioClinical-ModernBERT. This example serves as a practical guide for users who want to use the task for representation learning or downstream ML tasks.
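
The last step of that workflow, pairing each embedding with its metadata, can be sketched roughly as follows. This is an illustration, not the contents of mimic4_embeddings_biobert.py: embed_samples, batched, and toy_embed are hypothetical names, and toy_embed is a stand-in for a real model call (for example, a function that tokenizes the batch and pools the transformer's hidden states).

```python
from typing import Callable, Dict, Iterator, List, Sequence

Sample = Dict[str, str]
EmbedFn = Callable[[List[str]], List[List[float]]]


def batched(items: Sequence[Sample], batch_size: int) -> Iterator[Sequence[Sample]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_samples(samples: Sequence[Sample], embed_fn: EmbedFn, batch_size: int = 32) -> List[Dict]:
    """Embed each sample's text and keep its other fields as metadata."""
    records: List[Dict] = []
    for batch in batched(samples, batch_size):
        vectors = embed_fn([sample["text"] for sample in batch])
        if len(vectors) != len(batch):
            raise ValueError("embed_fn must return one vector per input text")
        for sample, vector in zip(batch, vectors):
            # Everything except the raw text travels with the embedding.
            metadata = {key: value for key, value in sample.items() if key != "text"}
            records.append({**metadata, "embedding": list(vector)})
    return records


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in embedder (NOT a language model): text length and space count."""
    return [[float(len(text)), float(text.count(" "))] for text in texts]


samples = [
    {"patient_id": "p1", "visit_id": "v1", "event_type": "labevents", "text": "label: pH, value: 7.35"},
    {"patient_id": "p1", "visit_id": "v1", "event_type": "prescriptions", "text": "drug: Aspirin"},
    {"patient_id": "p2", "visit_id": "v7", "event_type": "labevents", "text": "label: pO2, value: 90"},
]
records = embed_samples(samples, toy_embed, batch_size=2)
```

Keeping the embedding function as a parameter means the batching and metadata handling can be tested without downloading a model; swapping in a real encoder only changes embed_fn.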

Files to Review for Testing

  1. Main implementation: PyHealth/pyhealth/tasks/text_extraction_mimic4.py

    • Core task class with all the extraction and filtering logic
  2. Example/Use case: PyHealth/examples/mimic4_embeddings_biobert.py

    • Complete end-to-end example demonstrating the full workflow:
      • Load MIMIC-IV dataset
      • Extract text using the TextExtractionMIMIC4 task
      • Generate embeddings with BioClinical-ModernBERT transformer model
      • Store embeddings with metadata for downstream analysis
    • This example shows a practical use case for representation learning from clinical text
    • Run with: python3 mimic4_embeddings_biobert.py
  3. Unit tests: PyHealth/tests/core/test_text_extraction_mimic4.py

    • Comprehensive test suite covering:
      • Initialization with default and custom configs
      • Field extraction logic
      • Include/exclude filtering
      • Event processing and sample generation
      • Edge cases (missing fields, invalid data, etc.)
    • Run with: python3 test_text_extraction_mimic4.py

The tests verify that the task correctly extracts text, applies filters, handles edge cases, and produces the expected output format with patient_id, visit_id, event_type, and text fields.
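
For reference, the output format and the visit-window behaviour can be illustrated with a small sketch. The function name samples_for_visit, the event dictionary layout, and the policy of skipping events without text or a timestamp are assumptions for illustration; the actual task reads events from PyHealth's patient objects.

```python
from datetime import datetime
from typing import Any, Dict, List


def samples_for_visit(
    patient_id: str,
    visit_id: str,
    admit_time: datetime,
    discharge_time: datetime,
    events: List[Dict[str, Any]],
) -> List[Dict[str, str]]:
    """Keep events inside [admit_time, discharge_time] and tag them with IDs."""
    samples: List[Dict[str, str]] = []
    for event in events:
        text = event.get("text")
        timestamp = event.get("timestamp")
        # Skip events with no usable text or timestamp (an assumed edge-case policy).
        if not text or timestamp is None:
            continue
        if admit_time <= timestamp <= discharge_time:
            samples.append(
                {
                    "patient_id": patient_id,
                    "visit_id": visit_id,
                    "event_type": event["event_type"],
                    "text": text,
                }
            )
    return samples


events = [
    {"event_type": "labevents", "timestamp": datetime(2180, 5, 7, 8, 0), "text": "label: pH, value: 7.35"},
    # Falls after discharge, so it is not attributed to this visit.
    {"event_type": "prescriptions", "timestamp": datetime(2180, 5, 9, 12, 0), "text": "drug: Aspirin"},
    # Inside the window but has empty text, so it is skipped.
    {"event_type": "labevents", "timestamp": datetime(2180, 5, 7, 9, 0), "text": ""},
]
samples = samples_for_visit(
    "p1", "v1", datetime(2180, 5, 6, 22, 0), datetime(2180, 5, 8, 10, 0), events
)
```

Here only the first event survives, yielding one sample with the four fields listed above.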

jimenezzz · Dec 04 '25 04:12