Adding TUH EEG Data Preprocessing Contribution with Test Cases

Open Rohit-R-Rao opened this issue 1 month ago • 0 comments

Contributors: Rohit Rao - rohit8

Type: Unique Contribution (Preprocessing Pipeline / Reproduction Example)

Description: This PR introduces a complete preprocessing pipeline for the TUH EEG Seizure Corpus (v2.0.3). It serves as a reproduction of the data processing steps used in Lee et al. (2022) ("Real-Time Seizure Detection using EEG") to prepare raw clinical EEG data for deep learning models.

Why this is a unique contribution: Unlike standard PyHealth dataset classes that often expect pre-cleaned data, this contribution provides a reusable pipeline for raw EDF ingestion. It handles:

Bipolar Montage Conversion: The standard clinical preprocessing step for seizure detection.
Signal Resampling & Alignment: Matching variable sampling rates to model requirements.
Annotation Parsing: Aligning .tse seizure labels with signal windows.
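
For reviewers unfamiliar with these steps, here is a minimal sketch of two of them: montage conversion and annotation alignment. It is not the code in process_TUH_dataset.py. The function names, the three-pair montage subset, and the assumed .tse layout (a `version = ...` header followed by `start stop label confidence` rows) are illustrative assumptions. Resampling is left out because it normally relies on a signal-processing library.

```python
from typing import Dict, List, Tuple

# Hypothetical subset of a bipolar montage: (anode, cathode) electrode pairs.
BIPOLAR_PAIRS: List[Tuple[str, str]] = [("FP1", "F7"), ("F7", "T3"), ("FP2", "F8")]


def to_bipolar(
    signals: Dict[str, List[float]],
    pairs: List[Tuple[str, str]] = BIPOLAR_PAIRS,
) -> Dict[str, List[float]]:
    """Derive each bipolar channel as anode minus cathode, sample by sample."""
    return {
        f"{a}-{c}": [x - y for x, y in zip(signals[a], signals[c])]
        for a, c in pairs
    }


def parse_tse(text: str) -> List[Tuple[float, float, str]]:
    """Parse 'start stop label confidence' rows; skip the header and blank lines."""
    events = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:  # header ("version = ...") and blanks do not have 4 fields
            continue
        events.append((float(parts[0]), float(parts[1]), parts[2]))
    return events


def label_windows(
    events: List[Tuple[float, float, str]],
    duration_s: float,
    window_s: float,
    background: str = "bckg",
) -> List[int]:
    """Label each fixed-length window 1 if it overlaps any non-background event."""
    labels = []
    for i in range(int(duration_s // window_s)):
        w_start, w_end = i * window_s, (i + 1) * window_s
        hit = any(
            label != background and start < w_end and stop > w_start
            for start, stop, label in events
        )
        labels.append(int(hit))
    return labels
```

With a 12-second recording whose seizure spans 4 s to 9 s, `label_windows(events, 12.0, 4.0)` marks the second and third 4-second windows as positive. Any overlap counts as a seizure window in this sketch. The actual pipeline may use a different rule, for example a minimum overlap fraction.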

Note on Testing & Data Privacy: This contribution covers only the initial processing of the TUH dataset. I cannot include real sample data because of the data provider's privacy terms and the corpus's size (>100GB). To work around this, I have written a Mock Data Generator (testcases.py) that creates synthetic EDF files and annotations in the TUSZ format.
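
To show what the annotation half of such a generator could look like, the sketch below writes a synthetic .tse file using only the standard library. It is an illustration, not the contents of testcases.py: `write_mock_tse`, the file name, and the assumed layout (a `version = tse_v1.0.0` header, a blank line, then `start stop label confidence` rows) are assumptions. Writing the matching EDF signal file would need an EDF library and is not shown.

```python
import random
import tempfile
from pathlib import Path


def write_mock_tse(path: Path, duration_s: float, seed: int = 0) -> Path:
    """Write a synthetic annotation file: background, one seizure, background."""
    rng = random.Random(seed)  # seeded so test fixtures are reproducible
    # Place the seizure strictly inside the recording.
    start = round(rng.uniform(0.1, 0.5) * duration_s, 4)
    stop = round(rng.uniform(0.6, 0.9) * duration_s, 4)
    segments = [
        (0.0, start, "bckg"),
        (start, stop, "seiz"),
        (stop, duration_s, "bckg"),
    ]
    lines = ["version = tse_v1.0.0", ""]
    lines += [f"{a:.4f} {b:.4f} {label} 1.0000" for a, b, label in segments]
    path.write_text("\n".join(lines) + "\n")
    return path


if __name__ == "__main__":
    out_dir = Path(tempfile.mkdtemp())
    mock = write_mock_tse(out_dir / "mock_s001_t000.tse", duration_s=60.0)
    print(mock.read_text())
```

Seeding the random generator keeps the mock annotations identical across runs, so unit tests can assert on exact window labels.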

Files to Review:

PyHealth/examples/reproduction_seizure_detection/process_TUH_dataset.py: Main processing logic.
PyHealth/examples/reproduction_seizure_detection/testcases.py: Mock data generator & unit tests.
PyHealth/examples/reproduction_seizure_detection/README.md: Usage documentation.

How to Test: You can verify the pipeline using the included mock data generator:

# 1. Generate mock EDF/TSE data
python examples/reproduction_seizure_detection/testcases.py --generate_mock_data

# 2. Run the processing pipeline on the mock data
python examples/reproduction_seizure_detection/testcases.py

Dec 07 '25 05:12 Rohit-R-Rao