Add TUH EEG Data Preprocessing Contribution with Test Cases
Contributors: Rohit Rao - rohit8
Type: Unique Contribution (Preprocessing Pipeline / Reproduction Example)
Description: This PR introduces a complete preprocessing pipeline for the TUH EEG Seizure Corpus (v2.0.3). It serves as a reproduction of the data processing steps used in Lee et al. (2022) ("Real-Time Seizure Detection using EEG") to prepare raw clinical EEG data for deep learning models.
Why this is a unique contribution: Unlike standard PyHealth dataset classes that often expect pre-cleaned data, this contribution provides a reusable pipeline for raw EDF ingestion. It handles:
- Bipolar Montage Conversion: The standard clinical preprocessing step for seizure detection.
- Signal Resampling & Alignment: Matching variable sampling rates to model requirements.
- Annotation Parsing: Aligning .tse seizure labels with signal windows.
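As a rough illustration of two of the steps listed above (bipolar montage conversion and annotation alignment), here is a minimal sketch. It is not the code in process_TUH_dataset.py: the function names, the four electrode pairs, and the fixed window length are assumptions made for the example, and the .tse layout is assumed to be one `start stop label confidence` line per event after a `version = ...` header.

```python
import numpy as np

# Hypothetical subset of a bipolar montage: each derived channel is first - second.
BIPOLAR_PAIRS = [("FP1", "F7"), ("F7", "T3"), ("FP2", "F8"), ("F8", "T4")]


def to_bipolar(signals, pairs=BIPOLAR_PAIRS):
    """Build bipolar channels by subtracting referential channels pairwise.

    `signals` maps an electrode name to a 1-D array of samples; all arrays
    are assumed to share the same length and sampling rate.
    """
    return np.stack([signals[a] - signals[b] for a, b in pairs])


def parse_tse(text):
    """Parse 'start stop label confidence' lines into (start, stop, label)."""
    events = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skips blank lines and the 'version = ...' header
        start, stop, label, _confidence = parts
        events.append((float(start), float(stop), label))
    return events


def label_windows(events, n_windows, window_sec, background="bckg"):
    """Return 1 for every window that overlaps a non-background event."""
    labels = np.zeros(n_windows, dtype=int)
    for start, stop, label in events:
        if label == background:
            continue
        first = int(start // window_sec)
        last = int(np.ceil(stop / window_sec))
        labels[first:min(last, n_windows)] = 1
    return labels
```

Labeling a window as seizure when it overlaps any seizure event is one possible rule; a stricter rule (for example, majority overlap) may suit some models better.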
Note on Testing & Data Privacy:
This contribution covers only the initial processing of the TUH dataset. I cannot include real sample data because of the data provider's privacy terms and the size of the corpus (>100 GB).
To work around this, I have created a mock data generator (testcases.py) that creates synthetic EDF files and annotations in the exact TUSZ format.
Files to Review:
- PyHealth/examples/reproduction_seizure_detection/process_TUH_dataset.py: Main processing logic.
- PyHealth/examples/reproduction_seizure_detection/testcases.py: Mock data generator & unit tests.
- PyHealth/examples/reproduction_seizure_detection/README.md: Usage documentation.
How to Test: You can verify the pipeline using the included mock data generator:
# 1. Generate mock EDF/TSE data
python examples/reproduction_seizure_detection/testcases.py --generate_mock_data
# 2. Run the processing pipeline on the mock data
python examples/reproduction_seizure_detection/testcases.py