Add COVID Red Dataset
Add COVID-RED Dataset, Detection/Prediction Tasks, and Example
Summary
This PR adds support for the COVID-RED (Remote Early Detection of SARS-CoV-2 infections) dataset to PyHealth, including:
- A wearable device dataset loader (COVIDREDDataset)
- Classification task functions (covidred_detection_fn, covidred_prediction_fn)
- A runnable usage example (covidred_example.py)
This provides a clinically relevant wearable device dataset for PyHealth users and supports reproducible research in early infectious disease detection using consumer wearables.
Features
1. COVIDREDDataset
- Loads wearable device data (heart rate, steps, sleep) from the COVID-RED study
- Returns unified time series format consistent with PyHealth signal datasets
- Supports multiple data splits: split="train" | "test" | "all"
- Configurable sliding window approach with the window_days parameter
- Two task modes:
  - Detection: Classify COVID-19 positive vs negative during illness period
  - Prediction: Early detection, i.e. predict COVID-19 onset before symptom appearance (1-14 days pre-symptomatic)
- Automatic train/test split with reproducible random seed
- Feature extraction from multivariate time series:
  - Resting heart rate statistics (mean, std, min, max)
  - Activity metrics (total steps, mean hourly steps)
  - Sleep metrics (duration, efficiency)
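The windowing and feature extraction described in this list can be sketched roughly as follows. This is an illustration only, not the code in pyhealth/datasets/covidred.py: the function names, the record keys (resting_hr, steps, sleep_hours, sleep_efficiency), the stride of one day, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def sliding_windows(days, window_days=7, stride=1):
    """Yield consecutive windows of `window_days` daily records (hypothetical helper)."""
    for start in range(0, len(days) - window_days + 1, stride):
        yield days[start:start + window_days]


def window_features(window):
    """Summarise one window of daily records into a flat feature dict."""
    hr = [d["resting_hr"] for d in window]
    steps = [d["steps"] for d in window]
    return {
        "hr_mean": mean(hr),
        "hr_std": pstdev(hr),  # population std; the real loader may differ
        "hr_min": min(hr),
        "hr_max": max(hr),
        "steps_total": sum(steps),
        "steps_mean_hourly": sum(steps) / (24 * len(window)),
        "sleep_duration_mean": mean(d["sleep_hours"] for d in window),
        "sleep_efficiency_mean": mean(d["sleep_efficiency"] for d in window),
    }


# Ten days of made-up data for one participant.
days = [
    {"resting_hr": 60 + i, "steps": 1000 * (i + 1),
     "sleep_hours": 7.0, "sleep_efficiency": 0.9}
    for i in range(10)
]
windows = list(sliding_windows(days, window_days=7))
features = [window_features(w) for w in windows]
```

With ten days of data and window_days=7, this produces four overlapping windows, one feature dict per window.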
2. Task Functions
covidred_detection_fn
Maps dataset samples into PyHealth task format for COVID-19 detection:
{
    "patient_id": str,
    "visit_id": str,
    "signal": Tensor(n_features × window_days),
    "label": int(0 or 1),
    "metadata": dict
}
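As a rough illustration of that output shape, the sketch below builds such a dictionary from a raw record. It is a hypothetical stand-in, not covidred_detection_fn itself: the input keys (window_start, covid_positive), the visit_id scheme, and the use of nested lists in place of a Tensor are assumptions.

```python
def detection_sample_sketch(record):
    """Hypothetical stand-in for covidred_detection_fn using plain lists."""
    signal = record["signal"]  # n_features rows, each with window_days values
    widths = {len(row) for row in signal}
    if len(widths) != 1:
        raise ValueError("all feature rows must span the same number of days")
    return {
        "patient_id": str(record["patient_id"]),
        # Assumed visit_id scheme: patient id plus window start date.
        "visit_id": f'{record["patient_id"]}_{record["window_start"]}',
        "signal": signal,
        "label": int(bool(record["covid_positive"])),
        "metadata": {"window_days": widths.pop()},
    }


# Made-up record: two features (resting HR, steps) over a three-day window.
record = {
    "patient_id": 17,
    "window_start": "2021-03-01",
    "signal": [[61, 62, 60], [4000, 5200, 3100]],
    "covid_positive": True,
}
sample = detection_sample_sketch(record)
```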
covidred_prediction_fn
Maps dataset samples for early COVID-19 prediction (pre-symptomatic detection):
- Identifies patterns 1-14 days before symptom onset
- Critical for early intervention and transmission reduction
- Same output format as detection task
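One way the 1-14 day pre-symptomatic labeling could work is sketched below. The function name, its parameters, and the choice to measure the lead time from the last day of the window are assumptions for illustration; the PR's actual labeling rule may differ.

```python
from datetime import date


def presymptomatic_label(window_end, symptom_onset, min_lead=1, max_lead=14):
    """Return 1 if the window ends 1-14 days before symptom onset, else 0.

    `symptom_onset` is None for participants who never reported symptoms.
    """
    if symptom_onset is None:
        return 0
    lead_days = (symptom_onset - window_end).days
    return int(min_lead <= lead_days <= max_lead)


onset = date(2021, 3, 15)
label_5_days_before = presymptomatic_label(date(2021, 3, 10), onset)
label_on_onset_day = presymptomatic_label(date(2021, 3, 15), onset)
label_23_days_before = presymptomatic_label(date(2021, 2, 20), onset)
label_no_symptoms = presymptomatic_label(date(2021, 3, 10), None)
```

Windows that end on or after the onset date fall outside the prediction task under this rule and would belong to the detection task instead.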
covidred_multiclass_fn (optional extension)
Extends to multiclass severity classification:
- 0: COVID-19 negative
- 1: Mild (recovered at home, no assistance)
- 2: Moderate (recovered at home with assistance)
- 3: Severe (hospitalized)
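The four classes above could be encoded as in the following sketch. The class values come from the list; the Severity enum, the severity_label helper, and its boolean inputs are hypothetical names introduced only for this example.

```python
from enum import IntEnum


class Severity(IntEnum):
    NEGATIVE = 0
    MILD = 1
    MODERATE = 2
    SEVERE = 3


def severity_label(covid_positive, hospitalized=False, needed_assistance=False):
    """Map recovery information onto the four severity classes."""
    if not covid_positive:
        return Severity.NEGATIVE
    if hospitalized:
        return Severity.SEVERE
    if needed_assistance:
        return Severity.MODERATE
    return Severity.MILD


labels = [
    int(severity_label(False)),
    int(severity_label(True)),
    int(severity_label(True, needed_assistance=True)),
    int(severity_label(True, hospitalized=True)),
]
```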
3. Example Script
- Demonstrates complete pipeline: loading → task definition → LSTM classifier training
- Implements bidirectional LSTM with attention to temporal patterns
- Includes proper evaluation metrics (accuracy, precision, recall, F1, AUC)
- Handles class imbalance with weighted loss
- Saves best model based on F1-score
- Serves as a minimal reproducible example for users
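For readers unfamiliar with weighted losses, the sketch below shows one common way to derive class weights (inverse class frequency) and to compute the binary F1-score used for model selection. It is written in plain Python and is an assumption about the approach, not a copy of covidred_example.py. In a PyTorch training loop, weights like these would typically be passed to the loss through its weight argument.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up imbalanced labels: nine negatives, one positive.
weights = inverse_frequency_weights([0] * 9 + [1])
f1 = f1_score_binary([1, 1, 0, 0], [1, 0, 1, 0])
```

The rare positive class receives a weight nine times larger than the negative class here, so misclassified positives contribute more to the loss.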
Dataset Details
Dataset: COVID-RED - Remote Early Detection of SARS-CoV-2 infections
Source: Utrecht University, Netherlands
DOI: 10.34894/FW9PO7
URL: https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/FW9PO7
Data characteristics:
- Wearable device measurements (Fitbit, Apple Watch, Garmin)
- Daily aggregated metrics: heart rate, steps, sleep
- COVID-19 test results and symptom onset dates
- Longitudinal data across pandemic period
- Focus on pre-symptomatic and asymptomatic detection
Clinical significance:
- Early detection 1-14 days before symptom onset
- Enables early intervention and isolation
- Reduces community transmission
- Demonstrates utility of consumer wearables for public health surveillance
Tests
Basic verification performed:
- Dataset loads correctly from CSV files
- Train/test split works as expected (70/30 split)
- Both detection and prediction task functions output PyHealth-compliant dictionaries
- Example script runs end-to-end (CPU/GPU tested)
- Feature extraction handles missing values appropriately
- Label distribution matches expected class imbalance
- LSTM model architecture validated with sample data
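A reproducible 70/30 split of the kind verified above can be sketched as follows. Splitting at the participant level (so that no participant appears in both sets), the split_patients name, and the default seed of 42 are assumptions for this example, not details confirmed by the PR.

```python
import random


def split_patients(patient_ids, train_fraction=0.7, seed=42):
    """Shuffle unique patient ids with a fixed seed and cut at `train_fraction`."""
    ids = sorted(set(patient_ids))  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(ids)
    cut = round(train_fraction * len(ids))
    return ids[:cut], ids[cut:]


patient_ids = [f"p{i}" for i in range(10)]
train_ids, test_ids = split_patients(patient_ids)
train_again, test_again = split_patients(reversed(patient_ids))
```

Using a local random.Random instance keeps the split independent of any other random state in the program.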
Note on Dataset Download
The COVID-RED dataset must be manually downloaded from DataverseNL.
Users must:
- Visit: https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/FW9PO7
- Download required files:
  - heart_rate.csv - Daily resting heart rate measurements
  - steps.csv - Daily step counts
  - sleep.csv - Daily sleep duration and efficiency
  - labels.csv - COVID-19 test results and symptom dates
- Place files in a directory (e.g., /data/covidred/)
- Initialize dataset:
from pyhealth.datasets import COVIDREDDataset
dataset = COVIDREDDataset(root="/data/covidred/", split="train", task="prediction")
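Because the download is manual, it can help to check the directory before constructing the dataset. The helper below is a hypothetical convenience, not part of this PR; it only assumes the four file names from the download list above.

```python
import tempfile
from pathlib import Path

# File names taken from the download list above.
REQUIRED_FILES = ("heart_rate.csv", "steps.csv", "sleep.csv", "labels.csv")


def missing_covidred_files(root):
    """Return the required CSV names that are not present under `root`."""
    root = Path(root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Demo against a throwaway directory that holds only two of the four files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("heart_rate.csv", "labels.csv"):
        (Path(tmp) / name).touch()
    missing = missing_covidred_files(tmp)
```

An empty list means all four files are in place and the directory can be passed as root.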
Usage Example
from pyhealth.datasets import COVIDREDDataset
from pyhealth.tasks import covidred_prediction_fn
from torch.utils.data import DataLoader
# Load dataset for early COVID-19 prediction
dataset = COVIDREDDataset(
    root="/path/to/covidred",
    split="train",
    window_days=7,
    task="prediction",
)
# Apply task function
samples = [covidred_prediction_fn(dataset[i]) for i in range(len(dataset))]
# Create dataloader
dataloader = DataLoader(samples, batch_size=32, shuffle=True)
# Train your model
for batch in dataloader:
    signals = batch["signal"]  # Shape: (batch_size, n_features, window_days)
    labels = batch["label"]  # Shape: (batch_size,)
    # ... training code
Files Changed
This PR adds three new files to PyHealth:
- pyhealth/datasets/covidred.py - Dataset loader class
- pyhealth/tasks/covidred.py - Task functions for COVID-19 detection/prediction
- examples/covidred_example.py - Complete usage example with LSTM classifier
Citation
If you use this dataset implementation, please cite the original COVID-RED study:
@data{FW9PO7_2021,
  author = {Olthof, A.W. and Schut, A. and van Beijnum, B.F. and others},
  publisher = {DataverseNL},
  title = {{Remote Early Detection of SARS-CoV-2 infections (COVID-RED)}},
  year = {2021},
  version = {V1},
  doi = {10.34894/FW9PO7},
  url = {https://doi.org/10.34894/FW9PO7}
}