DEPRECATED: This repository was used to report early results from the COVI contact tracing project. It is no longer maintained.

Please see the current simulator here, and/or the machine learning repo here.

Supervised COVID-19 risk prediction

As part of a project for creating a COVID-19 risk-management app, we have created a supervised learning dataset for predicting individuals' level of risk of infection, as well as their source of infection, from features of individuals (e.g. pre-existing medical conditions) and features of encounters between individuals. The dataset is output by a city-level simulator (a stochastic agent-based model).

The goal of providing this dataset is to find machine learning models (or any method!) which can do a good job of predicting risk and sources of infection from the provided features. The features are constrained by many concerns about privacy and security, making ordinary contact-tracing impracticable; this is why we need to train the predictors on simulated data. The simulated data is parsed to 'look like' the real data that would eventually be gathered by the app. The best risk estimator(s) will be used in an app to provide personalized recommendations and interventions. There is potential for these targetted interventions to reduce the spread of COVID-19 much more effectively than generic social distancing or other measures.

This repo contains pytorch dataloaders and a Transformer model; you can start from these and replace the Transformer with your own model, or use them as inspiration for development with another framework. Upload your results to the table by making a PR (details below).

IMPORTANT: Do not train/tune on the test set, optimize for any of the metrics, or otherwise attempt to "cheat" at the task. This is not a contest. This project has real-world applications; under-estimating risk due to poor generalization/over-fitting could be dangerous. We will keep a private test set to check for this, but is extremely important to use all machine learning best-practices, and it is everyone's individual responsibility to do so to the best of their ability.

Quick Start / Overview

Clone or fork this repo
Download some data (here's a train/val of 1k people for 60 days)
Extract the data to a folder called data inside the repo : unzip covi-1k-04-27.zip data
Install dependencies (see below) and mkdir exp
Run the transformer on CPU to make sure everything is working python train.py exp/MY-CTT-EXPERIMENT-0 --inherit base_config/CTT-0 --config.device cpu
Replace the transformer model with your own and start experimenting! You are also welcome to change/improve the transformer model.
Upload your public test-set results to the results table below by making a PR to this repo

More information

Dataset details

Dataset ID	Train/Val	Public Test	Train Seeds	Val Seeds	Test Seeds	Clustering Type	Target Risk Predictor	Simulator Version	Risk Prediction Version	Mobility Level	App Adoption	Population	Duration (days)
V1-1K	Train/Val	Public Test	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	e302ecfa1fe	bbb4124b	Low	50%	1,000	60
V1-1K	Train/Val	Public Test	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	e302ecfa1fe	bbb4124b	Low	100%	1,000	60
V1-1K	Train/Val	Public Test	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	e302ecfa1fe	bbb4124b	High	50%	1,000	60
V1-1K	Train/Val	Public Test	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	e302ecfa1fe	bbb4124b	High	100%	1,000	60
V1-50K	TBD	TBD	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	TBD	TBD	Low	50%	50,000	60
V1-50K	TBD	TBD	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	TBD	TBD	Low	100%	50,000	60
V1-50K	TBD	TBD	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	TBD	TBD	High	50%	50,000	60
V1-50K	TBD	TBD	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	TBD	TBD	High	100%	50,000	60
V2-1K	TBD	TBD	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	TBD	TBD	Low	50%	1,000	60
V2-1K	TBD	TBD	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	TBD	TBD	Low	100%	1,000	60
V2-1K	TBD	TBD	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	TBD	TBD	High	50%	1,000	60
V2-1k	TBD	TBD	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	TBD	TBD	High	100%	1,000	60
V2-50k	TBD	TBD	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	TBD	TBD	Low	50%	50,000	60
V2-50k	TBD	TBD	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	TBD	TBD	Low	100%	50,000	60
V2-50k	TBD	TBD	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	TBD	TBD	High	50%	50,000	60
V2-50k	TBD	TBD	0-6	7-9	10-12	Heuristic	Naive First-Order Contact Tracing	TBD	TBD	High	100%	50,000	60

Extract the provided zip file into \data.

unzip <data-file-name>.zip data

Dependencies

To install the dependencies, simply pip install -r requirements.txt and you're all set.

Train the transformer model

Make an experimental directory: mkdir exp

Run the training script, logging to exp/:

python train.py exp/MY-CTT-EXPERIMENT-0 --inherit base_config/CTT-0

where MY-CTT-EXPERIMENT-0 is the experiment directory and you may call it anything that makes sense to you.

The above command will start training a transformer model on a GPU if available, and dump the tensorboard logs in exp/MY-CTT-EXPERIMENT-0/Logs. If you have set up Weights and Biases, simply append --config.wandb.use True to the command. If you do not want to use GPU even if you have one available, append --config.device cpu.

To get a sense of the hyperparameters you can tune, take a look inside base_configs/CTT-0/Configurations/train_config.yml.

Train your own model

Replace the models.py with your own if you want to use this code as a scaffold. Feel free to use only the data loaders and metrics and write your own main loop etc., but we may be slower to evaluate your PR the more different it is from this code.

Task Details

This is framed as a supervised learning task, with the following inputs, targets, and metrics:

Inputs and targets are described for 1 data example (1 person), from the point of view of Alice, who has encounters with many Bobs.

Input:

reported_symptoms: (14, N_s) array where N_s is the number of possible symptoms. Each (i,j) element of the array is a binary indicator of whether Alice had symptom j at day t-i
test_results (14) array of values {-1,0,1} where the ith element indicates the test result at day t-i. A value of -1 means tested negative, 0 means not tested, and 1 means tested positive.
candidate_encounters: (N_e, 3) array where N_e is the number of encounters in the past 14 days. For each encounter, the 3 dimensions are:
- ID: an integer indicating the identity of the Bob in the encounter (estimated by a clustering algorithm)
- encounter_risk: [0-15] the discretized risk level of the Bob in the encounter, 0 is the lowest level of risk and 15 is the highest. These risks are estimated by a model; current datasets use a naive contact-tracing calculation based on the number of Bobs encountered.
- day [0-13] day of the encounter, 0 being today and 13 being 14 days ago

Targets:

Classification of infection status:
- Binary variable 0/1 of whether Alice is infected
- (N_e) binary array 0/1 for each encounter, where the target is 1 if Alice was infected by that Bob encounter and 0 if they are were not
Regression of personal infectiousness: (14) array of floats of Alice's infectiousness for each of the past 14 days

Metrics:

L: Lift is the gain in predictive power of the top 1% over taking a random sample, i.e. the # infected in top 1% divided by 1% of (#infected / population size)
P: Precision is of the top 1% of highest-risk people, what % are correctly identified as being infected
P-U: Precision-Untested is of the top 1% of highest-risk people, excluding those who have a positive test, what % are correctly identified as being infected
P-A: Precision-Asymptomatic is of the top 1% of highest-risk people, excluding those who have a positive test and those who have symptoms, what % are correctly identified as being infected
R: Recall is what % of those infected are correctly identified as being infected
R-U: Recall-Untested is what % of those infected are correctly identified as being infected, among people who have not been tested
R-A Recall-Asymptomatic is what % of those infected are correctly identified as being infected, among people who are asymptomatic
MSE: is Mean Squared Error, between the target risk and the prediction for the infectiousness of each person. (Possibly N/A for non-ML methods)
MRR: is Mean Reciprocal Rank TODO

Results

V1-1K

Model Name	Brief description	ML?	P	P-U	P-A	R	R-U	R-A	MSE	MRR
Naive Contact Tracing	Simple risk calculation based on number of contacts	No	-	-	-	-	-	-	-	-
Transformer	Uses attention over last 14 days of encounters	Yes	-	-	-	-	-	-	-	-

Reporting Results

To report results in the leaderboard submit a pull request from your repo to the master branch of this repo:

Place your row at the appropriate height so that the table is sorted by performance on the first metric (Precision)
You must fill all fields in the leaderboard row (except metrics which do not apply to your method:
- Model name (which is a link to your repo)
- One-line description of your model
- Whether your method employs machine learning (yes/no)
- Metrics (all Precision and Recalls)
Make sure your repo has a brief description of your model in the README.md
The repo making the PR should contain all of your code, which must be open-source (not private)
Tag @teganmaharaj and @nasimrahaman as reviewers of your PR

Links to other parts of this risk-management project

The next stage for a successful risk predictor is to be integrated into the loop of the simulator to test different intervention strategies based on the predicted risk. There is also the potential to jointly train both the simulator and risk predictor. Projects linked below explore both of these possibilities.

Simulator: Generates the underlying data, using an epidemiolgically-informed agent-based model written in Simpy
Data parser: Takes the logs from the Simulator and generates the supervised learning dataset, masking/discretizing some values in accordance with privacy concerns.
Transformer model: Full version of the model the example code in here is based on
GraphNN model: Implementation of a graph neural network for joint risk prediction and simulator optimizatio
coVAEd model: Variational auto-encoder approach for joint risk prediction and model optimization
Decentralized Loopy Belief: Training of a graphical model of disease propagation via message-passing

covid_p2p_risk_prediction
covid_p2p_risk_prediction copied to clipboard

Metadata

DEPRECATED: This repository was used to report early results from the COVI contact tracing project. It is no longer maintained.

Supervised COVID-19 risk prediction

Quick Start / Overview

More information

Dataset details

Dependencies

Train the transformer model

Train your own model

Task Details

Results

V1-1K

Reporting Results

Links to other parts of this risk-management project

← Metadata

Owner

Metadata

covid_p2p_risk_prediction covid_p2p_risk_prediction copied to clipboard

Metadata

DEPRECATED: This repository was used to report early results from the COVI contact tracing project. It is no longer maintained.

Supervised COVID-19 risk prediction

Quick Start / Overview

More information

Dataset details

Dependencies

Train the transformer model

Train your own model

Task Details

Results

V1-1K

Reporting Results

Links to other parts of this risk-management project

← Metadata

Owner

Metadata

covid_p2p_risk_prediction
covid_p2p_risk_prediction copied to clipboard