#deepDiagnosis
A torch package for learning diagnosis models from temporal patient data.
For more details, please see:
- Narges Razavian, Jake Marcus, David Sontag, "Multi-task Prediction of Disease Onsets from Longitudinal Lab Tests", Machine Learning for Healthcare (MLHC), 2016. http://arxiv.org/abs/1608.00647
- Narges Razavian, David Sontag, "Temporal Convolutional Neural Networks for Diagnosis from Lab Tests", ICLR 2016 Workshop track. http://arxiv.org/abs/1511.07938
#Installation:
The package has the following dependencies:
Lua/Torch: torch, nn, cunn, cutorch, gnuplot, optim, and rnn
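Assuming Torch7 is already installed and its luarocks is on your path, the Lua dependencies can usually be installed with commands along these lines (the rock names are the standard ones; adjust for your setup):
luarocks install nn
luarocks install optim
luarocks install gnuplot
luarocks install cutorch
luarocks install cunn
luarocks install rnn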
#Usage:
Run the following steps in order; the dataset-creation steps for the train/test/valid tasks are independent and can be run in parallel if you prefer.
Sample input files are provided in ./sample_python_data so you can test the package first.
1) python create_torch_tensors.py --x sample_python_data/xtrain.pkl --y sample_python_data/ytrain.pkl --task 'train' --outdir ./sampledata/
2) python create_torch_tensors.py --x sample_python_data/xtest.pkl --y sample_python_data/ytest.pkl --task 'test' --outdir ./sampledata/
3) python create_torch_tensors.py --x sample_python_data/xvalid.pkl --y sample_python_data/yvalid.pkl --task 'valid' --outdir ./sampledata/
4) th create_batches.lua --task=train --input_dir=./sampledata --batch_output_dir=./sampleBatchDir
5) th create_batches.lua --task=valid --input_dir=./sampledata --batch_output_dir=./sampleBatchDir
6) th create_batches.lua --task=scoretrain --input_dir=./sampledata --batch_output_dir=./sampleBatchDir
7) th create_batches.lua --task=test --input_dir=./sampledata --batch_output_dir=./sampleBatchDir
8) th train_and_validate.lua --task=train --input_batch_dir=./sampleBatchDir --save_models_dir=./sample_models/
Once the model is trained, run the following to get the final evaluation on the test set. (Replace "lstm2016_05_29_10_11_01" with the model directory created in step 8; training directories are timestamped.)
9) th train_and_validate.lua --task=test --validation_dir=./sample_models/lstm2016_05_29_10_11_01/
Read the following for details on how to define your cohort and task.
#Input: Input should be in one of the formats described below. A sketch of how to construct Format 1 arrays in Python follows the list of formats.
Format 1) Python numpy arrays (cPickle files are also supported) of size:
xtrain, xvalid, xtest: |labs| x |people| x |cohort time|, used for creating the input batches.
ytrain, yvalid, ytest: |diseases| x |people| x |cohort time|, used for creating the output batches and the inclusion/exclusion of each batch member.
Format 2) Python numpy arrays (cPickle files are also supported) of size:
xtrain, xvalid, xtest: |labs| x |people| x |cohort time| for the input.
ytrain, yvalid, ytest: |diseases| x |people| for the output, where there is no concept of time.
(Note that in Format 2 you can also provide per-disease exclusions for the input. If you need that version, let me know and I'll update that part immediately.)
Format 3) advanced shelve databases, for our internal use.
Please refer to https://github.com/clinicalml/ckd_progression for details.
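As referenced above, here is a minimal sketch of constructing and saving Format 1 arrays (the plain-pickle protocol and the file names are assumptions, chosen to match what steps 1-3 expect as --x/--y inputs):

```python
# Minimal sketch (assumption: plain pickled numpy arrays, named as in steps 1-3).
# Shapes follow Format 1: x is |labs| x |people| x |cohort time|,
#                         y is |diseases| x |people| x |cohort time|.
import os
import pickle
import numpy as np

n_labs, n_people, n_time, n_diseases = 15, 2000, 48, 20

x = np.random.randn(n_labs, n_people, n_time).astype(np.float32)              # lab values
y = (np.random.rand(n_diseases, n_people, n_time) > 0.95).astype(np.float32)  # disease onsets

os.makedirs('sample_python_data', exist_ok=True)
with open('sample_python_data/xtrain.pkl', 'wb') as f:
    pickle.dump(x, f)
with open('sample_python_data/ytrain.pkl', 'wb') as f:
    pickle.dump(y, f)
```

Repeat with the corresponding valid/test splits for xvalid.pkl, yvalid.pkl, xtest.pkl, and ytest.pkl.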
#Prediction Models:
Currently the following models are supported; details of the architectures are given in the papers cited below. An example training command is shown after the list.
- Logistic Regression (--model=max_logit)

- Feedforward network (--model=mlp)

- Temporal Convolutional neural network over a backward window (--model=convnet)

- Convolutional neural network over input and time dimension (--model=convnet_mix)

- Multi-resolution temporal convolutional neural network (--model=multiresconvnet)

- LSTM network over the backward window (--model=lstmlast) (note: --model=lstmall is also available, but we found that training with lstmlast gives better results)

- Ensemble of multiple models (to be added soon)
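For example, to train the temporal convolutional model, the --model flag should be combinable with the step-8 arguments. This exact command is an assumption based on the steps above; check train_and_validate.lua for the authoritative options:
th train_and_validate.lua --task=train --model=convnet --input_batch_dir=./sampleBatchDir --save_models_dir=./sample_models/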
#Synthetic Input for testing the package
You can use the following to create synthetic numpy arrays for testing the package:
python create_synthetic_data.py --outdir ./sample_python_data --N 6000 --D 15 --T 48 --O 20
This command creates three datasets (train, test, valid) in the ./sample_python_data directory, with dimensions 15 x 2000 x 48 for each input x (xtrain, xtest, xvalid) and 20 x 2000 x 48 for each outcome set y. This synthetic data corresponds to input Format 1 above. Follow steps 1-9 in the Usage section above to test with this data, and feel free to experiment with other synthetic datasets.
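Before running step 1, you can sanity-check the generated files with a small sketch like the following (file names assumed from steps 1-3; the encoding argument is only needed if the pickles were written by Python 2 cPickle):

```python
# Sketch: verify that the synthetic arrays have the expected
# |labs| x |people| x |time| and |diseases| x |people| x |time| shapes.
import pickle

for name in ['xtrain', 'ytrain', 'xvalid', 'yvalid', 'xtest', 'ytest']:
    with open('./sample_python_data/%s.pkl' % name, 'rb') as f:
        arr = pickle.load(f, encoding='latin1')  # harmless for Python 3 pickles
    print(name, arr.shape)
```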
#Citation:
@article{razavian2016temporal,
title={Multi-task Prediction of Disease Onsets from Longitudinal Lab Tests},
author={Razavian, Narges and Marcus, Jake and Sontag, David},
journal={1st Conference on Machine Learning and Health Care (MLHC)},
year={2016}
}
@article{razavian2015temporal,
title={Temporal Convolutional Neural Networks for Diagnosis from Lab Tests},
author={Razavian, Narges and Sontag, David},
journal={arXiv preprint arXiv:1511.07938},
year={2015}
}
#Bug Reports, Questions, and Contact:
For any questions, please contact Narges Razavian ([email protected] or https://github.com/narges-rzv/).