BERTMHC
BERTMHC copied to clipboard
MHC-peptide class II interaction prediction, binding, presentation
BERTMHC
Predict peptide MHC binding and presentation with BERT model. Paper: BERTMHC: Improves MHC-peptide class II interaction prediction with transformer and multiple instance learning
Licence
The code is only allowed for accedemic research. Commercial usage/research is not granted. Before using the code, please make sure you read and agree with the Licence
Installation
The package can be installed with pip
. In the root directory of this repo:
pip install .
Training and prediction
The model can be trained with bertmhc train
.
bertmhc train --help
An example input data format is provided in tests/data/{train,eval}.csv
.
Training the binding model
To train a binding model, it is important to set --alpha 0
.
See example input file tests/data/train.csv
. The required columns are [allele, mhc, peptide, label]
.
bertmhc train --lr 0.15 --batch_size 64 --alpha 0 --wd 0.0
--peplen 24 --epochs 30
--data <data folder>
--train <train.csv.gz>
--eval <eval.csv.gz>
--save <model.pt>
Training the presentation model
To train a presentation model with multiple alleles setting, the data need to be process as test/data/train.csv
.
Specifically, a group_index
column of integers and a MA
column of boolean are required.
The MA
column indicates whether the sample is from multi-allele or single allele. The group_index
column use
integer values to track which alleles belonging to the same bag. Consider the following multi-allele data:
allele1, allele2 peptide1 1
allele1, allele3 peptide2 0
It needs to be expanded as:
allele peptide masslabel MA group_index
allele1 peptide1 1 True 0
allele2 peptide1 1 True 0
allele1 peptide2 0 True 1
allele3 peptide2 0 True 1
After preparing the data, the presentation model can be trained with:
bertmhc train --lr 0.001 --batch_size 64 --alpha 0 --wd 0.0001 --deconvolution True
--metric val_ap --peplen 24 --epochs 30 --sa_epoch 15
--data <data folder>
--train <train.csv.gz>
--eval <eval.csv.gz>
--save <model.pt>
--sa_epoch
is the number of epochs to train first on the SA data only. Use this if the input data consist of both SA and MA samples
(distinguished by the MA
column in the input data).
Prediction
After model training, to predict with trained models, use bertmhc predict
. The required columns are [allele, mhc, peptide]
.
bertmhc predict --data <test.csv.gz>
--model <trained_model.pt>
--peplen 24
--batch_size 64
--task <binding,presentation> # use 'binding' or 'presentation'
--output <output.csv.gz>
Webserver
We provide a webserver to run our trained models described in the paper. To use the webserver, please read and accept the terms of use.
https://bertmhc.privacy.nlehd.de/
You can submit maximum of 5k peptides for each query. The server might return error when overloaded. Please try again later if it does not work temporarily. Please feel free to open a github issue if you think the server is not running properly.