UDSMProt
Protein sequence classification with self-supervised pretraining
UDSMProt, universal deep sequence models for protein classification
UDSMProt is an algorithm for classifying proteins based on the amino acid sequence alone. Its key component is a self-supervised pretraining step based on a language modeling task. The pretrained model is then finetuned on specific classification tasks. In our paper we considered enzyme class classification, gene ontology prediction and remote homology detection, showcasing the excellent performance of UDSMProt.
For technical details and experimental results, please refer to our paper:
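To make the language modeling task concrete, the following minimal sketch builds next-token training pairs from a protein sequence. It is an illustration only, not the repository's implementation: it assumes that each amino acid is one token, and the names `AMINO_ACIDS`, `build_vocab` and `lm_pairs` as well as the `<unk>` convention are hypothetical.

```python
# Illustrative sketch: next-token prediction targets for a protein sequence.
# Assumption: one token per amino acid; index 0 is reserved for unknown residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_vocab(alphabet=AMINO_ACIDS):
    """Return (itos, stoi): index-to-token list and token-to-index dict."""
    itos = ["<unk>"] + list(alphabet)
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi


def lm_pairs(sequence, stoi):
    """Map a sequence to (inputs, targets), where targets are inputs shifted by one."""
    ids = [stoi.get(residue, 0) for residue in sequence]
    return ids[:-1], ids[1:]


itos, stoi = build_vocab()
inputs, targets = lm_pairs("MKTAYIAK", stoi)
# At every position the model sees inputs[i] (plus its history) and has to predict targets[i].
```

No labels are needed for this step, which is why the pretraining can run on large unlabeled sequence collections before finetuning on a labeled task.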
Nils Strodthoff, Patrick Wagner, Markus Wenzel, and Wojciech Samek, UDSMProt: universal deep sequence models for protein classification, Bioinformatics 36, no. 8, 2401-2409, 2020.
@article{Strodthoff:2019universal,
author = {Strodthoff, Nils and Wagner, Patrick and Wenzel, Markus and Samek, Wojciech},
title = "{UDSMProt: universal deep sequence models for protein classification}",
journal = {Bioinformatics},
volume = {36},
number = {8},
pages = {2401-2409},
year = {2020},
month = {01},
issn = {1367-4803},
doi = {10.1093/bioinformatics/btaa003},
}
An earlier preprint of this work is also available at bioRxiv. This is the accompanying code repository, where we also provide links to pretrained language models.
Also have a look at USMPep: Universal Sequence Models for Major Histocompatibility Complex Binding Affinity Prediction, which builds on the same framework.
Dependencies
for training/evaluation: pytorch fastai fire
for dataset creation: numpy pandas scikit-learn biopython sentencepiece lxml
Installation
We recommend using conda as Python package and environment manager.
Either install the environment using the provided proteomics.yml by running conda env create -f proteomics.yml or follow the steps below:
- Create conda environment:
conda create -n proteomics
conda activate proteomics
- Install pytorch:
conda install pytorch -c pytorch
- Install fastai:
conda install -c fastai fastai=1.0.52
- Install fire:
conda install fire -c conda-forge
- Install scikit-learn:
conda install scikit-learn
- Install Biopython:
conda install biopython -c conda-forge
- Install sentencepiece:
pip install sentencepiece
- Install lxml:
conda install lxml
Optionally (for support of threshold 0.4 clusters), install cd-hit and add it to the default search path.
Data
Swiss-Prot and UniRef
- Download and extract the desired Swiss-Prot release (by default we use 2017_03) from the UniProt ftp server. Save the contained uniprot_sprot.xml as uniprot_sprot_YEAR_MONTH.xml in the ./data directory.
- Download and extract the desired UniRef release (by default we use 2017_03) from the UniProt ftp server. Save the contained uniref50.xml as uniref50_YEAR_MONTH.xml in the ./data directory. As an alternative and for full reproducibility, we also provide pickled cluster files cdhit04_uniprot_sprot_2016_07.pkl and uniref50_2017_03_uniprot_sprot_2017_03.pkl, to be placed under ./tmp_data, that avoid downloading the full UniRef file or running cd-hit.
- Or just call our provided script ./download_swissprot_uniref.sh 2017 03, which manages everything for you.
EC prediction
- Preprocessed versions of the DEEPre and ECPred datasets are already contained in the ./git_data folder of the repository.
- The custom EC40 and EC50 datasets will be created from Swiss-Prot data directly.
GO prediction
- Download the raw GO prediction data (data-2016.tar.gz) from DeepGoPlus and extract it into the ./data/deepgoplus_data_2016 folder.
Remote Homology Detection
- Download the superfamily and fold datasets and extract them into the ./data folder.
Data Preprocessing
- Run the data preparation script
cd code
./create_datasets.sh
- The output is structured as follows:
  - tok.npy: sequences as list of numerical indices (mapping is provided by tok_itos.npy)
  - label.npy (if applicable): label as list of numerical indices (mapping is provided by label_itos.npy)
  - train_IDs.npy/val_IDs.npy/test_IDs.npy: numerical indices identifying training/validation/test set by specifying rows in tok.npy
  - train_IDs_prev.npy/val_IDs_prev.npy/test_IDs_prev.npy: original non-numerical IDs for all entries that were ever assigned to the respective sets (used to obtain consistent splits for downstream tasks)
  - ID.npy: original non-numerical IDs for all entries in tok.npy
- The approach is easily extendable to further downstream classification or regression tasks. It only requires implementing a corresponding preprocessing method similar to the ones provided for the existing tasks in preprocessing_proteomics.py.
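The output files described above can be inspected with plain NumPy. The sketch below is a hedged illustration rather than code from the repository: the helper names `load_split` and `write_toy_dataset` are hypothetical, and it assumes that tok.npy is stored as an object array of variable-length index arrays, which is why `allow_pickle=True` is passed. The toy files are written to a temporary folder so that the example runs without the real datasets.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_split(folder, split="train"):
    """Return the token lists of all sequences belonging to one split."""
    folder = Path(folder)
    # Assumption: variable-length sequences are saved as an object array.
    tok = np.load(folder / "tok.npy", allow_pickle=True)
    itos = np.load(folder / "tok_itos.npy", allow_pickle=True)
    rows = np.load(folder / f"{split}_IDs.npy", allow_pickle=True)
    # Each entry of rows selects one row of tok.npy; itos maps indices back to tokens.
    return [[str(itos[i]) for i in tok[row]] for row in rows]


def write_toy_dataset(folder):
    """Write a tiny dataset in the file layout described above (toy values only)."""
    folder = Path(folder)
    tok = np.empty(3, dtype=object)
    tok[0] = np.array([0, 1])
    tok[1] = np.array([2, 2, 0])
    tok[2] = np.array([1])
    np.save(folder / "tok.npy", tok)
    np.save(folder / "tok_itos.npy", np.array(["A", "C", "D"]))
    np.save(folder / "train_IDs.npy", np.array([0, 2]))
    np.save(folder / "val_IDs.npy", np.array([1]))


with tempfile.TemporaryDirectory() as tmp:
    write_toy_dataset(tmp)
    train_sequences = load_split(tmp, "train")
    val_sequences = load_split(tmp, "val")
```

For a real dataset, `load_split` would be pointed at the corresponding working folder instead of the temporary directory. If the vocabulary contains special tokens, they will appear in the decoded lists as well.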
Basic Usage
We provide some basic usage information for the most common tasks:
- Language Model Pretraining (or skip this step and use the provided pretrained LMs, i.e. forward and backward models trained on Swiss-Prot 2017_03)
cd code
python modelv1.py language_model --epochs=60 --lr=0.01 --working_folder=datasets/lm/lm_sprot_dirty/ --export_preds=False --eval_on_val_test=True
- Finetuning for enzyme class classification (here for level 1 and the EC50 dataset, assuming the pretrained folder is located at datasets/lm/lm_sprot_uniref_fwd)
cd code
python modelv1.py classification --from_scratch=False --pretrained_folder=datasets/lm/lm_sprot_uniref_fwd --epochs=30 --metrics=["accuracy","macro_f1"] --lr=0.001 --lr_fixed=True --bs=32 --lr_slice_exponent=2.0 --working_folder=datasets/clas_ec/clas_ec_ec50_level1 --export_preds=True --eval_on_val_test=True
- Finetuning for gene ontology prediction
cd code
python modelv1.py classification --from_scratch=False --pretrained_folder=datasets/lm/lm_sprot_uniref_fwd --epochs=30 --lr=0.001 --lr_fixed=True --bs=32 --lin_ftrs=[1024] --lr_slice_exponent=2.0 --metrics=[] --working_folder=datasets/clas_go/clas_go_deepgoplus_2016 --export_preds=True --eval_on_val_test=True
- Finetuning for remote homology detection (here for superfamily level and a single dataset)
cd code
python modelv1.py classification --from_scratch=False --pretrained_folder=datasets/lm/lm_sprot_uniref_fwd --epochs=10 --bs=128 --metrics=["binary_auc","binary_auc50","accuracy"] --early_stopping=binary_auc --bs=64 --lr=0.05 --fit_one_cycle=False --working_folder=datasets/clas_scop/clas_scop0 --export_preds=True --eval_on_val_test=True
The output is logged to logfile.log in the working directory. The final results are exported for convenience as result.npy. Individual predictions, which can be used for example for ensembling forward and backward models, are exported as preds_valid.npz and preds_test.npz (if export_preds is set to true).
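As a rough illustration of how exported predictions from a forward and a backward model could be combined, the sketch below averages two arrays of class probabilities. This is one simple ensembling choice under stated assumptions, not necessarily the scheme used in the paper. The function names `ensemble_mean` and `accuracy` are hypothetical, and the toy arrays stand in for arrays read from the exported .npz files; the array names stored inside those files are not documented here and can be listed via the `.files` attribute of the object returned by `np.load`.

```python
import numpy as np


def ensemble_mean(prob_fwd, prob_bwd):
    """Average two (n_samples, n_classes) probability arrays element-wise."""
    prob_fwd = np.asarray(prob_fwd, dtype=float)
    prob_bwd = np.asarray(prob_bwd, dtype=float)
    if prob_fwd.shape != prob_bwd.shape:
        raise ValueError("forward and backward predictions must have the same shape")
    return (prob_fwd + prob_bwd) / 2.0


def accuracy(probs, targets):
    """Fraction of samples whose highest-probability class equals the target."""
    return float((probs.argmax(axis=1) == np.asarray(targets)).mean())


# Toy stand-ins for predictions of a forward and a backward model on three samples.
fwd = np.array([[0.6, 0.4], [0.3, 0.7], [0.55, 0.45]])
bwd = np.array([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]])
targets = np.array([0, 0, 1])

ensemble = ensemble_mean(fwd, bwd)
ensemble_accuracy = accuracy(ensemble, targets)
```

Averaging requires that both models were evaluated on the same samples in the same order, so both prediction files should come from runs on the same dataset split.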