EquiPNAS
pLM-informed E(3) equivariant deep graph neural networks for protein-nucleic acid binding site prediction
EquiPNAS: improved protein-nucleic acid binding site prediction using protein-language-model-informed equivariant deep graph neural networks
by Rahmatullah Roche, Bernard Moussad, Md Hossain Shuvo, Sumit Tarafder, and Debswapna Bhattacharya
published in Nucleic Acids Research
Codebase for our improved protein-nucleic acid binding site prediction approach, EquiPNAS.

Installation
1.) We recommend using a conda virtual environment to install the dependencies for EquiPNAS. The following command creates a virtual environment named 'EquiPNAS':
conda env create -f EquiPNAS_env.yml
2.) Then activate the virtual environment
conda activate EquiPNAS
3.) Download the trained models from here
- For protein-DNA binding site prediction, use models/EquiPNAS-DNA model
- For protein-RNA binding site prediction, use models/EquiPNAS-RNA model
That's it! EquiPNAS is ready to be used.
Usage
To see usage instructions, run python EquiPNAS.py -h
usage: EquiPNAS.py [-h] [--model_state_dict MODEL_STATE_DICT] [--indir INDIR] [--outdir OUTDIR] [--num_workers NUM_WORKERS]
options:
  -h, --help            show this help message and exit
  --model_state_dict MODEL_STATE_DICT
                        Saved model
  --indir INDIR         Path to input data containing distance maps and input features (default 'datasets/DNA_test_129_Preprocessing_using_AlphaFold2/')
  --outdir OUTDIR       Prediction output directory
  --num_workers NUM_WORKERS
                        Number of workers (default=4)
Here is an example of running EquiPNAS:
1.) The input target list and all input files should be inside the input preprocessing directory (examples can be found here: Preprocessing/). Detailed preprocessing instructions can be found here.
2.) Make an output directory: mkdir output
3.) Run: python EquiPNAS.py --model_state_dict models/EquiPNAS-DNA/E-l12-768.pt --indir Preprocessing/ --outdir output/
4.) The residue-level protein-DNA or protein-RNA binding site predictions are written to output/.
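The example run can also be scripted. The following is a minimal Python sketch, assuming it is executed from the repository root; the helper names `build_predict_command` and `run_prediction` are hypothetical and not part of EquiPNAS. It passes only the flags listed in the usage message.

```python
import subprocess
import sys
from pathlib import Path


def build_predict_command(model_state_dict, indir, outdir, num_workers=4):
    """Assemble the EquiPNAS.py command line as an argument list.

    Hypothetical helper: only flags shown in the usage message are used.
    """
    return [
        sys.executable, "EquiPNAS.py",
        "--model_state_dict", str(model_state_dict),
        "--indir", str(indir),
        "--outdir", str(outdir),
        "--num_workers", str(num_workers),
    ]


def run_prediction(model_state_dict, indir, outdir, num_workers=4):
    """Check the inputs, create the output directory, then run the prediction."""
    if not Path(model_state_dict).is_file():
        raise FileNotFoundError(f"model not found: {model_state_dict}")
    if not Path(indir).is_dir():
        raise NotADirectoryError(f"input directory not found: {indir}")
    Path(outdir).mkdir(parents=True, exist_ok=True)
    cmd = build_predict_command(model_state_dict, indir, outdir, num_workers)
    # check=True raises CalledProcessError if EquiPNAS.py exits with an error.
    subprocess.run(cmd, check=True)
```

Calling `run_prediction("models/EquiPNAS-DNA/E-l12-768.pt", "Preprocessing/", "output/")` would then cover the directory creation and the prediction run in one step, failing early if the model file or the input directory is missing.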
Training
For protein-DNA binding site prediction, we obtain the training targets from here, and for protein-RNA binding site prediction, we obtain the training targets from here. Our full training dataset, containing the training code, target lists, and features for both protein-DNA and protein-RNA binding site prediction, can be found here. The training procedure is as follows:
Train scripts
- Download the train scripts from here
- Extract the train scripts and move them to the current directory
tar -xzvf train_scripts.tar.gz
mv train_scripts/* .
Train model for protein-DNA binding site
To train a protein-DNA binding site prediction model on your own dataset, place the training target list and all input files inside the train data directory; the inputs can be preprocessed as described earlier here. Example train data for protein-DNA binding site prediction can be found here.
To retrain the protein-DNA binding site prediction model with our dataset, download the train features and data from here.
- Extract the train features
tar -xzvf DNA_train_data.tar.gz
- Run the train scripts:
python train_model.py --indir DNA_train_data/ --save_dir model/DNA/
The trained model will be saved inside: model/DNA/
Train model for protein-RNA binding site
To train a protein-RNA binding site prediction model on your own dataset, place the training target list and all input files inside the train data directory; the inputs can be preprocessed as described earlier here. Example train data for protein-RNA binding site prediction can be found here.
To retrain the protein-RNA binding site prediction model with our dataset, download the train features and data from here.
- Extract the train features
tar -xzvf RNA_train_data.tar.gz
- Run the train scripts:
python train_model.py --indir RNA_train_data/ --save_dir model/RNA/
The trained model will be saved inside: model/RNA/
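The two retraining runs differ only in their directory names, so they can be driven from one script. The following is a minimal sketch, assuming the train scripts and the extracted train data sit in the current directory; `TRAIN_RUNS`, `build_train_command`, and `train` are hypothetical names, and creating the save directory beforehand is a precaution rather than a documented requirement of train_model.py.

```python
import subprocess
import sys
from pathlib import Path

# Data and save directories mirroring the training commands in this section.
TRAIN_RUNS = {
    "DNA": ("DNA_train_data/", "model/DNA/"),
    "RNA": ("RNA_train_data/", "model/RNA/"),
}


def build_train_command(nucleic_acid):
    """Return the train_model.py argument list for 'DNA' or 'RNA'."""
    indir, save_dir = TRAIN_RUNS[nucleic_acid]
    return [sys.executable, "train_model.py",
            "--indir", indir, "--save_dir", save_dir]


def train(nucleic_acid):
    """Create the save directory (an assumption), then launch training."""
    _, save_dir = TRAIN_RUNS[nucleic_acid]
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if training exits with an error.
    subprocess.run(build_train_command(nucleic_acid), check=True)
```

With this in place, `train("DNA")` followed by `train("RNA")` would retrain both models in sequence.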
Test set benchmarking
For protein-DNA binding site prediction, we obtain the test targets for Test_129 from here, and for Test_181 from here. For protein-RNA binding site prediction, we obtain the test targets from here. Our full test dataset, containing the test lists and features for all the benchmarking datasets, can be found here. The procedure for test set benchmarking is as follows:
Pretrained model
- First download the trained models from here
- Extract the models
tar -xzvf models.tar.gz
Protein-DNA
Test_129
Prediction using AlphaFold2 predicted structural models
- Download the test list, data, and features from here
- Extract the features
tar -xzvf DNA_test_129_Preprocessing_using_AlphaFold2.tar.gz
- Create output prediction directory
mkdir -p outputs/DNA_test_129_predictions_using_AlphaFold2/
- Run EquiPNAS prediction using the pretrained protein-DNA model
python EquiPNAS.py --model_state_dict models/EquiPNAS-DNA/E-l12-768.pt --indir DNA_test_129_Preprocessing_using_AlphaFold2/ --outdir outputs/DNA_test_129_predictions_using_AlphaFold2/
Prediction using experimental structures
- Download the test list, data, and features from here
- Extract the features
tar -xzvf DNA_test_129_Preprocessing_using_native.tar.gz
- Create output prediction directory
mkdir -p outputs/DNA_test_129_predictions_using_native/
- Run EquiPNAS prediction using the pretrained protein-DNA model
python EquiPNAS.py --model_state_dict models/EquiPNAS-DNA/E-l12-768.pt --indir DNA_test_129_Preprocessing_using_native/ --outdir outputs/DNA_test_129_predictions_using_native/
Test_181
Prediction using AlphaFold2 predicted structural models
- Download the test list, data, and features from here
- Extract the features
tar -xzvf DNA_test_181_Preprocessing_using_AlphaFold2.tar.gz
- Create output prediction directory
mkdir -p outputs/DNA_test_181_predictions_using_AlphaFold2/
- Run EquiPNAS prediction using the pretrained protein-DNA model
python EquiPNAS.py --model_state_dict models/EquiPNAS-DNA/E-l12-768.pt --indir DNA_test_181_Preprocessing_using_AlphaFold2/ --outdir outputs/DNA_test_181_predictions_using_AlphaFold2/
Prediction using experimental structures
- Download the test list, data, and features from here
- Extract the features
tar -xzvf DNA_test_181_Preprocessing_using_native.tar.gz
- Create output prediction directory
mkdir -p outputs/DNA_test_181_predictions_using_native/
- Run EquiPNAS prediction using the pretrained protein-DNA model
python EquiPNAS.py --model_state_dict models/EquiPNAS-DNA/E-l12-768.pt --indir DNA_test_181_Preprocessing_using_native/ --outdir outputs/DNA_test_181_predictions_using_native/
Protein-RNA
Test_117
Prediction using AlphaFold2 predicted structural models
- Download the test list, data, and features from here
- Extract the features
tar -xzvf RNA_test_117_Preprocessing_using_AlphaFold2.tar.gz
- Create output prediction directory
mkdir -p outputs/RNA_test_117_predictions_using_AlphaFold2/
- Run EquiPNAS prediction using the pretrained protein-RNA model
python EquiPNAS.py --model_state_dict models/EquiPNAS-RNA/E-l12-768.pt --indir RNA_test_117_Preprocessing_using_AlphaFold2/ --outdir outputs/RNA_test_117_predictions_using_AlphaFold2/
Prediction using experimental structures
- Download the test list, data, and features from here
- Extract the features
tar -xzvf RNA_test_117_Preprocessing_using_native.tar.gz
- Create output prediction directory
mkdir -p outputs/RNA_test_117_predictions_using_native/
- Run EquiPNAS prediction using the pretrained protein-RNA model
python EquiPNAS.py --model_state_dict models/EquiPNAS-RNA/E-l12-768.pt --indir RNA_test_117_Preprocessing_using_native/ --outdir outputs/RNA_test_117_predictions_using_native/
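All six benchmark runs in this section follow the same naming pattern, so they can be enumerated in a loop. The following is a minimal sketch, assuming the models and feature archives have already been downloaded and extracted into the current directory; `TEST_SETS`, `benchmark_runs`, and `run_all` are hypothetical names, and skipping a test set whose features are missing is a design choice of this sketch.

```python
import subprocess
import sys
from pathlib import Path

# (test set, pretrained model) pairs named in this section.
TEST_SETS = [
    ("DNA_test_129", "models/EquiPNAS-DNA/E-l12-768.pt"),
    ("DNA_test_181", "models/EquiPNAS-DNA/E-l12-768.pt"),
    ("RNA_test_117", "models/EquiPNAS-RNA/E-l12-768.pt"),
]
STRUCTURE_SOURCES = ["AlphaFold2", "native"]


def benchmark_runs():
    """Yield (indir, outdir, command) for every test set and structure source."""
    for name, model in TEST_SETS:
        for source in STRUCTURE_SOURCES:
            indir = f"{name}_Preprocessing_using_{source}/"
            outdir = f"outputs/{name}_predictions_using_{source}/"
            cmd = [sys.executable, "EquiPNAS.py",
                   "--model_state_dict", model,
                   "--indir", indir, "--outdir", outdir]
            yield indir, outdir, cmd


def run_all():
    """Run each benchmark whose features are present; skip the rest."""
    for indir, outdir, cmd in benchmark_runs():
        if not Path(indir).is_dir():
            print(f"skipping {indir}: features not extracted")
            continue
        Path(outdir).mkdir(parents=True, exist_ok=True)
        # check=True stops the loop at the first failing run.
        subprocess.run(cmd, check=True)
```

Calling `run_all()` from the repository root would then write each set of predictions to its own directory under outputs/.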