PSC-CPI
Code for AAAI 2024 paper "PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for Efficient and Generalizable Compound-Protein Interaction Prediction"
Multi-Scale Protein Sequence-Structure Contrasting for Compound-Protein Interaction Prediction (PSC-CPI)
This is a PyTorch implementation of Protein Sequence-Structure Contrasting for CPI Prediction (PSC-CPI). The code includes the following modules:
- Dataset loader (train/val/test)
- Four evaluation settings: Seen-Both, Unseen-Compound, Unseen-Protein, and Unseen-Both
- Four evaluation metrics: CPI pattern prediction (AUPRC and AUROC) and CPI strength prediction (RMSE and PPCs)
- Pre-training, fine-tuning, and inference paradigm
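The four evaluation settings differ in whether the compound and the protein of a test pair also occur in the training data. The sketch below illustrates that bucketing under stated assumptions: pairs are `(compound_id, protein_id)` tuples, and the names `assign_setting` and `group_test_pairs` are hypothetical helpers, not functions from this repository's data loader.

```python
from collections import defaultdict


def assign_setting(pair, train_compounds, train_proteins):
    """Return the evaluation setting for one (compound, protein) test pair."""
    compound, protein = pair
    seen_compound = compound in train_compounds
    seen_protein = protein in train_proteins
    if seen_compound and seen_protein:
        return "Seen-Both"
    if seen_protein:       # protein is known, compound is new
        return "Unseen-Compound"
    if seen_compound:      # compound is known, protein is new
        return "Unseen-Protein"
    return "Unseen-Both"


def group_test_pairs(train_pairs, test_pairs):
    """Group test pairs by setting, based on what the training pairs contain."""
    train_compounds = {c for c, _ in train_pairs}
    train_proteins = {p for _, p in train_pairs}
    groups = defaultdict(list)
    for pair in test_pairs:
        setting = assign_setting(pair, train_compounds, train_proteins)
        groups[setting].append(pair)
    return dict(groups)


# Toy identifiers, for illustration only.
train = [("c1", "p1"), ("c2", "p2")]
test = [("c1", "p2"), ("c3", "p1"), ("c2", "p3"), ("c4", "p4")]
groups = group_test_pairs(train, test)
```

With the toy data, each test pair lands in a different setting, which makes it easy to check the logic before applying it to a real split.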
Main Requirements
- numpy==1.21.6
- scipy==1.7.3
- torch==1.6.0
- scikit-learn==1.0.2
The datasets used in this paper are available in:
- Pre-training, fine-tuning, and inference
- ProteinEmbed_Model() -- Learning protein sequence and structure representations
- prot_data_aug() -- Data augmentation on proteins
- loss_inter() -- Loss for CPI pattern prediction
- loss_affn() -- Loss for CPI strength prediction
- loss_contras() -- Loss for (pre-training) multi-scale contrastive learning
- data_loader() -- Load train, val, and test data (with four evaluation data splits)
- set_seed() -- Set random seeds for reproducible results
- cal_affinity_torch() -- Use PyTorch to calculate CPI affinity (RMSE and PPCs)
- cal_interaction_torch() -- Use PyTorch to calculate CPI pattern (AUPRC and AUROC)
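To make the metrics concrete, here is a dependency-free sketch of three of them. It is not the repository's `cal_affinity_torch()` or `cal_interaction_torch()`, which use PyTorch. It assumes that "PPCs" refers to the Pearson correlation coefficient and that pattern labels are binary; the function names `rmse`, `pearson`, and `auroc` are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted affinities."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This is O(n_pos * n_neg), fine for small examples.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values, for illustration only.
affinity_true = [5.0, 6.0, 7.0, 8.0]
affinity_pred = [5.5, 6.5, 7.5, 8.5]
contact_labels = [0, 0, 1, 1]
contact_scores = [0.1, 0.4, 0.35, 0.8]
```

In the toy data every prediction is offset by a constant 0.5, so the RMSE is 0.5 while the Pearson correlation is still 1; the two strength metrics capture different aspects of prediction quality.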
Running the code
Install the required dependency packages
To pre-train and fine-tune the model, please run with proper hyperparameters:
python --task_mode 0 --modality seq_str_linear --pre-train 1 --seq_encoder HRNN --str_encoder GAT
where (1) task_mode selects one of the two CPI tasks: 0 (strength prediction) or 1 (pattern prediction); (2) modality selects one of the three inference settings: 'seq_str_linear' (both modalities are provided), 'sequence' (only the sequence is provided), or 'structure' (only the structure is provided); (3) pre-train indicates whether pre-training is performed: 0 (without pre-training) or 1 (with pre-training); (4) seq_encoder is one of the four protein sequence encoders: HRNN, LSTM, bi-LSTM, or Transformer; and (5) str_encoder is one of the three protein structure encoders: GCN, GAT, or SAGE.
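The option values described above can be summarized as an `argparse` parser. This is a sketch under assumptions: it is not known from this README whether the repository uses `argparse`, and the helper name `build_parser` and the help strings are illustrative. Only the flag names and allowed values come from the description above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options described in this README."""
    parser = argparse.ArgumentParser(description="PSC-CPI options (illustrative sketch)")
    parser.add_argument("--task_mode", type=int, choices=[0, 1],
                        help="0: strength prediction, 1: pattern prediction")
    parser.add_argument("--modality", type=str,
                        choices=["seq_str_linear", "sequence", "structure"],
                        help="which protein modalities are provided at inference")
    # argparse stores "--pre-train" under the attribute name "pre_train".
    parser.add_argument("--pre-train", type=int, choices=[0, 1],
                        help="0: without pre-training, 1: with pre-training")
    parser.add_argument("--seq_encoder", type=str,
                        choices=["HRNN", "LSTM", "bi-LSTM", "Transformer"],
                        help="protein sequence encoder")
    parser.add_argument("--str_encoder", type=str,
                        choices=["GCN", "GAT", "SAGE"],
                        help="protein structure encoder")
    return parser


# Parse the flags from the example command shown above.
example_flags = ("--task_mode 0 --modality seq_str_linear --pre-train 1 "
                 "--seq_encoder HRNN --str_encoder GAT").split()
args = build_parser().parse_args(example_flags)
```

Restricting each flag with `choices` means an unsupported value, such as an unknown encoder name, is rejected at startup rather than partway through training.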
If you find this project useful for your research, please use the following BibTeX entry.
title={Psc-cpi: Multi-scale protein sequence-structure contrasting for efficient and generalizable compound-protein interaction prediction},
author={Wu, Lirong and Huang, Yufei and Tan, Cheng and Gao, Zhangyang and Hu, Bozhen and Lin, Haitao and Liu, Zicheng and Li, Stan Z},
journal={arXiv preprint arXiv:2402.08198},