DeepMetaPSICOV
DeepMetaPSICOV copied to clipboard
Deep ResNet-based protein contact prediction
DeepMetaPSICOV 1.1.0
Deep residual neural networks for protein contact prediction
Shaun M. Kandathil, Joe G. Greener and David T. Jones
University College London
Changes in version 1.1.0
- Changes for compatibility with PyTorch 0.4.0 and above but reading in trained model saved with 0.3.0. The 'reference' version using PyTorch 0.3 is still provided.
- Minor bugfixes in test script.
- Updated documentation.
Requirements:
-
Bash shell
-
C and C++ compilers (tested with GCC 4.4.2, 4.8.5, and 5.4.0)
-
Python 2 or 3 (preferably miniconda/anaconda, as this makes the PyTorch install much easier; tested on miniconda Python3)
-
The following Python modules:
- PyTorch >= 0.4.0 (tested on 1.1.0)
-
Third-party programs:
- HH-suite v3.0+ and a recent UniClust30 database (for making alignments; skip if you will only use pre-made alignments)
- CCMpred v0.1.0 (Need this exact version; available here)
- FreeContact 1.0.21 (available here)
- Legacy BLAST 2.2.26 (executables
blastpgpandmakemat) and a suitable non-redundant database, e.g. Uniref90, formatted usingformatdb(needed to generate PSIPRED and SOLVPRED inputs). Legacy BLAST is available here.
All other required programs written by our group are now bundled in this repo and do not need to be installed separately.
On some distributions, the C++ compiler is a separate add-on package and may not be installed by default. For example, on CentOS you will need to yum install packages gcc AND gcc-c++.
Installing PyTorch
For most conda users, conda install -c pytorch pytorch should work. Alternatively, visit https://pytorch.org/get-started/locally/
Setup and testing:
Build bundled programs
cd src/; make; make install
Specify paths to external dependencies
Edit run_DMP.sh to indicate the paths to the third-party programs listed above, as well as other variables such as the number of threads to use for various programs. User-editable variables are demarcated by comment lines. We do not recommend changing anything outside this region unless you know what you are doing.
Testing
cd test; ./testDMP.sh
The script will use the configuration you have provided in run_DMP.sh and run a test contact prediction.
NB: the test script will not run PSI-BLAST or HHBlits; the script runs only the remaining parts of the DMP pipeline (using running Option 4 below).
Different versions of OSs, compilers etc. can lead to differing contact scores (as well as outputs from the feature generation programs), so we only test the ranking of the top-L predicted contacts against a reference output.
Running:
Run /path/to/run_DMP.sh -h to see the available options. DMP runs a number of programs to generate input features; their outputs are stored in a number of intermediate files. By default, DMP will attempt to reuse any files with the correct filenames (this is useful for debugging and allows you to 'continue' a failed run). You can force regeneration of intermediate files with the --force option.
At a minimum, you must provide the path to a FASTA-formatted target sequence in order to run DMP. There are a few different ways to run it:
Option 1: From sequence only (requires legacy BLAST and HHblits):
/path/to/run_DMP.sh -i input.fasta
Option 2: From sequence and pre-made alignment in PSICOV format (requires legacy BLAST only):
/path/to/run_DMP.sh -i input.fasta -a input.aln
Option 3: From sequence, and PSSM in legacy BLAST makemat format (requires HHblits only):
/path/to/run_DMP.sh -i input.fasta -m input.mtx
Option 4: From sequence, pre-made alignment, and PSSM in legacy BLAST makemat format (does not require BLAST or HHblits):
/path/to/run_DMP.sh -i input.fasta -a input.aln -m input.mtx
Output:
The primary output from the script is a file with the extension .con that has the contact predictions in CASP format, without headers and footers.
A number of intermediate files are also generated during a run, corresponding to the outputs from the various input feature generation programs. These are then composed into a few more files so that the data are in the correct format for the ResNet to use. These latter files have the extensions .map, .21c and .fix.
By default, all of these intermediate files are retained after the run has completed; this behaviour can be changed by specifying --cleanup to run_DMP.sh. When --cleanup is specified, only the PSICOV-formatted MSA (extension .aln), the PSI-BLAST PSSM (extension .mtx) and the .con file are retained. The input .fasta sequence file is never deleted.
Citing:
If you find DeepMetaPSICOV useful, please cite our paper in Proteins: https://onlinelibrary.wiley.com/doi/full/10.1002/prot.25779
FAQs:
Do I have to use the exact versions of the programs that you mention?
Yes. DMP was trained on data output by specific versions of the feature generation programs, and you need to use the same versions during inference.
Your paper mentions a bug in one of the feature generation programs; is this release affected?
The version of alnstats in this repository is not affected by the bug in question. Since we verified that the training of DMP did not suffer from the bug, we are releasing the bug-free version for inference. If for any reason you'd like the buggy version of alnstats, do get in touch.
The version of CCMpred you use is ancient!
We know. Keen users will also have spotted that we use a number of input features in common with MetaPSICOV. We wanted to assess whether we improve over MetaPSICOV, DeepCov etc. using exactly the same training data, where possible.