Local generation of .m8 file causes wrong template selection in colabfold_batch
Expected Behavior
This is my input.csv file:
id,sequence
heterodimer_2,MAAEAWRSRFRERVVEAAERWESVGESLATALTHLKSPMHAGDEEEAAAARTRIQLAMGELVDASRNLASAMSLMKVAELLALHGGSVNPSTHLGEISLLGDQYLAERNAGIKLLEAGKDARKAYISVDGCRGNLDAILLLLDHPRVPCVDDFIEEELFVAGDNLQGAIGNAKLGTERAVGARQDVSGAN:MDAAVAGQHARRRIRPPEPLVMAGSPSTPAAFRCPISLEVMRSPVSLPTGATYDRASIQRWLDTGHRTCPATRLPLASTDLVPNLLLRRLIHLHAATLPPSPSPEVVLSQLAAAGGEPAAAEKAVRSLAAKIAPEKGKRASVASAVAADLDSAVPALLSFAKGGAGADARVDAVRILATVAPELVPYLTGDGTEKRGRVRMAVEALAAVLSADGVGEDTKEGLIAALVAGDLGHIVNTLIAAGANGVMVLETILTSPVPDADAKTAIADRSELFPDLVRILKDAASPAAIRCMAAAVQVRGRPARSSMVRAGAIPALALAVAAAPTAVAESALGLLVEAARCTDGKAAIGADAAEVAAAVMGRMIRVGPAGREFAVAVLWLSCCAGGGDRRMREAVASAPEAVGKLLVVMQGDCSPSTSRMAGELLRAVRMEQERKGLAAAYDSRTIHVMPY
Running colabfold_batch --templates --amber --use-gpu-relax input.csv output yields the following .m8 file and template_domain_names.json file:
First 10 lines of .m8:
101 7x8v_E 1.000 184 0 0 4 187 1 184 1.956E-67 228 184M
102 7c96_B 0.528 70 33 0 26 95 1 70 6.214E-22 103 70M
102 1t1h_A 0.520 75 36 0 24 98 1 75 1.968E-20 98 75M
102 1t1h_A 0.520 75 36 0 24 98 1 75 1.968E-20 98 75M
102 1t1h_A 0.520 75 36 0 24 98 1 75 1.968E-20 98 75M
102 1t1h_A 0.520 75 36 0 24 98 1 75 1.968E-20 98 75M
102 2f42_A 0.347 72 47 0 27 98 63 134 5.527E-15 81 72M
102 2oxq_D 0.333 72 48 0 27 98 1 72 1.714E-14 79 72M
102 2c2v_U 0.367 68 43 0 27 94 1 68 1.714E-14 79 68M
102 2c2v_S 0.367 68 43 0 27 94 1 68 1.714E-14 79 68M
template_domain_names.json:
{"A": ["7x8v_E", "7x8v_E"], "B": ["7c96_B", "2c2l_C", "2c2l_D", "1t1h_A", "2f42_A", "2oxq_D", "2c2v_T", "2c2v_U", "2c2v_S", "2c2v_V", "7bbd_B", "6fga_E", "5olm_B", "8a58_D", "6fga_H", "6fga_D", "6s53_H"]}
This is what I would expect, I guess.
Current Behavior
When running everything locally like this:
input_file="input.csv"
DATABASE_PATH="/data/gpfs/datasets/mmseqs/uniref30_2302"
colabfold_search \
--use-env 1 \
--use-templates 1 \
--db-load-mode 2 \
--mmseqs /apps/easybuild-2022/easybuild/software/MPI/GCC/11.3.0/OpenMPI/4.1.4/MMseqs2/15-6f452/bin/mmseqs \
--db2 pdb100_230517 \
--threads 8 \
${input_file} \
${DATABASE_PATH} \
msas
LOCALPDBPATH="/data/scratch/datasets/alphafold/v2.3.2/pdb_mmcif/mmcif_files"
RANDOMSEED=0
PDBHITFILE="heterodimer_2_pdb100_230517.m8"
# Run the colabfold_batch command
colabfold_batch \
--amber \
--templates \
--use-gpu-relax \
--pdb-hit-file msas/${PDBHITFILE} \
--local-pdb-path ${LOCALPDBPATH} \
--random-seed ${RANDOMSEED} \
msas/heterodimer_2.a3m \
output
I get the same .m8 file as above. However, the template_domain_names.json file is different and contains templates for A that are not in the .m8 file:
{"A": ["7x8v_A", "7x8v_A", "7x8v_A", "7x8v_A", "7x8v_A", "7x8v_A", "6s53_I", "6s53_K", "7bbd_D"], "B": ["1t1h_A", "2f42_A", "7c96_B", "2c2l_C", "2c2l_B", "2c2l_A", "2c2l_D", "2c2v_S", "2c2v_T", "2oxq_D", "2c2v_U", "2oxq_C", "2c2v_V", "5olm_B", "7bbd_B", "6s53_G", "6s53_A", "6fga_C", "6fga_F", "6fga_H"]}
Output
2024-02-29 10:41:22,688 Running colabfold 1.5.5 (06c775a287a891b5f8e81a88e52bcadc4dd67cd2)
2024-02-29 10:44:15,727 Running on GPU
2024-02-29 10:44:16,392 Found 9 citations for tools or databases
2024-02-29 10:44:19,723 WARNING: Found 20 models in predictions_56426093/templates/1t1h.cif. The first model will be used as a template.
2024-02-29 10:44:21,434 Query 1/1: heterodimer_2 (length 642)
2024-02-29 10:44:29,529 Sequence 0 found templates: ['7x8v_A', '7x8v_A', '7x8v_A', '7x8v_A', '7x8v_A', '7x8v_A', '6s53_I', '6s53_K', '7bbd_D']
2024-02-29 10:44:33,588 Sequence 1 found templates: ['1t1h_A', '2f42_A', '7c96_B', '2c2l_C', '2c2l_B', '2c2l_A', '2c2l_D', '2c2v_S', '2c2v_T', '2oxq_D', '2c2v_U', '2oxq_C', '2c2v_V', '5olm_B', '7bbd_B', '6s53_G', '6s53_A', '6fga_C', '6fga_F', '6fga_H']
2024-02-29 10:44:35,289 Setting max_seq=508, max_extra_seq=1690
2024-02-29 10:47:22,893 alphafold2_multimer_v3_model_1_seed_000 recycle=0 pLDDT=76.4 pTM=0.519 ipTM=0.142
2024-02-29 10:47:33,100 alphafold2_multimer_v3_model_1_seed_000 recycle=1 pLDDT=83.8 pTM=0.744 ipTM=0.823 tol=10.6
2024-02-29 10:47:43,307 alphafold2_multimer_v3_model_1_seed_000 recycle=2 pLDDT=85.1 pTM=0.759 ipTM=0.866 tol=4.78
2024-02-29 10:47:53,510 alphafold2_multimer_v3_model_1_seed_000 recycle=3 pLDDT=85.2 pTM=0.754 ipTM=0.857 tol=2.23
2024-02-29 10:48:03,729 alphafold2_multimer_v3_model_1_seed_000 recycle=4 pLDDT=85.6 pTM=0.762 ipTM=0.869 tol=0.629
2024-02-29 10:48:13,948 alphafold2_multimer_v3_model_1_seed_000 recycle=5 pLDDT=85.5 pTM=0.758 ipTM=0.867 tol=0.294
Question
The templates 6s53_I, 6s53_K and 7bbd_D should not be used for protein A, based on the .m8 file. I have not been able to fully understand how the template_domain_names.json is generated within the code. In general the template generation does not seem to be consistent. Is this something that can be solved? Is there something wrong in my approach?
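To make the mismatch easier to see, the check can be scripted. The following is only a rough sketch: it assumes that the first two columns of the .m8 file are the query ID and the target (`<pdb>_<chain>`), and that query 101 corresponds to chain A and 102 to chain B, which is my reading of the files above rather than anything documented. The function names are made up for this example. It compares by PDB entry only, because the chain letters already differ between the two files (7x8v_E vs 7x8v_A).

```python
from collections import defaultdict


def pdb_ids_per_query(m8_text):
    """Group 4-character PDB entry IDs by query ID (first column of the .m8)."""
    hits = defaultdict(set)
    for line in m8_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        query_id, target = fields[0], fields[1]
        # "7x8v_E" -> "7x8v"; the chain suffix is ignored on purpose
        hits[query_id].add(target.split("_")[0].lower())
    return dict(hits)


def unexpected_templates(m8_text, domain_names, chain_to_query):
    """Per chain, list templates whose PDB entry has no hit for that chain's query."""
    hits = pdb_ids_per_query(m8_text)
    report = {}
    for chain, templates in domain_names.items():
        allowed = hits.get(chain_to_query[chain], set())
        extra = sorted(
            {t for t in templates if t.split("_")[0].lower() not in allowed}
        )
        if extra:
            report[chain] = extra
    return report
```

To use it, read the .m8 file into a string, load template_domain_names.json with `json.load`, and call `unexpected_templates(m8_text, domain_names, {"A": "101", "B": "102"})`. Any chain that appears in the result has templates that the .m8 file does not list for that query.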
The template search is essentially redone during colabfold_batch via the same mechanism as the custom template database flag, just with a selection of .cif files provided by the .m8 file, based on the maximum number of templates set. So during the colabfold_batch run, both Sequence 0 and Sequence 1 share the same template database and have a chance to match each other's templates. I think AlphaFold was trained to deal with various degrees of template matching just fine (you reach a high ipTM value by recycle 1). I think you would have to manually change the code to use a different custom template database per unique query sequence.
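As a starting point for such a change, the .cif files could be split into one folder per query before the template search runs. This is a hedged sketch of the data preparation only. It does not touch colabfold_batch, and nothing in this thread suggests that colabfold_batch would pick up such folders without code changes. The function name, the `templates_<query>` folder naming (borrowed from the server-mode output) and the `max_entries` default are assumptions for illustration, as is the `<pdbid>.cif` file naming in the local mmCIF directory.

```python
import shutil
from collections import defaultdict
from pathlib import Path


def split_templates_by_query(m8_path, mmcif_dir, out_dir, max_entries=20):
    """Copy the mmCIF files hit by each query into its own templates_<query> folder."""
    entries = defaultdict(list)
    with open(m8_path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            query_id = fields[0]
            pdb_id = fields[1].split("_")[0].lower()  # "1t1h_A" -> "1t1h"
            # keep the first max_entries distinct PDB entries, in file order
            if pdb_id not in entries[query_id] and len(entries[query_id]) < max_entries:
                entries[query_id].append(pdb_id)

    copied = {}
    for query_id, pdb_ids in entries.items():
        target = Path(out_dir) / f"templates_{query_id}"
        target.mkdir(parents=True, exist_ok=True)
        copied[query_id] = []
        for pdb_id in pdb_ids:
            source = Path(mmcif_dir) / f"{pdb_id}.cif"
            if source.is_file():  # skip entries missing from the local mirror
                shutil.copy(source, target / source.name)
                copied[query_id].append(pdb_id)
    return copied
```

The returned dictionary maps each query ID to the PDB entries that were actually copied, so it is easy to confirm that chain A's folder only contains entries the .m8 file lists for query 101.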
@NickWoodall Thank you for the response. I can see that the template folders created by the server and by the local approach are slightly different.
Directory generated by colabfold_batch --templates --amber --use-gpu-relax input.csv output:
.
├── cite.bibtex
├── config.json
├── log.txt
├── heterodimer_2.a3m
├── heterodimer_2_coverage.png
├── heterodimer_2.done.txt
├── heterodimer_2_env
│ ├── bfd.mgnify30.metaeuk30.smag30.a3m
│ ├── msa.sh
│ ├── out.tar.gz
│ ├── pdb70.m8
│ ├── templates_101
│ │ ├── 7x8v.cif
│ │ ├── pdb70_a3m.ffdata
│ │ ├── pdb70_a3m.ffindex
│ │ ├── pdb70_cs219.ffdata
│ │ └── pdb70_cs219.ffindex -> pdb70_a3m.ffindex
│ ├── templates_102
│ │ ├── 1t1h.cif
│ │ ├── 2c2l.cif
│ │ ├── 2c2v.cif
│ │ ├── 2f42.cif
│ │ ├── 2oxq.cif
│ │ ├── 5olm.cif
│ │ ├── 6fga.cif
│ │ ├── 6s53.cif
│ │ ├── 7bbd.cif
│ │ ├── 7c96.cif
│ │ ├── 8a58.cif
│ │ ├── pdb70_a3m.ffdata
│ │ ├── pdb70_a3m.ffindex
│ │ ├── pdb70_cs219.ffdata
│ │ └── pdb70_cs219.ffindex -> pdb70_a3m.ffindex
│ └── uniref.a3m
├── heterodimer_2_pae.png
├── heterodimer_2_pairgreedy
│ ├── out.tar.gz
│ ├── pair.a3m
│ └── pair.sh
├── heterodimer_2_plddt.png
├── heterodimer_2_predicted_aligned_error_v1.json
├── heterodimer_2_relaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb
├── heterodimer_2_relaxed_rank_002_alphafold2_multimer_v3_model_3_seed_000.pdb
├── heterodimer_2_relaxed_rank_003_alphafold2_multimer_v3_model_5_seed_000.pdb
├── heterodimer_2_relaxed_rank_004_alphafold2_multimer_v3_model_2_seed_000.pdb
├── heterodimer_2_relaxed_rank_005_alphafold2_multimer_v3_model_4_seed_000.pdb
├── heterodimer_2_scores_rank_001_alphafold2_multimer_v3_model_1_seed_000.json
├── heterodimer_2_scores_rank_002_alphafold2_multimer_v3_model_3_seed_000.json
├── heterodimer_2_scores_rank_003_alphafold2_multimer_v3_model_5_seed_000.json
├── heterodimer_2_scores_rank_004_alphafold2_multimer_v3_model_2_seed_000.json
├── heterodimer_2_scores_rank_005_alphafold2_multimer_v3_model_4_seed_000.json
├── heterodimer_2_template_domain_names.json
├── heterodimer_2_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb
├── heterodimer_2_unrelaxed_rank_002_alphafold2_multimer_v3_model_3_seed_000.pdb
├── heterodimer_2_unrelaxed_rank_003_alphafold2_multimer_v3_model_5_seed_000.pdb
├── heterodimer_2_unrelaxed_rank_004_alphafold2_multimer_v3_model_2_seed_000.pdb
└── heterodimer_2_unrelaxed_rank_005_alphafold2_multimer_v3_model_4_seed_000.pdb
Directory generated by
LOCALPDBPATH="/data/scratch/datasets/alphafold/v2.3.2/pdb_mmcif/mmcif_files"
RANDOMSEED=0
PDBHITFILE="heterodimer_2_pdb100_230517.m8"
# Run the colabfold_batch command
colabfold_batch \
--amber \
--templates \
--use-gpu-relax \
--pdb-hit-file msas/${PDBHITFILE} \
--local-pdb-path ${LOCALPDBPATH} \
--random-seed ${RANDOMSEED} \
msas/heterodimer_2.a3m \
output
:
.
├── cite.bibtex
├── config.json
├── log.txt
├── heterodimer_2.a3m
├── heterodimer_2_coverage.png
├── heterodimer_2.done.txt
├── heterodimer_2_pae.png
├── heterodimer_2_plddt.png
├── heterodimer_2_predicted_aligned_error_v1.json
├── heterodimer_2_relaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb
├── heterodimer_2_relaxed_rank_002_alphafold2_multimer_v3_model_5_seed_000.pdb
├── heterodimer_2_relaxed_rank_003_alphafold2_multimer_v3_model_3_seed_000.pdb
├── heterodimer_2_relaxed_rank_004_alphafold2_multimer_v3_model_2_seed_000.pdb
├── heterodimer_2_relaxed_rank_005_alphafold2_multimer_v3_model_4_seed_000.pdb
├── heterodimer_2_scores_rank_001_alphafold2_multimer_v3_model_1_seed_000.json
├── heterodimer_2_scores_rank_002_alphafold2_multimer_v3_model_5_seed_000.json
├── heterodimer_2_scores_rank_003_alphafold2_multimer_v3_model_3_seed_000.json
├── heterodimer_2_scores_rank_004_alphafold2_multimer_v3_model_2_seed_000.json
├── heterodimer_2_scores_rank_005_alphafold2_multimer_v3_model_4_seed_000.json
├── heterodimer_2_template_domain_names.json
├── heterodimer_2_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb
├── heterodimer_2_unrelaxed_rank_002_alphafold2_multimer_v3_model_5_seed_000.pdb
├── heterodimer_2_unrelaxed_rank_003_alphafold2_multimer_v3_model_3_seed_000.pdb
├── heterodimer_2_unrelaxed_rank_004_alphafold2_multimer_v3_model_2_seed_000.pdb
├── heterodimer_2_unrelaxed_rank_005_alphafold2_multimer_v3_model_4_seed_000.pdb
└── templates
    ├── 1t1h.cif
    ├── 2c2l.cif
    ├── 2c2v.cif
    ├── 2f42.cif
    ├── 2oxq.cif
    ├── 5olm.cif
    ├── 6fga.cif
    ├── 6s53.cif
    ├── 7bbd.cif
    ├── 7c96.cif
    ├── 7x8v.cif
    ├── pdb70_a3m.ffdata
    ├── pdb70_a3m.ffindex
    ├── pdb70_cs219.ffdata
    └── pdb70_cs219.ffindex
I am still a bit concerned about reproducibility and about the wrong templates being used for a protein. Is this something that will be addressed by updates in the code or should I not be concerned?
There is one big remaining difference that we intend to address at some point in the future:
In server mode, we fetch diverse precomputed MSAs in A3M format for each template (the PDB70 hh-suite db) and do the alignment based on the query A3M vs the template A3Ms.
The current implementation of the local templates only does a query A3M vs single template sequence.
Ideally, the local template search should also fetch the A3Ms from a locally available PDB70 hh-suite db.
@milot-mirdita Do you think that difference will affect the results significantly? Do you have any timeline for when this will be addressed?
I am also interested in how similar the results would be. I have performed some analyses in online mode and want to scale them up, but I am wondering whether running locally would give results that differ too much.