Local generation of .m8 file causes wrong template selection in colabfold_batch

Open Nuta0 opened this issue 1 year ago • 5 comments

Expected Behavior

This is my input.csv file:

id,sequence
heterodimer_2,MAAEAWRSRFRERVVEAAERWESVGESLATALTHLKSPMHAGDEEEAAAARTRIQLAMGELVDASRNLASAMSLMKVAELLALHGGSVNPSTHLGEISLLGDQYLAERNAGIKLLEAGKDARKAYISVDGCRGNLDAILLLLDHPRVPCVDDFIEEELFVAGDNLQGAIGNAKLGTERAVGARQDVSGAN:MDAAVAGQHARRRIRPPEPLVMAGSPSTPAAFRCPISLEVMRSPVSLPTGATYDRASIQRWLDTGHRTCPATRLPLASTDLVPNLLLRRLIHLHAATLPPSPSPEVVLSQLAAAGGEPAAAEKAVRSLAAKIAPEKGKRASVASAVAADLDSAVPALLSFAKGGAGADARVDAVRILATVAPELVPYLTGDGTEKRGRVRMAVEALAAVLSADGVGEDTKEGLIAALVAGDLGHIVNTLIAAGANGVMVLETILTSPVPDADAKTAIADRSELFPDLVRILKDAASPAAIRCMAAAVQVRGRPARSSMVRAGAIPALALAVAAAPTAVAESALGLLVEAARCTDGKAAIGADAAEVAAAVMGRMIRVGPAGREFAVAVLWLSCCAGGGDRRMREAVASAPEAVGKLLVVMQGDCSPSTSRMAGELLRAVRMEQERKGLAAAYDSRTIHVMPY

Running colabfold_batch --templates --amber --use-gpu-relax input.csv output yields the following .m8 file and template_domain_names.json file.

First 10 lines of .m8:

101     7x8v_E  1.000   184     0       0       4       187     1       184     1.956E-67       228     184M
102     7c96_B  0.528   70      33      0       26      95      1       70      6.214E-22       103     70M
102     1t1h_A  0.520   75      36      0       24      98      1       75      1.968E-20       98      75M
102     1t1h_A  0.520   75      36      0       24      98      1       75      1.968E-20       98      75M
102     1t1h_A  0.520   75      36      0       24      98      1       75      1.968E-20       98      75M
102     1t1h_A  0.520   75      36      0       24      98      1       75      1.968E-20       98      75M
102     2f42_A  0.347   72      47      0       27      98      63      134     5.527E-15       81      72M
102     2oxq_D  0.333   72      48      0       27      98      1       72      1.714E-14       79      72M
102     2c2v_U  0.367   68      43      0       27      94      1       68      1.714E-14       79      68M
102     2c2v_S  0.367   68      43      0       27      94      1       68      1.714E-14       79      68M

template_domain_names.json: {"A": ["7x8v_E", "7x8v_E"], "B": ["7c96_B", "2c2l_C", "2c2l_D", "1t1h_A", "2f42_A", "2oxq_D", "2c2v_T", "2c2v_U", "2c2v_S", "2c2v_V", "7bbd_B", "6fga_E", "5olm_B", "8a58_D", "6fga_H", "6fga_D", "6s53_H"]}

This is to be expected I guess.

Current Behavior

When running everything locally like this:

input_file="input.csv"
DATABASE_PATH="/data/gpfs/datasets/mmseqs/uniref30_2302"

colabfold_search \
  --use-env 1 \
  --use-templates 1 \
  --db-load-mode 2 \
  --mmseqs /apps/easybuild-2022/easybuild/software/MPI/GCC/11.3.0/OpenMPI/4.1.4/MMseqs2/15-6f452/bin/mmseqs \
  --db2 pdb100_230517 \
  --threads 8 \
  ${input_file} \
  ${DATABASE_PATH} \
  msas
LOCALPDBPATH="/data/scratch/datasets/alphafold/v2.3.2/pdb_mmcif/mmcif_files"
RANDOMSEED=0
PDBHITFILE="heterodimer_2_pdb100_230517.m8"

  # Run the colabfold_batch command
  colabfold_batch \
    --amber \
    --templates \
    --use-gpu-relax \
    --pdb-hit-file msas/${PDBHITFILE} \
    --local-pdb-path ${LOCALPDBPATH} \
    --random-seed ${RANDOMSEED} \
    msas/heterodimer_2.a3m \
    output

I get the same .m8 file as above. However, the template_domain_names.json file is different and contains templates for A that are not in the .m8 file: {"A": ["7x8v_A", "7x8v_A", "7x8v_A", "7x8v_A", "7x8v_A", "7x8v_A", "6s53_I", "6s53_K", "7bbd_D"], "B": ["1t1h_A", "2f42_A", "7c96_B", "2c2l_C", "2c2l_B", "2c2l_A", "2c2l_D", "2c2v_S", "2c2v_T", "2oxq_D", "2c2v_U", "2oxq_C", "2c2v_V", "5olm_B", "7bbd_B", "6s53_G", "6s53_A", "6fga_C", "6fga_F", "6fga_H"]}

Output

2024-02-29 10:41:22,688 Running colabfold 1.5.5 (06c775a287a891b5f8e81a88e52bcadc4dd67cd2)
2024-02-29 10:44:15,727 Running on GPU
2024-02-29 10:44:16,392 Found 9 citations for tools or databases
2024-02-29 10:44:19,723 WARNING: Found 20 models in predictions_56426093/templates/1t1h.cif. The first model will be used as a template.
2024-02-29 10:44:21,434 Query 1/1: heterodimer_2 (length 642)
2024-02-29 10:44:29,529 Sequence 0 found templates: ['7x8v_A', '7x8v_A', '7x8v_A', '7x8v_A', '7x8v_A', '7x8v_A', '6s53_I', '6s53_K', '7bbd_D']
2024-02-29 10:44:33,588 Sequence 1 found templates: ['1t1h_A', '2f42_A', '7c96_B', '2c2l_C', '2c2l_B', '2c2l_A', '2c2l_D', '2c2v_S', '2c2v_T', '2oxq_D', '2c2v_U', '2oxq_C', '2c2v_V', '5olm_B', '7bbd_B', '6s53_G', '6s53_A', '6fga_C', '6fga_F', '6fga_H']
2024-02-29 10:44:35,289 Setting max_seq=508, max_extra_seq=1690
2024-02-29 10:47:22,893 alphafold2_multimer_v3_model_1_seed_000 recycle=0 pLDDT=76.4 pTM=0.519 ipTM=0.142
2024-02-29 10:47:33,100 alphafold2_multimer_v3_model_1_seed_000 recycle=1 pLDDT=83.8 pTM=0.744 ipTM=0.823 tol=10.6
2024-02-29 10:47:43,307 alphafold2_multimer_v3_model_1_seed_000 recycle=2 pLDDT=85.1 pTM=0.759 ipTM=0.866 tol=4.78
2024-02-29 10:47:53,510 alphafold2_multimer_v3_model_1_seed_000 recycle=3 pLDDT=85.2 pTM=0.754 ipTM=0.857 tol=2.23
2024-02-29 10:48:03,729 alphafold2_multimer_v3_model_1_seed_000 recycle=4 pLDDT=85.6 pTM=0.762 ipTM=0.869 tol=0.629
2024-02-29 10:48:13,948 alphafold2_multimer_v3_model_1_seed_000 recycle=5 pLDDT=85.5 pTM=0.758 ipTM=0.867 tol=0.294

Question

The templates 6s53_I, 6s53_K and 7bbd_D should not be used for protein A, based on the .m8 file. I have not been able to fully understand how the template_domain_names.json is generated within the code. In general the template generation does not seem to be consistent. Is this something that can be solved? Is there something wrong in my approach?
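One way to make this check explicit is to compare the PDB entries listed per query in the .m8 file with the names in template_domain_names.json. The following is only a minimal sketch, not part of ColabFold: the function names are hypothetical, the data is pasted from this issue (three of the .m8 lines and chain A of the local JSON), and it assumes that query ID 101 corresponds to chain A and 102 to chain B. It compares by PDB entry (the part before the underscore), because the chain letter can differ between the .m8 hit and the JSON (7x8v_E vs. 7x8v_A).

```python
import json
from collections import defaultdict

# Three of the .m8 lines quoted in this issue (whitespace-separated columns).
M8_TEXT = """\
101 7x8v_E 1.000 184 0 0 4 187 1 184 1.956E-67 228 184M
102 7c96_B 0.528 70 33 0 26 95 1 70 6.214E-22 103 70M
102 1t1h_A 0.520 75 36 0 24 98 1 75 1.968E-20 98 75M
"""

# Chain A of template_domain_names.json from the local run in this issue.
LOCAL_JSON = (
    '{"A": ["7x8v_A", "7x8v_A", "7x8v_A", "7x8v_A", "7x8v_A", "7x8v_A", '
    '"6s53_I", "6s53_K", "7bbd_D"]}'
)


def pdb_entries_per_query(m8_text):
    """Map each query ID (column 1) to the set of PDB entries it hit (column 2)."""
    hits = defaultdict(set)
    for line in m8_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        query_id, target = fields[0], fields[1]
        hits[query_id].add(target.split("_")[0].lower())
    return dict(hits)


def unexpected_templates(template_names, hits, chain_to_query):
    """Per chain, list templates whose PDB entry is not among that query's hits."""
    report = {}
    for chain, names in template_names.items():
        allowed = hits.get(chain_to_query[chain], set())
        extra = []
        for name in names:
            entry = name.split("_")[0].lower()
            if entry not in allowed and name not in extra:
                extra.append(name)
        report[chain] = extra
    return report


hits = pdb_entries_per_query(M8_TEXT)
# ASSUMPTION: query 101 is chain A and query 102 is chain B.
report = unexpected_templates(json.loads(LOCAL_JSON), hits, {"A": "101", "B": "102"})
print(report)
```

With the full .m8 file and the full JSON, the same comparison could be run for chain B as well; with only the first lines of the .m8, most of chain B's templates would be flagged simply because their hits are not in the excerpt.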

Nuta0 avatar Feb 29 '24 03:02 Nuta0

The template search is essentially redone during colabfold_batch via the same mechanism as the custom template database flag, just with a selection of .cif files taken from the .m8 file, limited by the maximum number of templates set. So during the colabfold_batch run, Sequence 0 and Sequence 1 share the same template database and can each match the other's templates. I think AlphaFold was trained to deal with various degrees of template matching just fine (you reach a high ipTM value by recycle 1). I think you would have to manually change the code to use a different custom template database per unique query sequence.

NickWoodall avatar Feb 29 '24 18:02 NickWoodall

@NickWoodall Thank you for the response. I can see that the template folders created by the server are slightly different from those created by the local approach.

Directory generated by colabfold_batch --templates --amber --use-gpu-relax input.csv output:

.
├── cite.bibtex
├── config.json
├── log.txt
├── heterodimer_2.a3m
├── heterodimer_2_coverage.png
├── heterodimer_2.done.txt
├── heterodimer_2_env
│   ├── bfd.mgnify30.metaeuk30.smag30.a3m
│   ├── msa.sh
│   ├── out.tar.gz
│   ├── pdb70.m8
│   ├── templates_101
│   │   ├── 7x8v.cif
│   │   ├── pdb70_a3m.ffdata
│   │   ├── pdb70_a3m.ffindex
│   │   ├── pdb70_cs219.ffdata
│   │   └── pdb70_cs219.ffindex -> pdb70_a3m.ffindex
│   ├── templates_102
│   │   ├── 1t1h.cif
│   │   ├── 2c2l.cif
│   │   ├── 2c2v.cif
│   │   ├── 2f42.cif
│   │   ├── 2oxq.cif
│   │   ├── 5olm.cif
│   │   ├── 6fga.cif
│   │   ├── 6s53.cif
│   │   ├── 7bbd.cif
│   │   ├── 7c96.cif
│   │   ├── 8a58.cif
│   │   ├── pdb70_a3m.ffdata
│   │   ├── pdb70_a3m.ffindex
│   │   ├── pdb70_cs219.ffdata
│   │   └── pdb70_cs219.ffindex -> pdb70_a3m.ffindex
│   └── uniref.a3m
├── heterodimer_2_pae.png
├── heterodimer_2_pairgreedy
│   ├── out.tar.gz
│   ├── pair.a3m
│   └── pair.sh
├── heterodimer_2_plddt.png
├── heterodimer_2_predicted_aligned_error_v1.json
├── heterodimer_2_relaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb
├── heterodimer_2_relaxed_rank_002_alphafold2_multimer_v3_model_3_seed_000.pdb
├── heterodimer_2_relaxed_rank_003_alphafold2_multimer_v3_model_5_seed_000.pdb
├── heterodimer_2_relaxed_rank_004_alphafold2_multimer_v3_model_2_seed_000.pdb
├── heterodimer_2_relaxed_rank_005_alphafold2_multimer_v3_model_4_seed_000.pdb
├── heterodimer_2_scores_rank_001_alphafold2_multimer_v3_model_1_seed_000.json
├── heterodimer_2_scores_rank_002_alphafold2_multimer_v3_model_3_seed_000.json
├── heterodimer_2_scores_rank_003_alphafold2_multimer_v3_model_5_seed_000.json
├── heterodimer_2_scores_rank_004_alphafold2_multimer_v3_model_2_seed_000.json
├── heterodimer_2_scores_rank_005_alphafold2_multimer_v3_model_4_seed_000.json
├── heterodimer_2_template_domain_names.json
├── heterodimer_2_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb
├── heterodimer_2_unrelaxed_rank_002_alphafold2_multimer_v3_model_3_seed_000.pdb
├── heterodimer_2_unrelaxed_rank_003_alphafold2_multimer_v3_model_5_seed_000.pdb
├── heterodimer_2_unrelaxed_rank_004_alphafold2_multimer_v3_model_2_seed_000.pdb
└── heterodimer_2_unrelaxed_rank_005_alphafold2_multimer_v3_model_4_seed_000.pdb

Directory generated by

LOCALPDBPATH="/data/scratch/datasets/alphafold/v2.3.2/pdb_mmcif/mmcif_files"
RANDOMSEED=0
PDBHITFILE="heterodimer_2_pdb100_230517.m8"

  # Run the colabfold_batch command
  colabfold_batch \
    --amber \
    --templates \
    --use-gpu-relax \
    --pdb-hit-file msas/${PDBHITFILE} \
    --local-pdb-path ${LOCALPDBPATH} \
    --random-seed ${RANDOMSEED} \
    msas/heterodimer_2.a3m \
    output

:

.
├── cite.bibtex
├── config.json
├── log.txt
├── heterodimer_2.a3m
├── heterodimer_2_coverage.png
├── heterodimer_2.done.txt
├── heterodimer_2_pae.png
├── heterodimer_2_plddt.png
├── heterodimer_2_predicted_aligned_error_v1.json
├── heterodimer_2_relaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb
├── heterodimer_2_relaxed_rank_002_alphafold2_multimer_v3_model_5_seed_000.pdb
├── heterodimer_2_relaxed_rank_003_alphafold2_multimer_v3_model_3_seed_000.pdb
├── heterodimer_2_relaxed_rank_004_alphafold2_multimer_v3_model_2_seed_000.pdb
├── heterodimer_2_relaxed_rank_005_alphafold2_multimer_v3_model_4_seed_000.pdb
├── heterodimer_2_scores_rank_001_alphafold2_multimer_v3_model_1_seed_000.json
├── heterodimer_2_scores_rank_002_alphafold2_multimer_v3_model_5_seed_000.json
├── heterodimer_2_scores_rank_003_alphafold2_multimer_v3_model_3_seed_000.json
├── heterodimer_2_scores_rank_004_alphafold2_multimer_v3_model_2_seed_000.json
├── heterodimer_2_scores_rank_005_alphafold2_multimer_v3_model_4_seed_000.json
├── heterodimer_2_template_domain_names.json
├── heterodimer_2_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb
├── heterodimer_2_unrelaxed_rank_002_alphafold2_multimer_v3_model_5_seed_000.pdb
├── heterodimer_2_unrelaxed_rank_003_alphafold2_multimer_v3_model_3_seed_000.pdb
├── heterodimer_2_unrelaxed_rank_004_alphafold2_multimer_v3_model_2_seed_000.pdb
├── heterodimer_2_unrelaxed_rank_005_alphafold2_multimer_v3_model_4_seed_000.pdb
└── templates
    ├── 1t1h.cif
    ├── 2c2l.cif
    ├── 2c2v.cif
    ├── 2f42.cif
    ├── 2oxq.cif
    ├── 5olm.cif
    ├── 6fga.cif
    ├── 6s53.cif
    ├── 7bbd.cif
    ├── 7c96.cif
    ├── 7x8v.cif
    ├── pdb70_a3m.ffdata
    ├── pdb70_a3m.ffindex
    ├── pdb70_cs219.ffdata
    └── pdb70_cs219.ffindex

I am still a bit concerned about reproducibility and about the wrong templates being used for a protein. Is this something that will be addressed by updates in the code or should I not be concerned?
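The difference between the two layouts can be summarised with a few set operations. This is just an illustrative sketch using the .cif file names from the two listings above; the variable names are made up, and it assumes that templates_101 belongs to the first chain (A) and templates_102 to the second (B).

```python
# PDB entries of the .cif files in the server-mode listing, per query folder.
SERVER_TEMPLATES = {
    "templates_101": {"7x8v"},
    "templates_102": {
        "1t1h", "2c2l", "2c2v", "2f42", "2oxq", "5olm",
        "6fga", "6s53", "7bbd", "7c96", "8a58",
    },
}

# PDB entries of the .cif files in the single local "templates" folder.
LOCAL_TEMPLATES = {
    "1t1h", "2c2l", "2c2v", "2f42", "2oxq", "5olm",
    "6fga", "6s53", "7bbd", "7c96", "7x8v",
}

# Everything the server fetched, regardless of which query it was fetched for.
server_union = set().union(*SERVER_TEMPLATES.values())

# Entries chain A can be matched against locally but not in server mode
# (ASSUMPTION: templates_101 is the folder for chain A).
extra_for_chain_a = sorted(LOCAL_TEMPLATES - SERVER_TEMPLATES["templates_101"])

# Entries present in server mode but absent from the local folder.
missing_locally = sorted(server_union - LOCAL_TEMPLATES)

print("extra for chain A in local mode:", extra_for_chain_a)
print("missing in local mode:", missing_locally)
```

In the local run, 6s53 and 7bbd sit in the same folder as 7x8v, which is consistent with 6s53_I, 6s53_K and 7bbd_D showing up for chain A. The local folder also lacks 8a58.cif, which the server listing contains.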

Nuta0 avatar Mar 07 '24 00:03 Nuta0

There is one big remaining difference that we intend to address at some point in the future:

In server mode, we fetch diverse precomputed MSAs in A3M format for each template (the PDB70 hh-suite db) and do the alignment based on the query A3M vs the template A3Ms.

The current implementation of the local templates only does a query A3M vs single template sequence.

Ideally, the local template search should also fetch the A3Ms from a locally available PDB70 hh-suite db.

milot-mirdita avatar Mar 07 '24 02:03 milot-mirdita

@milot-mirdita Do you think that difference will affect the results significantly? Do you have any timeline for when this will be addressed?

Nuta0 avatar Apr 02 '24 22:04 Nuta0

I am also interested in how similar the results would be. I have performed some analyses in online mode and I want to scale them up, but I am wondering whether running locally would give results that differ too much.

CBorreda avatar Oct 18 '24 08:10 CBorreda