HMMER MSAs aren't saved and repeated when running with --use_precomputed_msas=True

Open apelin20 opened this issue 1 year ago • 10 comments

I separated out the MSA step and run it with ProteinFold on CPU. After that I run AlphaFold on GPU with: --use_precomputed_msas=True

AlphaFold correctly picks up all the jackhmmer alignments but repeats a bunch of HMMER alignments, which on occasion can take quite some time.

I suspect this is happening because HMMER output is always directed to TMPDIR and is deleted after all MSAs have run, causing AlphaFold to repeat it. I also suspect the HMMER output isn't needed as long as you have all the .sto files, the last one being uniprot_hits.sto.

Is there any way to manually disable HMMER and start at this step: run_alphafold.py:191] Running model model_1_multimer_v3_pred_0 on K7R_CENPF?

Example HMMER repeat alignments:

hmmbuild.py:121] Launching subprocess ['/usr/bin/hmmbuild', '--hand', '--amino', '/scratch/tmp_AF/K7R_CENPF_10174/tmp_82megdk/output.hmm', '/scratch/tmp_AF/K7R_CENPF_10174/tmp_82megdk/query.msa']
I0206 16:20:13.699746 23223541839680 utils.py:36] Started hmmbuild query
I0206 16:20:16.644826 23223541839680 hmmbuild.py:128] hmmbuild stdout:
# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.1b2 (February 2015); http://hmmer.org/
# Copyright (C) 2015 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
...
utils.py:40] Finished hmmbuild query in 2.945 seconds
I0206 16:20:16.661780 23223541839680 hmmsearch.py:103] Launching sub-process ['/usr/bin/hmmsearch', '--noali', '--cpu', '8', '--F1', '0.1', '--F2', '0.1', '--F3', '0.1', '--incE', '100', '-E', '100', '--domE', '100', '--incdomE', '100', '-A', '/scratch/tmp_AF/K7R_CENPF_10174/tmpksyfr5el/output.sto', '/scratch/tmp_AF/K7R_CENPF_10174/tmpksyfr5el/query.hmm', '/wynton/group/gladstone/users/apelin/alphafold_DBs/pdb_seqres/pdb_seqres.txt']
I0206 16:20:16.697458 23223541839680 utils.py:36] Started hmmsearch (pdb_seqres.txt) query
I0206 16:22:01.650443 23223541839680 utils.py:40] Finished hmmsearch (pdb_seqres.txt) query in 104.952 seconds

apelin20 avatar Feb 06 '24 21:02 apelin20

I think mine are being saved but aren't being used. I'm looking into this and will let you know if I find anything, but yeah, I'm running into the same issue.

dltacube avatar Feb 14 '24 00:02 dltacube

@apelin20 there must be something wrong with the CLI. If you change the flag's default to True in run_docker.py then it works, FYI.

dltacube avatar Feb 14 '24 16:02 dltacube

Hey @dltacube, thanks for looking into this. Can you elaborate a little bit: what is CLI, and which flag should I change to True in run_docker.py? My singularity container invokes alphafold/run_alphafold.py

Sorry, I am a bit newb at this

apelin20 avatar Feb 14 '24 18:02 apelin20

That's totally fine. If you go to this file here, on that line, and change False to True, I think it might work. Modifying the run_docker.py file wouldn't help in your case.

CLI just means command line interface. I think the issue might lie in the part of the code that accepts options like use_precomputed_msas, which is why modifying the default value could solve your problem.

dltacube avatar Feb 14 '24 18:02 dltacube

hmmsearch and hhsearch are used for template search, so they will run even if the --use_precomputed_msas flag is present. https://github.com/google-deepmind/alphafold/issues/469#issuecomment-1131870718 The DBs for templates are relatively small, so it should not take much time... but it depends on users and their targets.

yamule avatar Feb 15 '24 15:02 yamule

Thanks for clarifying. Our cluster grants faster access to GPU jobs if the job has a 2h time limit, so every little bit of time counts. I suck at Python, but is there no way to redirect the search results to the msas folder and check whether they are present to avoid rerunning?

apelin20 avatar Feb 15 '24 15:02 apelin20

I think there is no way to redirect without changing some lines of the Python script. I think the easiest way to speed things up is to increase the number of CPUs used. https://github.com/google-deepmind/alphafold/blob/632ef575c64eff9eb5ed96c8c7b055bf675421ac/alphafold/data/tools/hmmsearch.py#L92 https://github.com/google-deepmind/alphafold/blob/632ef575c64eff9eb5ed96c8c7b055bf675421ac/alphafold/data/tools/hhsearch.py#L79 (adding '-cpu', 'number of cpu' around here, I guess), but the effect will be limited. Possibly no effect at all if the bottleneck is file IO or your CPU does not have that many cores.
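To make the CPU suggestion concrete, here is a rough sketch of building the hmmsearch argument list with a configurable thread count. The flags are copied from the log at the top of this issue (which already shows '--cpu', '8'); the function name `build_hmmsearch_cmd` and its parameters are made up for illustration and are not AlphaFold's actual API.

```python
import os


def build_hmmsearch_cmd(binary_path, hmm_path, database_path, out_sto_path,
                        n_cpu=None):
    """Assemble an hmmsearch argument list (hypothetical helper).

    The flags mirror the command shown in the log earlier in this thread;
    only the value passed to --cpu is made configurable.
    """
    if n_cpu is None:
        # Fall back to the number of cores the OS reports (at least 1).
        n_cpu = os.cpu_count() or 1
    return [
        binary_path,
        '--noali',
        '--cpu', str(n_cpu),
        '--F1', '0.1', '--F2', '0.1', '--F3', '0.1',
        '--incE', '100', '-E', '100',
        '--domE', '100', '--incdomE', '100',
        '-A', out_sto_path,   # write the alignment of hits to this file
        hmm_path,
        database_path,
    ]


cmd = build_hmmsearch_cmd('/usr/bin/hmmsearch', 'query.hmm',
                          'pdb_seqres.txt', 'output.sto', n_cpu=16)
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. As said above, more threads will not help if the search is limited by file IO.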

yamule avatar Feb 16 '24 11:02 yamule

There is. I mentioned how in my comment. Just change the default flag to true and it'll do it.

dltacube avatar Feb 17 '24 12:02 dltacube

Template search (hmmsearch, hhsearch) is not controlled by the --use_precomputed_msas flag, I think. https://github.com/google-deepmind/alphafold/blob/632ef575c64eff9eb5ed96c8c7b055bf675421ac/alphafold/data/pipeline.py#L184-L191

yamule avatar Feb 18 '24 08:02 yamule

The result of the template search is actually saved. The issue, as @yamule points out, is that the code doesn't check whether the output file already exists. In our modifications we have changed this, but since the template search depends on the earlier steps, you would ideally also compare the timestamps before reusing it. We don't do this, which is why I'm reluctant to share it at this point.
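A minimal sketch of that existence-plus-timestamp check, for anyone who wants to try it. This is not our actual modification: the names `cached_output_is_fresh` and `run_or_reuse` are hypothetical, the paths are placeholders, and `run_search` stands in for whatever callable runs the template search and returns its text output.

```python
import os


def cached_output_is_fresh(output_path, input_paths):
    """True if output_path exists and is at least as new as every input."""
    if not os.path.isfile(output_path):
        return False
    out_mtime = os.path.getmtime(output_path)
    for path in input_paths:
        # A missing or newer input means the cached result cannot be trusted.
        if not os.path.isfile(path) or os.path.getmtime(path) > out_mtime:
            return False
    return True


def run_or_reuse(output_path, input_paths, run_search):
    """Reuse a saved search result when it is fresh, otherwise rerun."""
    if cached_output_is_fresh(output_path, input_paths):
        with open(output_path) as f:
            return f.read()
    result = run_search()
    with open(output_path, 'w') as f:
        f.write(result)
    return result
```

Modification times are only a heuristic: copying the MSA folder between machines can reset them, in which case a content hash of the inputs would be more reliable.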

If you want to minimize the time spent redoing work already done on the CPU, you could even check whether features.pkl already exists and load it into feature_dict.
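Roughly, that could look like the sketch below. It assumes features.pkl is a plain pickle sitting in the target's output directory; `load_or_build_features` and the `build_features` callable are hypothetical names, with `build_features` standing in for the existing data pipeline call.

```python
import os
import pickle


def load_or_build_features(output_dir, build_features):
    """Load features.pkl if present, otherwise build and save it."""
    features_path = os.path.join(output_dir, 'features.pkl')
    if os.path.isfile(features_path):
        # Only unpickle files you produced yourself; pickle is not safe
        # for untrusted input.
        with open(features_path, 'rb') as f:
            return pickle.load(f)
    feature_dict = build_features()
    with open(features_path, 'wb') as f:
        pickle.dump(feature_dict, f, protocol=4)
    return feature_dict
```

The same timestamp caveat applies here: a stale features.pkl will be silently reused if the MSAs underneath it have changed.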

fredricj avatar Feb 19 '24 07:02 fredricj