HMMER MSAs aren't saved and are re-run when running with --use_precomputed_msas=True
I separated out the MSA part and am doing it with ProteinFold on CPU. After this I run AlphaFold on GPU with:
--use_precomputed_msas=True
AlphaFold correctly identifies all jackhmmer alignments but re-runs a bunch of HMMER alignments, which on occasion can take quite some time.
I suspect this is happening because the HMMER output is always directed to TMPDIR and is deleted after all MSAs are run, causing AlphaFold to repeat it. I also suspect the HMMER output isn't needed as long as you have all the .sto files, the last one being uniprot_hits.sto.
Is there any way to manually disable HMMER and start with the step:
run_alphafold.py:191] Running model model_1_multimer_v3_pred_0 on K7R_CENPF
Example of repeated HMMER alignments:
hmmbuild.py:121] Launching subprocess ['/usr/bin/hmmbuild', '--hand', '--amino', '/scratch/tmp_AF/K7R_CENPF_10174/tmp_82megdk/output.hmm', '/scratch/tmp_AF/K7R_CENPF_10174/tmp_82megdk/query.msa']
I0206 16:20:13.699746 23223541839680 utils.py:36] Started hmmbuild query
I0206 16:20:16.644826 23223541839680 hmmbuild.py:128] hmmbuild stdout:
# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.1b2 (February 2015); http://hmmer.org/
# Copyright (C) 2015 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
...
utils.py:40] Finished hmmbuild query in 2.945 seconds
I0206 16:20:16.661780 23223541839680 hmmsearch.py:103] Launching sub-process ['/usr/bin/hmmsearch', '--noali', '--cpu', '8', '--F1', '0.1', '--F2', '0.1', '--F3', '0.1', '--incE', '100', '-E', '100', '--domE', '100', '--incdomE', '100', '-A', '/scratch/tmp_AF/K7R_CENPF_10174/tmpksyfr5el/output.sto', '/scratch/tmp_AF/K7R_CENPF_10174/tmpksyfr5el/query.hmm', '/wynton/group/gladstone/users/apelin/alphafold_DBs/pdb_seqres/pdb_seqres.txt']
I0206 16:20:16.697458 23223541839680 utils.py:36] Started hmmsearch (pdb_seqres.txt) query
I0206 16:22:01.650443 23223541839680 utils.py:40] Finished hmmsearch (pdb_seqres.txt) query in 104.952 seconds
I think mine are being saved but aren't being used. I'm looking into this and will let you know if I find anything, but yeah, I'm running into the same issue.
@apelin20 it must be something wrong with the CLI. If you modify the flag's default to True in run_docker.py, then it works, FYI.
Hey @dltacube, thanks for looking into this. Can you elaborate a little? What is the CLI, and which flag should I change to True in run_docker.py? My Singularity container invokes alphafold/run_alphafold.py. Sorry, I am a bit of a newb at this.
That's totally fine. If you go to this file here on that line and change False to True, I think it might work. Modifying the run_docker.py file wouldn't help in your case. CLI is just the command-line interface, and the part of the code that accepts options like use_precomputed_msas is, I think, where the issue lies. Hence modifying the default value could solve your problem.
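In case it helps locate it: if I remember right, the definition near the top of run_alphafold.py looks roughly like this (help text paraphrased from memory, not a verbatim quote), and the suggestion is simply to flip the second argument, which is the default value:

```python
# Roughly what the flag definition in run_alphafold.py looks like (help text
# paraphrased); the second argument is the default to change.
from absl import flags  # already imported at the top of run_alphafold.py

flags.DEFINE_bool(
    'use_precomputed_msas', True,  # the stock default is False
    'Whether to read MSAs that have been written to disk instead of running '
    'the MSA tools, looked up in the output directory.')
```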
hmmsearch and hhsearch are used for template search, so they will run even if the --use_precomputed_msas flag is present. https://github.com/google-deepmind/alphafold/issues/469#issuecomment-1131870718 The DBs for templates are relatively small, so it should not take much time... but it depends on the user and their targets.
Thanks for clarifying. Our resource/cluster grants faster access to GPU jobs if the job has a 2-hour time limit, so every little bit of time counts. I suck at Python, but is there no way to redirect the search results to the msas folder and check whether they are present, to avoid re-running them?
I think there is no way to redirect it without changing some lines of the Python script. I think the easiest way to speed things up is to increase the number of CPUs used. https://github.com/google-deepmind/alphafold/blob/632ef575c64eff9eb5ed96c8c7b055bf675421ac/alphafold/data/tools/hmmsearch.py#L92 https://github.com/google-deepmind/alphafold/blob/632ef575c64eff9eb5ed96c8c7b055bf675421ac/alphafold/data/tools/hhsearch.py#L79 (adding '-cpu', '<number of cpus>' around here, I guess), but the effect will be limited. Possibly no effect if the bottleneck is file I/O or your CPU does not have that many cores.
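As a standalone illustration of the kind of hmmsearch call AlphaFold issues (the flags below are copied from the log earlier in this thread; the paths are placeholders), with the CPU count raised:

```python
# Sketch only: the same hmmsearch invocation shown in the log above, with a
# higher CPU count. Paths are placeholders; adjust to your setup.
import subprocess

HMMSEARCH = '/usr/bin/hmmsearch'
N_CPU = '16'  # the stock pipeline uses '8'

cmd = [
    HMMSEARCH, '--noali', '--cpu', N_CPU,
    '--F1', '0.1', '--F2', '0.1', '--F3', '0.1',
    '--incE', '100', '-E', '100', '--domE', '100', '--incdomE', '100',
    '-A', 'output.sto',                      # Stockholm file with the hits
    'query.hmm',                             # profile built by hmmbuild from the query MSA
    '/path/to/pdb_seqres/pdb_seqres.txt',    # template sequence database
]
subprocess.run(cmd, check=True)
```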
There is. I mentioned how in my comment: just change the flag's default to True and it'll do it.
Template search (hmmsearch, hhsearch) is not controlled by the --use_precomputed_msas flag, I think. https://github.com/google-deepmind/alphafold/blob/632ef575c64eff9eb5ed96c8c7b055bf675421ac/alphafold/data/pipeline.py#L184-L191
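To show what I mean: the jackhmmer/hhblits steps go through a helper that skips the tool when the output file already exists, while the template-search call at the linked lines has no such guard. A rough paraphrase from memory (not a verbatim quote of pipeline.py):

```python
# Paraphrased from memory, not verbatim: the MSA steps in
# alphafold/data/pipeline.py go through a check along these lines.
import os

def run_msa_tool(msa_runner, input_fasta_path, msa_out_path, msa_format,
                 use_precomputed_msas):
    """Reuse msa_out_path if it exists and precomputed MSAs were requested."""
    if not use_precomputed_msas or not os.path.exists(msa_out_path):
        result = msa_runner.query(input_fasta_path)[0]
        with open(msa_out_path, 'w') as f:
            f.write(result[msa_format])
    else:
        with open(msa_out_path) as f:
            result = {msa_format: f.read()}
    return result

# ...whereas the template search at the linked lines is (roughly) called
# unconditionally on the uniref90 MSA:
#   pdb_templates_result = template_searcher.query(msa_for_templates)
# so hmmbuild/hmmsearch rerun every time, precomputed MSAs or not.
```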
The result from the template search is actually saved. The issue is, as @yamule is pointing out, that the code doesn't check whether the output file already exists. In our modifications we have changed this, but since the template search depends on the preceding steps, you would ideally also need to compare timestamps before reusing it. We don't do this, hence why I'm reluctant to share it at this point.
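The general shape of such a check might look something like this (illustrative only; the names here are made up for the sketch and this is not the modification referred to above; it assumes the searcher's query() returns the Stockholm text):

```python
# Illustrative sketch; function and variable names are hypothetical.
import os

def reuse_or_run_template_search(template_searcher, msa_for_templates,
                                 templates_out_path, uniref90_out_path):
    """Reuse a saved template-search result if it exists and is newer than the
    uniref90 MSA it was derived from; otherwise run the search and save it."""
    if (os.path.exists(templates_out_path)
            and os.path.getmtime(templates_out_path)
                >= os.path.getmtime(uniref90_out_path)):
        with open(templates_out_path) as f:
            return f.read()
    result = template_searcher.query(msa_for_templates)
    with open(templates_out_path, 'w') as f:
        f.write(result)
    return result
```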
If you want to minimize the time spent redoing work already done on the CPU, you could even check whether features.pkl already exists and load it into feature_dict.
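For example, something like this around the feature-generation step in run_alphafold.py (a sketch; the variable names follow that script but check them against your copy):

```python
# Sketch of guarding the feature generation in run_alphafold.py; variable
# names (output_dir, data_pipeline, fasta_path, msa_output_dir) are assumed
# to match that script, so verify against your version.
import os
import pickle

features_output_path = os.path.join(output_dir, 'features.pkl')
if os.path.exists(features_output_path):
    # Reuse the feature dict produced by the earlier CPU-only run.
    with open(features_output_path, 'rb') as f:
        feature_dict = pickle.load(f)
else:
    feature_dict = data_pipeline.process(
        input_fasta_path=fasta_path,
        msa_output_dir=msa_output_dir)
    with open(features_output_path, 'wb') as f:
        pickle.dump(feature_dict, f, protocol=4)
```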