openfold
Multiple Protein Sequences and speed of alignment runner
Hi,
First of all, great work. Just one question:
Is it possible to pass a batch of protein sequences to alignment_runner.run(fasta_path, local_alignment_dir) in run_pretrained_openfold.py, and save time by processing them as a batch instead of calling the function once per sequence?
From what I understand, the bottleneck is the search against the genetic databases. I tried giving it more CPUs, but that did not make it faster. Any suggestions for speeding up this part of the code?
Take a look at scripts/precompute_alignments.py, a multithreaded alignment runner. This should be faster than just increasing the number of CPUs available to the alignment runner directly (some of the alignment tools don't scale very well). It's possible that the script is a little out of date, so do let me know if you encounter any issues.
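To illustrate the idea of running several alignment jobs side by side, here is a minimal sketch of a thread pool that calls a per-sequence function on each FASTA file. It is not the code of scripts/precompute_alignments.py: `run_in_parallel` and `run_one` are hypothetical names, and `run_one` stands in for whatever launches the alignment tools for one file (for example a wrapper around alignment_runner.run). Threads are enough here on the assumption that the heavy work happens in external processes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_parallel(fasta_paths, run_one, no_tasks):
    """Call run_one(path) for every path, at most no_tasks at a time.

    run_one is a hypothetical callable supplied by the caller.
    Returns (results, failures), both keyed by path, so that one failing
    sequence does not stop the remaining ones.
    """
    results = {}
    failures = {}
    with ThreadPoolExecutor(max_workers=no_tasks) as pool:
        futures = {pool.submit(run_one, path): path for path in fasta_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and keep going
                failures[path] = exc
    return results, failures
```

The `failures` dictionary gives a list of inputs to retry individually later.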
Thanks for your answer. I checked the file and still have some questions.
- For the input, can I pass a single FASTA file containing multiple sequences, or should I create a separate FASTA file for each sequence and put them all in an input_dir folder?
- What is the role of args.mmcif_cache, and how should I generate the mmcif_cache files? I noticed there is a script named generate_mmcif_cache.py; could you elaborate on that?
Sorry for the delayed reply.
- You should do the latter, with one sequence per .fasta file.
- The mmcif cache contains some information from the mmCIF files you're using, e.g. the release date of the protein in question. It's primarily used during the template search to filter out disqualifying proteins without having to perform an expensive mmCIF parse. It's generated using scripts/generate_mmcif_cache.py, which is also multithreaded and runs pretty quickly on the PDB mmCIFs, for example.
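For the first point, a multi-sequence FASTA file can be split into one file per sequence with a few lines of Python. This is only a sketch: the function name `split_fasta` and the choice to name each file after the first token of its header are assumptions, not something the OpenFold scripts require.

```python
from pathlib import Path


def split_fasta(multi_fasta: Path, out_dir: Path) -> list[Path]:
    """Write each record of a multi-sequence FASTA to its own .fasta file."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # Collect (header, sequence lines) pairs; a record starts at each ">" line.
    records: list[tuple[str, list[str]]] = []
    for line in multi_fasta.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append((line[1:], []))
        elif records:
            records[-1][1].append(line)

    written = []
    for index, (header, seq_lines) in enumerate(records):
        # Assumed naming scheme: first header token, made filesystem-safe.
        # Records that share a first token would overwrite each other.
        tokens = header.split()
        tag = tokens[0] if tokens else f"seq_{index}"
        safe = "".join(c if c.isalnum() or c in "-_." else "_" for c in tag)
        path = out_dir / f"{safe}.fasta"
        path.write_text(f">{header}\n{''.join(seq_lines)}\n")
        written.append(path)
    return written
```

Calling `split_fasta(Path("all.fasta"), Path("input_dir"))` would then fill input_dir with one file per record.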
Thanks for the clarification.
I followed your suggestion and created 1740 separate FASTA files. I ran precompute_alignments.py on an instance with 96 CPUs and 600 GB of RAM, with --no_tasks=24 and --cpus_per_task=cpu_count(), but after a while the system crashed and the run was killed because it ran out of memory. Do you have any suggestions for how to run these 1740 FASTA files?
Should I decrease --no_tasks from 24?
I also get this error for one of the FASTA files. Do you know what the problem is with running the alignment for this specific protein?
ERROR:root:HHblits failed. HHblits stderr begin:
ERROR:root:- 16:33:30.426 INFO: Searching 65983866 column state sequences.
ERROR:root:- 16:33:32.178 INFO: Searching 15161831 column state sequences.
ERROR:root:- 16:33:32.287 INFO: /tmp/tmpwysaayvu.fasta is in A2M, A3M or FASTA format
ERROR:root:- 16:33:32.288 INFO: Iteration 1
ERROR:root:- 16:33:32.457 INFO: Prefiltering database
ERROR:root:- 16:36:06.235 INFO: HMMs passed 1st prefilter (gapless profile-profile alignment) : 1499654
ERROR:root:- 16:36:32.380 INFO: HMMs passed 1st prefilter (gapless profile-profile alignment) : 390893
ERROR:root:- 16:36:35.782 WARNING: database contains sequences that exceed maximum allowed size (maxres = 20001). Max sequence length can be increased with parameter -maxres.
ERROR:root:- 16:36:42.437 INFO: HMMs passed 2nd prefilter (gapped profile-profile alignment) : 98508
ERROR:root:- 16:36:42.437 INFO: HMMs passed 2nd prefilter and not found in previous iterations : 98508
ERROR:root:- 16:36:42.437 INFO: Scoring 98508 HMMs using HMM-HMM Viterbi alignment
ERROR:root:- 16:36:51.799 INFO: Alternative alignment: 0
ERROR:root:- 16:36:57.429 INFO: 2000 alignments done
ERROR:root:- 16:37:01.538 INFO: 4000 alignments done
ERROR:root:- 16:37:04.911 INFO: 6000 alignments done
ERROR:root:- 16:37:06.851 INFO: 8000 alignments done
ERROR:root:- 16:37:08.177 INFO: 10000 alignments done
ERROR:root:- 16:37:09.859 INFO: 12000 alignments done
ERROR:root:- 16:37:12.247 INFO: 14000 alignments done
ERROR:root:- 16:37:13.215 INFO: 16000 alignments done
ERROR:root:- 16:37:15.191 INFO: 18000 alignments done
ERROR:root:- 16:37:17.731 INFO: 20000 alignments done
ERROR:root:- 16:37:19.125 INFO: 22000 alignments done
ERROR:root:- 16:37:20.715 INFO: 24000 alignments done
ERROR:root:- 16:37:22.543 INFO: 26000 alignments done
ERROR:root:- 16:37:24.891 INFO: 28000 alignments done
ERROR:root:- 16:37:26.955 INFO: 30000 alignments done
ERROR:root:- 16:37:28.591 INFO: 32000 alignments done
ERROR:root:- 16:37:31.543 INFO: 34000 alignments done
ERROR:root:- 16:37:34.015 INFO: 36000 alignments done
ERROR:root:- 16:37:35.903 INFO: 38000 alignments done
ERROR:root:- 16:37:38.063 INFO: 40000 alignments done
ERROR:root:- 16:37:39.931 INFO: 42000 alignments done
ERROR:root:- 16:37:41.799 INFO: 44000 alignments done
ERROR:root:- 16:37:43.663 INFO: 46000 alignments done
ERROR:root:- 16:37:45.491 INFO: 48000 alignments done
ERROR:root:- 16:37:47.319 INFO: 50000 alignments done
ERROR:root:- 16:37:49.557 INFO: 52000 alignments done
ERROR:root:- 16:37:51.575 INFO: 54000 alignments done
ERROR:root:- 16:37:53.605 INFO: 56000 alignments done
ERROR:root:- 16:37:55.639 INFO: 58000 alignments done
ERROR:root:- 16:37:57.055 INFO: 60000 alignments done
ERROR:root:- 16:37:58.875 INFO: 62000 alignments done
ERROR:root:- 16:37:59.589 INFO: 64000 alignments done
ERROR:root:- 16:38:00.047 INFO: 66000 alignments done
ERROR:root:- 16:38:01.035 INFO: 68000 alignments done
ERROR:root:- 16:38:01.002 INFO: 70000 alignments done
ERROR:root:- 16:38:02.804 INFO: 72000 alignments done
ERROR:root:- 16:38:03.779 INFO: 74000 alignments done
ERROR:root:- 16:38:05.551 INFO: 76000 alignments done
ERROR:root:- 16:38:07.567 INFO: 78000 alignments done
ERROR:root:- 16:38:09.567 INFO: 80000 alignments done
ERROR:root:- 16:38:10.887 INFO: 82000 alignments done
ERROR:root:- 16:38:12.991 INFO: 84000 alignments done
ERROR:root:- 16:38:14.399 INFO: 86000 alignments done
ERROR:root:- 16:38:15.869 INFO: 88000 alignments done
ERROR:root:- 16:38:17.691 INFO: 90000 alignments done
ERROR:root:- 16:38:18.955 INFO: 92000 alignments done
ERROR:root:- 16:38:20.883 INFO: 94000 alignments done
ERROR:root:- 16:38:22.451 INFO: 96000 alignments done
ERROR:root:- 16:38:23.611 INFO: 98000 alignments done
ERROR:root:- 16:38:23.926 INFO: 98508 alignments done
ERROR:root:- 16:38:23.968 INFO: Alternative alignment: 1
ERROR:root:- 16:40:02.915 INFO: 98456 alignments done
ERROR:root:- 16:40:04.165 INFO: Alternative alignment: 2
ERROR:root:- 16:40:39.536 INFO: 22460 alignments done
ERROR:root:- 16:40:39.699 INFO: Alternative alignment: 3
ERROR:root:- 16:40:53.188 INFO: 6753 alignments done
ERROR:root:- 16:41:06.008 INFO: Premerge done
ERROR:root:- 16:41:06.087 INFO: Realigning 74539 HMM-HMM alignments using Maximum Accuracy algorithm
ERROR:root:- 17:45:23.434 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 17:54:38.666 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:20:38.682 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:21:54.506 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:22:22.677 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:22:34.270 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:23:26.213 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:23:37.095 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:23:37.392 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:24:37.948 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:24:39.597 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:25:39.570 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:25:49.885 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:26:13.272 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:26:29.900 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:27:15.154 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:27:37.047 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:28:19.271 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:28:33.264 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:28:36.489 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:28:51.406 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:29:45.612 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:30:16.568 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:30:28.793 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:30:31.831 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:53:37.048 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:54:00.896 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:54:11.610 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:54:14.370 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:54:16.348 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:54:31.863 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:54:48.719 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:55:43.895 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:56:12.632 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:56:32.998 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:57:39.539 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:57:43.100 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:58:04.449 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:58:36.037 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:59:07.097 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 18:59:28.371 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
ERROR:root:- 19:21:15.502 INFO: 74542 sequences belonging to 74542 database HMMs found with an E-value < 0.001
ERROR:root:- 19:21:15.502 INFO: Number of effective sequences of resulting query HMM: Neff = 11.5719
ERROR:root:- 19:21:21.061 INFO: Iteration 2
ERROR:root:- 19:21:21.061 INFO: Set premerge to 0! (premerge: 3 iteration: 2 hits.Size: 74539)
ERROR:root:- 19:21:21.184 INFO: Prefiltering database
ERROR:root:- 19:22:53.737 INFO: HMMs passed 1st prefilter (gapless profile-profile alignment) : 1505027
ERROR:root:- 19:23:01.106 WARNING: Number of hits passing 2nd prefilter (reduced from 109391 to allowed maximum of 100000).
ERROR:root:You can increase the allowed maximum using the -maxfilt
HHblits seems to have some memory issues; there are a couple of open issues on the HH-suite GitHub about similar crashes. I don't think this one in particular has anything to do with our tools. Running this sequence individually might help, though.
Again, it's probably an issue with HHblits, not our script. I recommend removing the offending FASTA files for now and running them individually later on.
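Setting the offending files aside can be scripted. The sketch below moves a list of named FASTA files out of the input directory into a separate folder so they can be run one at a time later. The function name `set_aside` and the directory names are hypothetical; the list of failing files has to come from your own logs.

```python
import shutil
from pathlib import Path


def set_aside(input_dir: Path, failed_names: list[str], deferred_dir: Path) -> list[Path]:
    """Move the named FASTA files from input_dir into deferred_dir.

    Names that do not exist in input_dir are skipped. Returns the new paths.
    """
    deferred_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in failed_names:
        src = input_dir / name
        if src.is_file():
            dst = deferred_dir / name
            shutil.move(str(src), str(dst))
            moved.append(dst)
    return moved
```

The batch run can then be restarted on input_dir, and the deferred folder processed afterwards with a single task.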
Could you please explain the relationship between increasing --no_tasks and RAM usage?
If I remove the offending FASTA files, what is the best setting for --no_tasks to make sure I don't run out of RAM again?
It seems that the script processes --no_tasks protein sequences simultaneously in each round, and that when some sequences take longer, the other threads move on to the next batch of FASTA files. Could you give me some details on how the code runs and what role the --no_tasks parameter plays?
--no_tasks is just the number of threads that are created to run HHblits. The threads share the same pool of RAM, so decreasing the number of threads does increase the amount of memory available to each one, yes. What I'm saying, however, is that these HHblits memory bugs don't seem to be very responsive to the amount of memory available, which is why I recommended just removing the bad proteins for now. If you want, you can also try reducing the number of tasks first.
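As a rough way to reason about the setting, the number of concurrent tasks is bounded both by memory (shared RAM divided by the peak usage of one task) and by cores (total CPUs divided by CPUs per task). The helper below is a back-of-envelope sketch, not part of OpenFold: `suggest_no_tasks`, the `headroom` factor, and the 50 GB per-task peak used in the example are all assumptions, and the real peak has to be measured on your own data.

```python
def suggest_no_tasks(
    total_ram_gb: float,
    peak_ram_per_task_gb: float,
    total_cpus: int,
    cpus_per_task: int,
    headroom: float = 0.8,
) -> int:
    """Estimate how many alignment tasks can run at once.

    headroom keeps a fraction of RAM free for the OS and page cache
    (an assumed safety margin, not a measured value).
    """
    by_ram = int(total_ram_gb * headroom // peak_ram_per_task_gb)
    by_cpu = total_cpus // cpus_per_task
    return max(1, min(by_ram, by_cpu))


# Hypothetical numbers for a 96-CPU, 600 GB machine, assuming a task peaks at 50 GB:
tasks_small = suggest_no_tasks(600, 50, total_cpus=96, cpus_per_task=4)   # RAM-bound
tasks_full = suggest_no_tasks(600, 50, total_cpus=96, cpus_per_task=96)   # CPU-bound
```

Note that 24 tasks, each told to use all 96 CPUs, asks for 24 × 96 = 2304 worker threads on 96 cores, so lowering --cpus_per_task is worth trying alongside lowering --no_tasks.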