Faster MSA computations with chunked DBs on Colab
There is a stark difference in the time taken to compute the MSA on Colab versus AlphaFold's actual implementation run through Docker. I was trying to figure out what differs, and the most obvious difference seems to be that on Colab jackhmmer runs on chunks of the original databases.
I already have the whole dataset downloaded, and for the proteins I tested, both approaches find almost the same number of sequences. However, the MSA takes merely 20-30 minutes in the Colab version, whereas through Docker it takes 7-9 hours (mostly jackhmmer on uniref90 and mgnify, even when I get a total of <200 sequence matches for a protein with only 77 amino acids).
I am running on an AWS notebook instance (ml.g4dn.4xlarge), with all the data on the SSD.
I have some questions:
- Why is the 7-9 hour approach suggested?
- Is this the only reason MSA computation is so fast on Colab?
- Might there be something wrong with my setup, given that jackhmmer on Colab takes less time than my local run?
- jackhmmer chunking is currently only supported over the internet. Would it be as fast or faster if I implemented it locally myself?
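To make the last question concrete, here is a minimal sketch of what local chunking might look like. It is not AlphaFold's implementation: `iter_fasta_records`, `split_fasta`, and `jackhmmer_command` are hypothetical helpers, and the `-N 1` / `-E 0.0001` values are assumed settings. The `-Z` flag tells jackhmmer the total number of sequences in the full database, so that E-values computed on a chunk stay comparable to a whole-database search. Whether this is actually faster locally would still need to be measured.

```python
from pathlib import Path
from typing import Iterator, List


def iter_fasta_records(path: Path) -> Iterator[str]:
    """Yield one FASTA record (header line plus its sequence lines) at a time."""
    record: List[str] = []
    with open(path) as handle:
        for line in handle:
            # A new header closes the record collected so far.
            if line.startswith(">") and record:
                yield "".join(record)
                record = []
            record.append(line)
    if record:
        yield "".join(record)


def split_fasta(path: Path, out_dir: Path, records_per_chunk: int) -> List[Path]:
    """Write the database into chunk files without splitting any record."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths: List[Path] = []
    buffer: List[str] = []

    def flush() -> None:
        # Hypothetical naming scheme: <database file name>.<chunk index>
        chunk_path = out_dir / f"{path.name}.{len(chunk_paths) + 1}"
        chunk_path.write_text("".join(buffer))
        chunk_paths.append(chunk_path)
        buffer.clear()

    for record in iter_fasta_records(path):
        buffer.append(record)
        if len(buffer) == records_per_chunk:
            flush()
    if buffer:
        flush()
    return chunk_paths


def jackhmmer_command(query: Path, chunk: Path, output_sto: Path,
                      total_db_sequences: int, n_cpu: int = 8) -> List[str]:
    """Build (but do not run) a jackhmmer call for a single chunk."""
    return [
        "jackhmmer",
        "-o", "/dev/null",              # discard the human-readable report
        "-A", str(output_sto),          # save the alignment of hits
        "--noali",
        "--cpu", str(n_cpu),
        "-N", "1",                      # assumed: a single iteration
        "-E", "0.0001",                 # assumed E-value threshold
        "-Z", str(total_db_sequences),  # size of the FULL database, not the chunk
        str(query),
        str(chunk),
    ]
```

Each command could then be run with `subprocess.run(cmd, check=True)`, one chunk at a time or several in parallel, and the per-chunk alignments merged afterwards. With more than one iteration the chunks are no longer independent of each other, so the results may differ from a whole-database search.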
Here are the relevant timing logs for comparison. I believe jackhmmer is the slowest step, and I have run 3-4 proteins through and got similar timings.
I0419 09:01:46.545762 140559696049984 run_docker.py:255] I0419 09:01:46.545098 140281249204032 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0419 13:09:40.188118 140559696049984 run_docker.py:255] I0419 13:09:40.186011 140281249204032 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 14873.641 seconds
I0419 13:09:40.283886 140559696049984 run_docker.py:255] I0419 13:09:40.283068 140281249204032 utils.py:36] Started Jackhmmer (mgy_clusters_2018_12.fa) query
I0419 17:24:10.802730 140559696049984 run_docker.py:255] I0419 17:24:10.802150 140281249204032 utils.py:40] Finished Jackhmmer (mgy_clusters_2018_12.fa) query in 15270.519 seconds
I0419 17:24:11.070103 140559696049984 run_docker.py:255] I0419 17:24:11.069362 140281249204032 utils.py:36] Started HHsearch query
I0419 17:31:10.780151 140559696049984 run_docker.py:255] I0419 17:31:10.779571 140281249204032 utils.py:40] Finished HHsearch query in 419.710 seconds
I0419 17:31:11.022377 140559696049984 run_docker.py:255] I0419 17:31:11.021669 140281249204032 utils.py:36] Started HHblits query
I0419 18:09:46.100788 140559696049984 run_docker.py:255] I0419 18:09:46.100168 140281249204032 utils.py:40] Finished HHblits query in 2315.078 seconds
I0419 18:10:14.906161 140559696049984 run_docker.py:255] I0419 18:10:14.905608 140281249204032 pipeline.py:234] Uniref90 MSA size: 6040 sequences.
I0419 18:10:14.906352 140559696049984 run_docker.py:255] I0419 18:10:14.905785 140281249204032 pipeline.py:235] BFD MSA size: 1328 sequences.
I0419 18:10:14.906456 140559696049984 run_docker.py:255] I0419 18:10:14.905830 140281249204032 pipeline.py:236] MGnify MSA size: 248 sequences.
I0419 18:10:14.906560 140559696049984 run_docker.py:255] I0419 18:10:14.905873 140281249204032 pipeline.py:238] Final (deduplicated) MSA size: 6010 sequences.
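For anyone comparing their own runs, the per-stage times can be pulled out of these logs with a short script. This is only a sketch: `stage_durations` is a hypothetical helper, and the regular expression assumes the `Finished <tool> (<db>) query in <n> seconds` format seen in the lines above.

```python
import re
from typing import Dict

# Matches e.g. "Finished Jackhmmer (uniref90.fasta) query in 14873.641 seconds"
# and "Finished HHsearch query in 419.710 seconds" (database name is optional).
FINISHED = re.compile(
    r"Finished (?P<tool>\S+) (?:\((?P<db>[^)]+)\) )?query in (?P<seconds>[\d.]+) seconds"
)


def stage_durations(log_text: str) -> Dict[str, float]:
    """Map each finished stage to its duration in seconds."""
    durations: Dict[str, float] = {}
    for match in FINISHED.finditer(log_text):
        name = match.group("tool")
        if match.group("db"):
            name += f" ({match.group('db')})"
        durations[name] = float(match.group("seconds"))
    return durations


if __name__ == "__main__":
    # The four "Finished" lines from the log above, with the prefixes trimmed.
    sample = """
    utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 14873.641 seconds
    utils.py:40] Finished Jackhmmer (mgy_clusters_2018_12.fa) query in 15270.519 seconds
    utils.py:40] Finished HHsearch query in 419.710 seconds
    utils.py:40] Finished HHblits query in 2315.078 seconds
    """
    durations = stage_durations(sample)
    total = sum(durations.values())
    for stage, seconds in durations.items():
        print(f"{stage}: {seconds / 3600:.2f} h ({seconds / total:.0%})")
    print(f"total: {total / 3600:.2f} h")
```

On the log above, the four timed stages add up to about 9.1 hours, of which the two jackhmmer searches account for roughly 92%.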
As far as I can tell, Colab is not using jackhmmer but MMseqs2.