Potential optimization

Open yanj14jy15 opened this issue 9 months ago • 0 comments

Hi, I wonder if it's possible to add a deduplication step before calculating MSAs in colabfold. When generating MSAs for a large batch of alphafold2-multimer-v3 analyses, I noticed that quite a few proteins are shared across different protein:protein pairs, and the MSA of each shared protein is recalculated every time it appears. For example, if I duplicate proteinA:proteinB 1000 times, colabfold will use mmseqs to calculate the MSA of proteinA 1000 times and the MSA of proteinB 1000 times, even though generating one MSA each for proteinA and proteinB should suffice.
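To illustrate the idea, here is a minimal sketch of the deduplication I have in mind. It is not colabfold's actual code: the function name `dedupe_chains` and the `(job_id, complex_sequence)` input shape are hypothetical, and it only assumes that the chains of a complex are joined by `:`. The MSA search would then run once per unique chain, and each pair would look up its chains by index.

```python
def dedupe_chains(jobs):
    """Collect unique chain sequences across many complexes.

    jobs: iterable of (job_id, complex_sequence), where the chains of a
    complex are joined by ':' (hypothetical input shape for this sketch).
    Returns (unique_chains, job_to_chain_ids).
    """
    chain_index = {}        # chain sequence -> position in the unique list
    job_to_chain_ids = {}   # job_id -> list of indices into the unique list
    for job_id, complex_seq in jobs:
        ids = []
        for chain in complex_seq.split(":"):
            chain = chain.strip().upper()  # normalize before comparing
            if chain not in chain_index:
                chain_index[chain] = len(chain_index)
            ids.append(chain_index[chain])
        job_to_chain_ids[job_id] = ids
    # dicts keep insertion order, so this lists chains by first appearance
    return list(chain_index), job_to_chain_ids
```

With 1000 copies of the same pair, `unique_chains` would contain only two sequences, so only two MSA searches would be needed instead of 2000.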

Additionally, when using mmseqs with multiple GPUs, I noticed that the precalculated indices are loaded and split across the GPUs. On an 80 GB A100 or H100, the entire database fits comfortably on a single GPU. So I wonder if it's possible to adjust how the databases are loaded onto the GPUs based on the available GPU memory. For example, would it be possible to keep a full copy of the database on each A100/H100 to reduce communication time, especially when the GPUs are not connected by NVLink? Thanks!
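A rough sketch of the kind of decision I mean is below. It does not reflect how mmseqs is implemented: `plan_db_placement`, its return labels, and the 20% headroom reserved for working memory are all assumptions made up for illustration.

```python
def plan_db_placement(db_bytes, gpu_mem_bytes, headroom=0.8):
    """Pick a placement strategy from database size and per-GPU memory.

    db_bytes: size of the database/index in bytes.
    gpu_mem_bytes: list with the memory of each GPU in bytes.
    headroom: fraction of each GPU assumed usable for the database
              (hypothetical value; the rest is left for working buffers).
    """
    usable = [int(mem * headroom) for mem in gpu_mem_bytes]
    if all(db_bytes <= u for u in usable):
        return "replicate"  # every GPU can hold a full copy
    if db_bytes <= sum(usable):
        return "shard"      # fits only when split across GPUs
    return "host"           # does not fit on the GPUs at all
```

With this policy, a database that fits within the usable memory of every card would be copied to each GPU, and splitting would only happen when a single card cannot hold it.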

Mar 06 '25 05:03 yanj14jy15