pLM-BLAST
[QUESTION/SUGGESTION] Determining realistic use cases and limitations
Hello,
I'm coming back to pLM-BLAST after leaving it for a while. I really like the approach and wanted to see what improvements had been made since October 2023. I'm still hoping to apply it to a very large dataset (1000 proteomes), but I need to confirm that it's realistically feasible.
I'm happy to say that the parallelization improvement has really helped. It has made GPU compute time no longer a limitation for me. (So, thank you!)
However, I'm now moving on to calculating the size of the embeddings, and it seems like these will get very large very quickly. My current estimate is that I will need at least 8 TB for all my proteomes. Does this sound correct?
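For reference, here is the rough back-of-envelope calculation I'm working from. The proteome size, average protein length, and float32 precision are my own assumptions (I'm assuming 1024-dimensional per-residue ProtT5-style embeddings), so the real total will shift with the actual data:

```python
# Rough storage estimate for per-residue embeddings.
# All of the numbers below are assumptions, not measurements.

n_proteomes = 1000
proteins_per_proteome = 4_000   # assumed average (bacterial-sized proteomes)
avg_protein_length = 350        # assumed average number of residues
embedding_dim = 1024            # per-residue embedding size (ProtT5-style)
bytes_per_value = 4             # float32

total_bytes = (n_proteomes * proteins_per_proteome
               * avg_protein_length * embedding_dim * bytes_per_value)
print(f"Estimated storage: {total_bytes / 1e12:.1f} TB")
# ~5.7 TB with these assumptions; larger proteomes or longer proteins
# push this toward or past the 8 TB figure above.
```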
While I think I can come up with a temporary storage solution on our HPC, I am now wondering how this would then affect query times and memory. How do these query resources scale with the size/number of embeddings?
Perhaps it would be useful if the documentation could break down a few use cases with realistic compute times (GPU and CPU) and RAM requirements for processing...