foldseek
controlling memory usage
What is the best way to control memory usage for foldseek searches with big databases? In particular, I'm searching the full UniProt database with the following command:
foldseek easy-search "query.pdb" UniProt results.html tmpDir --threads 16 --split-memory-limit 40G --format-mode 3 --max-seqs 100 -e 0.001
And it is using 111 GB of memory. Can you point me to which parameters would be most helpful for decreasing memory usage without sacrificing too much speed? (Running without --split-memory-limit 40G, memory usage is 144 GB.)
I'm running on a shared cluster, so it would be nice to be able to roughly predict memory usage, so I can request the appropriate amount of memory from the job scheduler.
You can use the new --prefilter-mode 1, which is not memory limited. It quickly computes all possible ungapped alignments.
It should be especially helpful for single query searches, where foldseek was not able to use its multithreading capabilities.
Interestingly, when I run:
foldseek easy-search "query.pdb" UniProt results.html tmpDir --threads 16 --prefilter-mode 1 --split-memory-limit 20G --format-mode 3 --max-seqs 100 -e 0.001
The execution time halves from 1 hour to 30 minutes, but the memory usage seems to stay the same.
We use mmap to read our ss database in prefilter mode 1. Every page is brought into memory on access, but the operating system can evict pages at any time to reclaim memory; managing this is entirely up to the OS. So --split-memory-limit 20G will not help here.
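A minimal sketch of the mmap behavior described above, in Python (the file name and sizes are made up for illustration): the file's pages fault into the OS page cache on first access, so resident memory grows as the file is read, but these pages are clean and the kernel may drop them under memory pressure. No process-level flag caps this, which is why a memory limit set by the application has no effect on mmap-backed reads.

```python
import mmap
import os

# Hypothetical file standing in for a memory-mapped database:
# write four 4 KiB pages of dummy data.
path = "example.db"
with open(path, "wb") as f:
    f.write(b"A" * 4096 * 4)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Touching one byte per page faults that page into the page cache.
    # The pages are clean (read-only), so the kernel can evict them at
    # any time; the process itself imposes no limit on how many stay
    # resident.
    total = sum(mm[i] for i in range(0, len(mm), 4096))
    mm.close()

os.remove(path)
print(total)  # 4 pages * ord("A") = 260
```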
Also, 30 minutes seems too slow. Is it a single query? What system do you use?
It was two queries, running on an Intel Xeon Gold 6230 CPU with 400 GB of RAM. It's an HPC system where the foldseek databases are stored on a network drive. I'm not sure of the storage details, but it's definitely not a local SSD, so data transfer between the drive and the worker node could be speed limiting.
It's a shared system managed by a job scheduler. There was a node crash recently due to overdrawing the memory. I had a foldseek job running on the node at the time, using 140 GB of RAM, and the admin thought it might have been partly responsible for the crash. So I'm looking into ways to predict and manage foldseek memory usage, so I can request the right amount from the job scheduler.
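One practical way to size a scheduler request, sketched below under the assumption of a Linux node: run a small-scale test job and read the process's peak resident set size afterwards via `getrusage`, then pad that figure with headroom. Note that `ru_maxrss` is reported in kilobytes on Linux but bytes on macOS.

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Peak resident set size of this process, in MB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss: kilobytes on Linux, bytes on macOS.
    divisor = 1024 if sys.platform.startswith("linux") else 1024 * 1024
    return rss / divisor

# Allocate some memory so the peak is visibly nonzero, then report it.
buf = bytearray(50 * 1024 * 1024)  # ~50 MB
print(f"peak RSS: {peak_rss_mb():.0f} MB")
```

Externally, GNU time (`/usr/bin/time -v some_command`) reports the same "Maximum resident set size" for a whole child process, which is handy for wrapping a trial foldseek run. With a measured peak in hand, you can add, say, 20-30% headroom to the memory you request from the scheduler. Note that mmap'd pages inflate this number even though they are reclaimable, so the measured peak is a conservative upper bound.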
I'm not too familiar with how mmap works; maybe it operates dynamically enough that it would release pages back to disk and never crash a node?