ncbi-genome-download High memory consumption (50 GB) for genbank bacteria

High memory consumption (50 GB) for genbank bacteria

Open openpaul opened this issue 4 years ago • 9 comments

I was using ncbi-genome-download to fetch all genbank bacteria (currently 630k). This seems to be a challenging task, I am unsure if this is even feasible this way. I noticed it will require up to 50GB of memory and wondered how that can be. I am currently debugging the tool to figure it out, but thought maybe it is known to someone. This would save me some trouble. I am launching it like this:

ncbi-genome-download -F fasta -s genbank     \
    --human-readable        \
    --retries 2     \
    --parallel 4    \
    --no-cache     \
    --verbose     \
    -o genomes     \
    bacteria

Apr 30 '20 15:04 openpaul

Hi, at what point is the memory usage up that high? I don't currently have the disk space or internet connection to really try downloading all of bacterial GenBank, but at least during the selection and MD5SUMS downloading step the memory usage is driven by your --parallel value and doesn't go above 2 GB for me with your input parameters.

May 10 '20 14:05 kblin

I observe memory issues as well. Here is what I did:

I wrote a Python script (using the Python interface) to download a list of custom genomes. All of this happened within one Python script.
This process was killed by the system due to excessive memory usage, which was evidenced by the kernel logs. (I have 16 GB of RAM plus some swap.)
~~The kill happened when trying to download something that wasn’t there (“No downloads matched your filter.”).~~ Moreover, when I used the same script to download only the genome during whose attempted download the kill occurred it worked fine and without an excessive memory usage.

All of this screams memory leak to me.

May 16 '20 08:05 Wrzlprmft

Running something similar again, I noticed that NCBI Genome Download appears to be spawning more Python processes, which each consume a bit of memory and amass over time. Their number exceeds the value of the parallel argument.

May 16 '20 18:05 Wrzlprmft

Hi, so I actually finished downloading genbank bacteria. In the successfull run it actually used "only" 20 Gb of memory in 4 threads, so 5 Gb per thread, which is reasonable given the number of files. Still I assume this could be lowered. Here a plot from mprofile:

Screenshot_2020-05-18_11-13-39

Edit: I am not sure why it used only 20 Gb of memory this time where last time it requested more than that (50Gb)

I looked at the code and it seems a strange to me to first collect all jobs and then work them off, instead of launching them directly (in core.py::config_download). List download_jobs is created and used once in the next loop. Maybe thats the issue? But I guess a more thorough code profiling has to be done to really find where the memory stacks up.

May 18 '20 09:05 openpaul

5 Gb per thread, which is reasonable given the number of files

Why would a downloading job require that much memory?

May 18 '20 10:05 Wrzlprmft

Its over 600.000 files, all of which require multiple informations in memory. So per download candidate its only 20Gb/630.000, which is 31Kb, which itself seems reasonable. Or does it not? Still, as mentioned, I think the logic could be improved upon, reducing memory consumption. Right now I don't have the time to suggest and test alternatives, but maybe in a week or two I might be able to.

May 18 '20 10:05 openpaul

So per download candidate its only 20Gb/630.000, which is 31Kb, which itself seems reasonable.

Not really. NCBI stores file information in assembly summaries, most of which are below 1 kB and even then, you only need a fraction of this for downloading. While these files can be much larger (Megabytes) for some species like E. coli, there is no reason to keep all of this information for all species in memory all the time.

May 18 '20 12:05 Wrzlprmft

ncbi-genome-download was built to allow intelligent filtering of files, not to just go grab all of them, and only the large file downloads were run in parallel, which is why there's this collect and execute split. If you don't use the tool to download all the things at once, the memory consumption is very reasonable, and that's what I've built it for. I'm happy to look at patches to restructure things, but it's not trivial and if you don't really need the filtering, a simpler script parsing the assembly_summary.txt file and then just downloading all those files might serve your needs better.

May 18 '20 15:05 kblin

I mean I agree that the tool could be more careful with dumping things from memory again, but it was build for workflows like "get me all assemblies of genus Streptomyces that have an assembly level of 'complete' or 'chromosome' from GenBank", and not "download all 630000 bacterial genomes on the server".

May 18 '20 15:05 kblin

ncbi-genome-download ncbi-genome-download copied to clipboard

High memory consumption (50 GB) for genbank bacteria

ncbi-genome-download
ncbi-genome-download copied to clipboard