pileup on (large) metagenome - parallelization
Recently, I've been running modkit pileup with default settings on a metagenomic PromethION run (130 Gb).
The assembled metagenome consists of about 400 Mbp of contigs, ranging in size from a few thousand bp to 5 Mbp, with coverage ranging from 1X to 1000X or more.
modkit pileup is the bottleneck in my pipeline: it needs a lot of memory (peak about 350 GB) and very long compute times (e.g. >12 h on 24 cores).
In the documentation it's stated that --interval-size, --sampling-interval-size, and --chunk-size can be modified to improve parallelism.
What would be the best settings for my use case?
Thanks!
Bram
Hello @brambloemen,
Could you try --interval-size 50000 and --chunk-size 12 (assuming you have also set --threads 24)? You may also set --queue-size 500 or less (with or without the above adjustments).
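As a rough sketch, the suggested options could be combined into one invocation like this. The helper `build_pileup_cmd` and the file names `reads.bam` and `pileup.bed` are placeholders for illustration, not part of modkit. The positional order (input BAM, then output bedMethyl) is assumed from typical modkit usage, so check `modkit pileup --help` for your version.

```python
import shlex


def build_pileup_cmd(bam, out_bed, threads=24, interval_size=50_000,
                     chunk_size=12, queue_size=500):
    """Assemble a modkit pileup command line (hypothetical helper).

    Defaults mirror the values suggested in this thread; they are a
    starting point to test, not known-optimal settings.
    """
    return [
        "modkit", "pileup", bam, out_bed,
        "--threads", str(threads),
        "--interval-size", str(interval_size),
        "--chunk-size", str(chunk_size),
        "--queue-size", str(queue_size),
    ]


# Placeholder file names; substitute your own paths.
cmd = build_pileup_cmd("reads.bam", "pileup.bed")

# Print the command for inspection instead of running it.
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to compare parameter combinations before committing to a multi-hour run.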
Which base modification models have you run on these reads?
Hi @ArtRand, I have run the 4mC_5mC and 6mA models: [email protected]_4mC_5mC@v3 and [email protected]_6mA@v3.
Also, I'm a bit confused about --chunk-size 12, as the documentation states:
will set this value to 1.5x the number of threads specified, so if 4
threads are specified the chunk_size will be 6. A warning will be
shown if this option is less than the number of threads specified
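Applied to the numbers in this thread, the quoted rule works out as follows. This is only a sketch of the documented behaviour, not modkit's actual code: the function names are made up for illustration, and the integer rounding is an assumption.

```python
def default_chunk_size(threads):
    """Default chunk size per the quoted docs: 1.5x the thread count.

    Integer rounding is an assumption; the docs only give the 4 -> 6 example.
    """
    return (threads * 3) // 2


def triggers_warning(chunk_size, threads):
    """The docs say a warning is shown when chunk size < threads."""
    return chunk_size < threads


print(default_chunk_size(4))      # 6, matching the example in the docs
print(default_chunk_size(24))     # 36, the default for --threads 24
print(triggers_warning(12, 24))   # True: 12 is below the thread count
```

So with --threads 24 the default would be 36, and the suggested --chunk-size 12 is both well below that default and below the thread count, which is the case the documentation says produces a warning.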
I did a quick test with different parameters on a single genome consisting of one large 4.4 Mbp contig and some smaller (17 kbp, 6 kbp) contigs.