modkit extract takes very long
Hello,
I'm using modkit extract like this:
modkit extract \
--reference <fastafile> \
--include-bed <bedfile> \
--threads 16 \
--log-filepath <logfile> \
<input> \
<output>
However, it runs for a very long time (7 hours or more for ~4 GB BAM files).
I see that modkit uses more cores sometimes, but just for a short period of time and then it runs only on one core for most of the time.
The BED file that I am using contains the Illumina 850k array positions, so ~850k lines.
The BAM files are sorted, indexed and were generated using modkit adjust-mods --convert h m on the original dorado call BAM files.
What could be the issue here? Shouldn't it extract the modcalls much faster?
Thank you for your response.
Hello @pkerbs,
Could you tell me what version you are using? Version v0.2.2 had a large performance regression that was fixed in v0.2.3 onwards.
Ah yes, sorry, forgot to mention.
I am using version 0.2.4
Hello @pkerbs,
I see. The problem is probably the filtering algorithm, which is not optimized for many small regions like the ones in your BED file. I appreciate you reporting this use case and the slowdown. I can have a fix in (hopefully) the next release, or certainly the one following; we're working on performance improvements all around. If you can get by without all of the read-level detail in extract, the algorithm in pileup should be much more efficient. Alternatively, run modkit extract on the whole modBAM and do the inner join on your 850k sites afterward. Sorry for the inconvenience, I'll let you know when I have a solution to this problem.
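For anyone who wants to try the "extract everything, then join" route, here is a minimal sketch of the inner join in Python. It is not part of modkit. It assumes the extract output is a tab-separated table with a header containing `chrom` and `ref_position` columns (check the header of your own output and adjust the names), that `ref_position` is 0-based, and that each BED line is a single-base interval, so the BED start column can be matched directly. The function names are made up for illustration.

```python
import csv


def load_bed_sites(bed_lines):
    """Collect (chrom, start) pairs from BED lines.

    Assumes each interval is a single base, so the 0-based start
    identifies the site.
    """
    sites = set()
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        sites.add((fields[0], int(fields[1])))
    return sites


def filter_extract_rows(tsv_lines, sites,
                        chrom_col="chrom", pos_col="ref_position"):
    """Yield only the rows whose (chrom, position) is in `sites`.

    The column names are assumptions about the extract table header.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    for row in reader:
        if (row[chrom_col], int(row[pos_col])) in sites:
            yield row
```

In practice you would pass open file handles for the BED file and the extract table instead of lists of lines. The site set for 850k positions fits comfortably in memory, and the extract table is streamed row by row.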
Hi @ArtRand, thank you very much for your quick assessment and your work on that. For now, I will use your recommendations then.
Just following up on the same issue. Using v0.3.0, modkit extract with --include-bed is very sluggish (1 h for a few dozen BED lines). It's actually way faster to run modkit extract with --region multiple times and then combine the outputs.
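In case it helps others, the "combine outputs" step can be sketched roughly as below. This assumes every per-region run writes a tab-separated table with the same header line; the helper name is hypothetical and nothing here comes from modkit itself.

```python
def combine_tsv(parts):
    """Concatenate several TSV line streams, keeping a single header.

    `parts` is an iterable of iterables of lines (for example open file
    handles). Empty inputs are skipped, and a differing header raises
    an error instead of silently mixing tables.
    """
    header = None
    for lines in parts:
        it = iter(lines)
        first = next(it, None)
        if first is None:
            continue  # empty output, e.g. a region with no reads
        if header is None:
            header = first
            yield header
        elif first != header:
            raise ValueError("per-region outputs have different headers")
        yield from it
```

Writing the result is then just `out.writelines(combine_tsv(handles))` with `handles` being the opened per-region files.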
Is there some update on the issue?
Hello @ppapasaikas,
I think you're hitting the known issue from above. What is the --include-bed file you're using like (i.e., how many intervals, and how long are they on average)? To get the most out of extract, instead of processing an entire modBAM, grab just the reads you want. Sounds like you've more or less arrived at this approach yourself. For clarity, the "intersect" algorithm in extract is not ideal for many small intervals.
Hi @pkerbs and @ppapasaikas, could you let me know how you installed or built modkit? cargo build may generate a debug build, which runs slowly. On version 0.3.2, built with cargo build --release or cargo install, I could not replicate the performance issue. Thanks.