
Call mods is too slow

Open ArtRand opened this issue 7 months ago • 2 comments

It's come to my attention that modkit modbam call-mods is too slow, especially when all-context mod tags are present. I've been able to reproduce the slowdown myself. A fix or explanation is coming.

ArtRand, May 02 '25 23:05

I was finally getting around to writing my little test case, and then I saw your comment. But I'll include it anyway in case it helps.

Test data:

wget https://s3.kopah.orci.washington.edu/stergachis/public/Mitchell/temp/call-mod-test.bam

Command I am running:

time modkit call-mods -t 16 -p 0.1 call-mod-test.bam /dev/null  && time samtools view -c call-mod-test.bam

Results:

> in the next version of modkit this command will be `modkit modbam call-mods`
> attempting to sample 10042 reads
> done, 11539 records processed

real    0m37.636s
user    1m20.017s
sys     0m1.361s


11539

real    0m1.283s
user    0m1.224s
sys     0m0.052s

This shows that call-mods with 16 threads is about 30 times slower in wall-clock time than samtools view -c with 1 thread. Obviously, simply counting will be faster, but I didn't expect it to be that much faster.

It also seems to get worse the more data I include, but I haven't validated that. This works out to ~10,000 reads every 30 seconds, which is about 4 hours of wall clock time for a five-million-read dataset.

5_000_000/10_000 *30 /3600
4.166666666666667
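
The same estimate can be redone from the measured figures instead of the rounded ones. This is only a rough sketch: the constants are copied from the `time` output above, the function names are made up for illustration, and the projection assumes throughput stays constant as the input grows, which has not been validated.

```python
# Timings and record count copied from the `time` output above.
MODKIT_WALL_S = 37.636     # modkit call-mods, 16 threads (real)
SAMTOOLS_WALL_S = 1.283    # samtools view -c, 1 thread (real)
RECORDS = 11_539           # records processed in the test BAM


def reads_per_second(records: int, wall_s: float) -> float:
    """Observed throughput in reads per wall-clock second."""
    return records / wall_s


def projected_hours(total_reads: int, records: int, wall_s: float) -> float:
    """Projected wall-clock hours, ASSUMING throughput scales linearly."""
    return total_reads / reads_per_second(records, wall_s) / 3600


slowdown = MODKIT_WALL_S / SAMTOOLS_WALL_S
rate = reads_per_second(RECORDS, MODKIT_WALL_S)
hours = projected_hours(5_000_000, RECORDS, MODKIT_WALL_S)

print(f"wall-clock slowdown vs samtools: {slowdown:.1f}x")
print(f"throughput: {rate:.0f} reads/s")
print(f"projected time for 5M reads: {hours:.1f} h")
```

If runtime really does grow faster than linearly with input size, the projected figure is a lower bound.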

It is not impossibly slow, but a speedup would help!

mrvollger, Jun 13 '25 21:06

Hey @mrvollger, thanks for the test data! I'll get to this once I get a POC of the new thresholding technique* out.

* The idea is to make a threshold per genomic position.

ArtRand, Jun 15 '25 14:06