Call mods is too slow
It's come to my attention that modkit modbam call-mods is too slow, especially when all-context mod tags are present. I've been able to reproduce this myself. A fix or explanation is coming.
I was finally getting around to writing my little test case, and then I saw your comment. But I'll include it anyway in case it helps.
Test data:
wget https://s3.kopah.orci.washington.edu/stergachis/public/Mitchell/temp/call-mod-test.bam
Command I am running:
time modkit call-mods -t 16 -p 0.1 call-mod-test.bam /dev/null && time samtools view -c call-mod-test.bam
Results:
> in the next version of modkit this command will be `modkit modbam call-mods`
> attempting to sample 10042 reads
> done, 11539 records processed
real 0m37.636s
user 1m20.017s
sys 0m1.361s
11539
real 0m1.283s
user 0m1.224s
sys 0m0.052s
This shows that call-mods with 16 threads is about 30 times slower in wall-clock time than samtools view -c with 1 thread. Obviously, simply counting will be faster, but I didn't expect it to be that much faster.
It also seems to get worse the more data I include, but I haven't validated that. This works out to ~10,000 reads every 30 seconds, which is about 4 hours of wall clock time for a five-million-read dataset.
5_000_000/10_000 *30 /3600
4.166666666666667
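The same back-of-the-envelope estimate as a small Python sketch, this time also plugging in the measured numbers from the run above rather than only the rounded ones. The function name is just for illustration, and it assumes runtime scales linearly with read count, which I haven't validated.

```python
def projected_hours(total_reads: int, reads_per_batch: int, seconds_per_batch: float) -> float:
    """Extrapolate wall-clock hours, assuming runtime scales linearly with read count."""
    return total_reads / reads_per_batch * seconds_per_batch / 3600


# Rounded figures: ~10,000 reads every 30 seconds.
rounded = projected_hours(5_000_000, 10_000, 30)

# Measured figures from the test run: 11,539 records in 37.636 s of wall-clock time.
measured = projected_hours(5_000_000, 11_539, 37.636)

# Wall-clock ratio of call-mods (16 threads) to samtools view -c (1 thread).
slowdown = 37.636 / 1.283

print(f"rounded estimate:  {rounded:.2f} h")
print(f"measured estimate: {measured:.2f} h")
print(f"slowdown vs samtools view -c: {slowdown:.1f}x")
```

Either way the projection lands in the 4 to 5 hour range for five million reads.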
It is not impossibly slow, but a speedup would help!
Hey @mrvollger, thanks for the test data! I'll get to this once I get a POC of the new thresholding technique* out.
* The idea is to make a threshold per genomic position.