bamtools
Slow
It is not a serious problem, but Bamtools is about 20% slower than the Python version aka pysam.
Note that pysam is actually a Cython wrapper around the samtools API, which is written in C. Thus, pysam's speed comes from the fact that the vast majority of the work is done by C code, not Python code.
Ditto Aaron's reply - BamTools is completely independent of the samtools/pysam/picard codebases (with small exceptions in low-level, compression-related stuff that is similar by necessity).
Just curious - which operations are you noticing are slower? I'm happy to attempt optimizations when needed.
My experiment is very simple. I just enumerate through the reads in a bam file; for each read, I loop through each base.
If I use the original samtools C API, it takes like X seconds.
For bamtools' "GetNextAlignment", the time is about 2.6 X.
For pysam, it takes about 2.2 X seconds.
Note that if I use bamtools' "GetNextAlignmentCore", the time is only 1.1 X, but the sequence is not available. I guess there is a significant overhead in bamtools for the sequence extraction. (Is it dynamic allocation or re-decompression?)
The slowdown comes in constructing the string fields. Pre-computing the 'AlignedBases' sequence (applying CIGAR to QueryBases to add insertions, padding, etc) is almost surely the biggest culprit.
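To make the cost concrete, here is a rough sketch of what "applying CIGAR to QueryBases" involves. It is not the actual BamTools code: `CigarOp` and `buildAlignedBases` are hypothetical stand-ins, and the choice of filler characters (`-` for deletions, `N` for skipped regions, `*` for padding) is an assumption made for illustration. The point is that every alignment pays for a fresh string that is grown piece by piece.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one CIGAR operation (not the BamTools type).
struct CigarOp {
    char type;           // one of M, =, X, I, D, N, S, H, P
    std::size_t length;  // number of bases the operation covers
};

// Builds an "aligned" copy of the read by walking the CIGAR operations.
// Filler characters are assumptions for this sketch.
std::string buildAlignedBases(const std::string& queryBases,
                              const std::vector<CigarOp>& cigar) {
    std::string aligned;
    std::size_t pos = 0;  // current offset into queryBases

    for (const CigarOp& op : cigar) {
        switch (op.type) {
            case 'M': case '=': case 'X': case 'I':
                // These operations consume read bases: copy them over.
                assert(pos + op.length <= queryBases.size());
                aligned.append(queryBases, pos, op.length);
                pos += op.length;
                break;
            case 'S':
                // Soft clip: the bases exist in the read but are skipped.
                pos += op.length;
                break;
            case 'D':
                aligned.append(op.length, '-');  // deletion from the reference
                break;
            case 'N':
                aligned.append(op.length, 'N');  // skipped reference region
                break;
            case 'P':
                aligned.append(op.length, '*');  // padding
                break;
            default:
                break;  // hard clip ('H'): nothing stored in the read
        }
    }
    return aligned;
}
```

For example, the read `ACGTACGT` with CIGAR `2S3M2D3M` becomes `GTA--CGT`. Doing this for every record, whether or not the caller ever looks at the result, is the kind of overhead described above.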
If I had to do it over, I'd make the member fields accessible through getters and lazy-evaluate the expensive data. A bit late to do so now.
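A minimal sketch of what that getter-based, lazily evaluated design might look like. This is purely illustrative and not part of BamTools: `LazyAlignment`, `queryBases()`, and `decodeCount()` are hypothetical names. The only part taken from the BAM format itself is the 4-bit base encoding (`=ACMGRSVTWYHKDBN`, two bases per byte, high nibble first).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alignment record that keeps the packed sequence as read from
// the file and only builds the std::string when someone asks for it.
class LazyAlignment {
public:
    LazyAlignment(std::vector<std::uint8_t> packed, std::size_t length)
        : packed_(std::move(packed)), length_(length) {
        // Two bases per byte, so the buffer must cover 'length' bases.
        assert(packed_.size() * 2 >= length_);
    }

    // Decodes on first call, then returns the cached string.
    const std::string& queryBases() const {
        if (!decoded_) {
            static const char kLookup[] = "=ACMGRSVTWYHKDBN";
            bases_.reserve(length_);
            for (std::size_t i = 0; i < length_; ++i) {
                const unsigned byte = packed_[i / 2];
                // Even index: high nibble; odd index: low nibble.
                const unsigned code = (i % 2 == 0) ? (byte >> 4) : (byte & 0x0Fu);
                bases_.push_back(kLookup[code]);
            }
            decoded_ = true;
            ++decodeCount_;
        }
        return bases_;
    }

    // Exposed only so the sketch can show that decoding happens once.
    int decodeCount() const { return decodeCount_; }

private:
    std::vector<std::uint8_t> packed_;
    std::size_t length_;
    mutable std::string bases_;
    mutable bool decoded_ = false;
    mutable int decodeCount_ = 0;
};
```

With this shape, a loop that only inspects positions or flags never pays for string construction, and a loop that does read the bases pays once per record. The trade-off is that the getter is not thread-safe as written, since it mutates the cache without synchronization.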
Edit: It's not compression-related; that's all done via the reader. It's purely building up the string objects themselves.