Slow

Open hoangmit opened this issue 12 years ago • 4 comments

It is not a serious problem, but BamTools is about 20% slower than the Python version, i.e. pysam.

hoangmit, Apr 18 '12 17:04

Note that pysam is actually a Cython wrapper of the samtools API, which is written in C. Thus, pysam's speed comes from the fact that the vast majority of the work is done by C code, not Python code.

arq5x, Apr 18 '12 17:04

Ditto Aaron's reply - BamTools is completely independent of the samtools/pysam/picard codebases (with small exceptions in low-level, compression-related stuff that is similar by necessity).

Just curious - which operations are you noticing are slower? I'm happy to attempt optimizations when needed.

pezmaster31, Apr 18 '12 17:04

My experiment is very simple. I just enumerate through the reads in a bam file; for each read, I loop through each base.

If I use the original samtools C API, it takes about X seconds.

For bamtools' "GetNextAlignment", the time is about 2.6 X.

For pysam, it takes about 2.2 X.

Note that if I use bamtools' "GetNextAlignmentCore", the time is only 1.1 X, but the sequence is not available. I guess there is significant overhead in bamtools for the sequence extraction. (Is it dynamic allocation or re-decompression?)

hoangmit, Apr 18 '12 19:04

The slowdown comes from constructing the string fields. Pre-computing the 'AlignedBases' sequence (applying CIGAR to QueryBases to add insertions, padding, etc.) is almost surely the biggest culprit.
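
To make the cost concrete, here is a rough sketch of what "applying CIGAR to QueryBases" involves. It is not the BamTools source: `CigarOpSketch` and `buildAlignedBases` are hypothetical names, and the choice of '-' for deletions, '*' for padding and 'N' for skipped regions is an assumption made for illustration. The point is only that every alignment pays for a fresh string that grows piece by piece.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a CIGAR operation (not the BamTools type).
struct CigarOpSketch {
    char type;            // one of M, I, D, N, S, H, P, =, X
    std::uint32_t length; // number of positions the operation covers
};

// Expand the query bases into an "aligned" string by walking the CIGAR.
// Assumes the CIGAR is consistent with the length of queryBases.
std::string buildAlignedBases(const std::string& queryBases,
                              const std::vector<CigarOpSketch>& cigar) {
    std::string aligned;
    std::size_t pos = 0; // current offset into queryBases
    for (const CigarOpSketch& op : cigar) {
        switch (op.type) {
            case 'M': case 'I': case '=': case 'X':
                // Bases present in the read: copy them over.
                aligned.append(queryBases, pos, op.length);
                pos += op.length;
                break;
            case 'S':
                // Soft clip: bases exist in the read but are left out.
                pos += op.length;
                break;
            case 'D':
                aligned.append(op.length, '-'); // deletion (assumed gap char)
                break;
            case 'P':
                aligned.append(op.length, '*'); // padding (assumed pad char)
                break;
            case 'N':
                aligned.append(op.length, 'N'); // skipped reference region
                break;
            default:
                // Hard clip ('H') or unknown op: nothing to add.
                break;
        }
    }
    return aligned;
}
```

For example, query bases "ACGTAC" with CIGAR 2S2M1D2M would come out as "GT-AC" under these assumptions. A loop that only wants the raw bases never needs this string, yet it is built for every read anyway.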

If I had to do it over, I'd make the member fields accessible through getters and lazy-evaluate the expensive data. A bit late to do so now.
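
A minimal sketch of that getter-plus-lazy-evaluation idea, for anyone curious. `LazyAlignment` is a hypothetical class, not part of BamTools, and for simplicity it assumes one 4-bit base code per byte (real BAM records pack two codes per byte). The expensive string is built only the first time someone asks for it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type illustrating lazy evaluation behind a getter.
class LazyAlignment {
public:
    explicit LazyAlignment(std::vector<std::uint8_t> codes)
        : codes_(std::move(codes)) {}

    // Cheap accessor: never builds the decoded string.
    std::size_t length() const { return codes_.size(); }

    // Expensive accessor: decodes on the first call, then reuses the cache.
    const std::string& queryBases() const {
        if (!decoded_) {
            cache_ = decode(codes_);
            decoded_ = true;
            ++decodeCount_;
        }
        return cache_;
    }

    // Exposed only so the caching behaviour can be observed.
    int decodeCount() const { return decodeCount_; }

private:
    // Map each 4-bit code to its base letter (BAM's code table).
    static std::string decode(const std::vector<std::uint8_t>& codes) {
        static const char kTable[] = "=ACMGRSVTWYHKDBN";
        std::string out;
        out.reserve(codes.size());
        for (std::uint8_t c : codes) {
            out.push_back(kTable[c & 0x0F]);
        }
        return out;
    }

    std::vector<std::uint8_t> codes_;
    // mutable: the cache is filled from a const getter.
    mutable std::string cache_;
    mutable bool decoded_ = false;
    mutable int decodeCount_ = 0;
};
```

Callers that only need `length()` never pay for the string, and callers that do need the bases pay once. Note that the mutable cache as written is not thread-safe; sharing one object across threads would need a lock or `std::call_once`.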

Edit: It's not compression-related; that's all done via the reader. It's purely building up the string objects themselves.

pezmaster31, Apr 18 '12 19:04