htslib
htslib copied to clipboard
Add an inline version of bgzf_read for small reads.
The bgzf_read function is long and not something that's likely to get inlined even if it was in the header.
However most of the time our calls request a small amount of data and they fit within the buffer we've read, so we offer a static inline to do the memcpy when we can, falling back to the long function when we cannot.
In terms of CPU time it's not much difference, but the key thing is that it's often CPU time saved in a main thread given the bulk of the decode is often threaded. An example of test_view -B -@16
develop:
real 0m48.158s
user 6m2.901s
sys 0m28.134s
real 0m48.730s
user 6m3.707s
sys 0m28.473s
real 0m48.653s
user 6m5.215s
sys 0m28.637s
This PR:
real 0m41.731s
user 5m59.780s
sys 0m30.393s
real 0m41.945s
user 6m0.367s
sys 0m30.426s
So we can see it's a consistent win when threading, potentially 10-15% faster throughput depending on work loads.
/usr/bin/time was reporting ~810% CPU utilisation for develop and ~940% for this PR, demonstrating the reduced time in main thread.
Commenting out the sanity checks in bam_read1
for if (c->n_cigar > 0)
onwards increases that CPU utilisation to 1060% (1120% if I get rid of the tag2cigar call), leading to 37s real time. This shows that there is room for improvement, as these sort of checks ought to be executed in the multi-threaded decode instead. We thread the bgzf_read and not the bam_read.
This is why multi-threaded decoding of SAM and SAM.gz is sometimes more performant than BAM, despite the string->integer conversion overheads, as SAM parsing is entirely threadable.