htslib icon indicating copy to clipboard operation
htslib copied to clipboard

Bgzip text mode (try 2)

Open jkbonfield opened this issue 1 year ago • 0 comments

An alternative to @mlin's #1369 for consideration.

The two implementations differ by:

  • Using kstring's kgetline3 (a new variant of kgetline2), vs manual newline scanning.
  • Efficiency (my version has considerably less overhead)
  • Code complexity (my version adds around twice as many new lines of code)
  • Text mode explicit (Mike's) vs text mode implicit (--binary to disable).
  • IO block sizes are freer (Mike's) vs required to exactly match the BGZF block size (mine).

At bgzf -l1 Mike's text mode added 6.2% CPU cycles, vs +0.4% for mine. However at level 6 these are considerably reduced as the Deflate overhead starts to dominate and these drop to 2.7% vs 0.0%. So maybe this isn't such an issue anyway. The main difference in CPU comes from searching backwards for the first newline vs searching forwards for the last newline. On long read technologies these two implementations start to converge a bit, but kgetline still has additional overhead with extra memcpys.

I also found the original implementation wasn't entirely foolproof when dealing with headers that start with lines over 64KB long. This is admittedly an extreme corner case, but it does mean the extra complexity of the new code may be doing something more anyway.

jkbonfield avatar Aug 11 '22 14:08 jkbonfield