Chunking for uBAM

Open rhpvorderman opened this issue 1 year ago • 2 comments

Currently chunking only works for FASTQ. See https://github.com/marcelm/cutadapt/issues/811

Oct 07 '24 14:10 rhpvorderman

Oh, interesting, I guess this needs to be done at the bgzip level?

Oct 07 '24 14:10 marcelm

No, not really. A bgzip file is just a series of concatenated gzip members. There is no requirement for the bgzip blocks to be split at BAM record boundaries: a BAM record can start in one block and end in another, even if it could fit entirely in a block of its own. Nanopore records will often exceed the maximum size of a bgzip block (64 KiB) anyway.
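To illustrate the point about records spanning blocks, here is a minimal sketch using only Python's standard library. It uses plain gzip members as a stand-in for real bgzip blocks, and the toy record is made up for the example:

```python
import gzip
import struct

# A toy "BAM record": 4-byte little-endian block_size followed by the payload.
record = struct.pack("<I", 8) + b"ABCDEFGH"

# Cut the record in the middle and compress each half as its own gzip member.
# Real bgzip blocks also carry an extra "BC" header field; plain gzip members
# are used here as a stand-in.
first_member = gzip.compress(record[:5])
second_member = gzip.compress(record[5:])

# Decompressing the concatenation yields one continuous byte stream,
# so the member boundary is invisible to whatever parses the records.
stream = gzip.decompress(first_member + second_member)
```

The decompressed `stream` is identical to the original record, even though the split fell in the middle of the payload.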

So we can just decompress the whole thing as one big file stream and parse the records out. We already do this for single-end. For chunking we can make use of the fact that each BAM record stores its block size at the beginning, so there is no need to parse the entire record to find where the next one starts. Chunking should be much faster than for FASTQ.
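A rough sketch of what hopping from record to record could look like, assuming the data is already decompressed, held in memory, and positioned past the BAM header. The function name `chunk_end` and its parameters are hypothetical and not taken from cutadapt or dnaio:

```python
import struct


def chunk_end(data: bytes, start: int, limit: int) -> int:
    """Return the offset just past the last complete record ending at or
    before `limit`, starting from a record boundary at `start`.

    Assumes each record is a 4-byte little-endian block_size followed by
    block_size bytes, as in the BAM alignment section.
    """
    limit = min(limit, len(data))
    pos = start
    while pos + 4 <= limit:
        # Only the 4-byte size field is read; the record body is skipped.
        (block_size,) = struct.unpack_from("<I", data, pos)
        next_pos = pos + 4 + block_size
        if next_pos > limit:
            break  # This record is incomplete within the limit.
        pos = next_pos
    return pos
```

If the returned offset equals `start`, no complete record fits before `limit`, and a caller would have to read more data before cutting a chunk.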

Oct 07 '24 17:10 rhpvorderman