
cram/writer: Add CRAM 3.1 support

Open zaeleus opened this issue 1 year ago • 0 comments

This is a tracking issue for CRAM 3.1 write support.

The CRAM 3.1 format is still a draft, but there is interest in having noodles-cram demonstrate a second implementation.

The format is structurally the same as 3.0, but 3.1 includes additional block compression methods: rANS Nx16 (5), adaptive arithmetic coder (6), fqzcomp (7), and name tokenizer (8). noodles' CRAM reader supports decoding these compression methods already.
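As a rough illustration of the method IDs above, here is a minimal sketch of an enum covering the block compression methods. The type and method names are hypothetical and are not taken from noodles. IDs 5 through 8 come from the list above; IDs 0 through 4 are assumed from the CRAM 3.0 specification.

```rust
use std::convert::TryFrom;

/// Block compression methods (hypothetical type, not the noodles API).
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
pub enum CompressionMethod {
    // IDs 0-4: assumed from the CRAM 3.0 specification.
    None,
    Gzip,
    Bzip2,
    Lzma,
    Rans4x8,
    // IDs 5-8: added in the CRAM 3.1 draft.
    RansNx16,
    AdaptiveArithmeticCoding,
    Fqzcomp,
    NameTokenizer,
}

impl CompressionMethod {
    /// Returns whether this method is only valid in CRAM 3.1 or later.
    pub fn requires_cram_3_1(self) -> bool {
        matches!(
            self,
            Self::RansNx16
                | Self::AdaptiveArithmeticCoding
                | Self::Fqzcomp
                | Self::NameTokenizer
        )
    }
}

impl TryFrom<u8> for CompressionMethod {
    // The unrecognized ID is returned as the error value.
    type Error = u8;

    fn try_from(n: u8) -> Result<Self, Self::Error> {
        match n {
            0 => Ok(Self::None),
            1 => Ok(Self::Gzip),
            2 => Ok(Self::Bzip2),
            3 => Ok(Self::Lzma),
            4 => Ok(Self::Rans4x8),
            5 => Ok(Self::RansNx16),
            6 => Ok(Self::AdaptiveArithmeticCoding),
            7 => Ok(Self::Fqzcomp),
            8 => Ok(Self::NameTokenizer),
            _ => Err(n),
        }
    }
}

impl From<CompressionMethod> for u8 {
    fn from(method: CompressionMethod) -> Self {
        match method {
            CompressionMethod::None => 0,
            CompressionMethod::Gzip => 1,
            CompressionMethod::Bzip2 => 2,
            CompressionMethod::Lzma => 3,
            CompressionMethod::Rans4x8 => 4,
            CompressionMethod::RansNx16 => 5,
            CompressionMethod::AdaptiveArithmeticCoding => 6,
            CompressionMethod::Fqzcomp => 7,
            CompressionMethod::NameTokenizer => 8,
        }
    }
}
```

A writer targeting 3.0 could use something like `requires_cram_3_1` to reject a method that a 3.0 reader would not understand.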

There is currently no way to apply user-defined compression methods to a particular data series. The implementation simply uses gzip (DEFLATE) to compress all block data. As part of this work, a map from data series to compression method must also be created to allow overriding the default encoders. Besides allowing greater flexibility, this would provide the framework for building selectable presets, à la htslib's/samtools' CRAM compression profiles.
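One possible shape for such a map is sketched below. This is only an illustration of the idea: `DataSeries`, `Encoder`, `BlockEncoderMap`, and the preset function are hypothetical names, the data series list is abbreviated, and the preset's choices are made up for the example rather than copied from any htslib profile.

```rust
use std::collections::HashMap;

/// A few data series, abbreviated for the example (hypothetical type).
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
pub enum DataSeries {
    BamFlags,
    ReadNames,
    QualityScores,
    Bases,
}

/// Block encoders a writer could choose from (hypothetical type).
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub enum Encoder {
    Gzip,
    Bzip2,
    RansNx16,
    AdaptiveArithmeticCoding,
    Fqzcomp,
    NameTokenizer,
}

/// Maps data series to encoders, falling back to a default encoder.
#[derive(Clone, Debug)]
pub struct BlockEncoderMap {
    fallback: Encoder,
    overrides: HashMap<DataSeries, Encoder>,
}

impl Default for BlockEncoderMap {
    // Mirrors the behavior described above: gzip for everything.
    fn default() -> Self {
        Self {
            fallback: Encoder::Gzip,
            overrides: HashMap::new(),
        }
    }
}

impl BlockEncoderMap {
    /// Overrides the encoder used for a single data series.
    pub fn set(mut self, data_series: DataSeries, encoder: Encoder) -> Self {
        self.overrides.insert(data_series, encoder);
        self
    }

    /// Returns the override for a data series, or the fallback encoder.
    pub fn get(&self, data_series: DataSeries) -> Encoder {
        self.overrides
            .get(&data_series)
            .copied()
            .unwrap_or(self.fallback)
    }
}

/// An illustrative preset built on top of the map (choices are made up).
pub fn example_preset() -> BlockEncoderMap {
    BlockEncoderMap::default()
        .set(DataSeries::ReadNames, Encoder::NameTokenizer)
        .set(DataSeries::QualityScores, Encoder::Fqzcomp)
}
```

With this layout, a preset is just a prebuilt map, and a user can still call `set` on it to override individual data series.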

  • [x] cram/writer/num/vlq: Add variable-length quantity (uint7) writer
  • [x] cram/codecs: Add rANS Nx16 encoder
    • [x] Order-0 encoder
    • [x] Order-1 encoder
    • [x] Stripe encoder
    • [x] RLE encoder
    • [x] Bit packing encoder
  • [x] cram/codecs: Add adaptive arithmetic coding encoder
    • [x] Range coder encoder
    • [x] Statistical model encoder
    • [x] Order-0 encoder
    • [x] Order-1 encoder
    • [x] External (bzip2) encoder
    • [x] Stripe encoder
    • [x] Order-0 RLE encoder
    • [x] Order-1 RLE encoder
    • [x] Bit packing encoder
  • [ ] cram/codecs: Add fqzcomp encoder
  • [ ] cram/codecs: Add name tokenizer encoder
  • [x] cram/codecs: Add encoder options
  • [x] cram/data_container: Add block content-encoder map
  • [ ] cram/writer: Allow overriding block compression methods for data series
  • [ ] cram/writer: Allow overriding CRAM format version
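For reference, the variable-length quantity (uint7) writer from the first checklist item can be sketched roughly as follows. This is not the noodles implementation; `write_uint7` is a hypothetical name, and the byte order (most significant 7-bit group first, with the high bit set on every byte except the last) is an assumption based on my reading of the CRAM 3.1 draft.

```rust
use std::io::{self, Write};

/// Writes `n` as a variable-length quantity: 7 bits per byte, most
/// significant group first, with the continuation bit (0x80) set on
/// all bytes except the last. (Assumed layout; see the note above.)
pub fn write_uint7<W: Write>(writer: &mut W, n: u32) -> io::Result<()> {
    // A u32 needs at most ceil(32 / 7) = 5 groups.
    let mut buf = [0u8; 5];
    let mut i = buf.len() - 1;
    let mut n = n;

    // The last byte carries the low 7 bits and no continuation bit.
    buf[i] = (n & 0x7f) as u8;
    n >>= 7;

    // Fill the remaining groups from right to left, marking each one
    // as "more bytes follow".
    while n > 0 {
        i -= 1;
        buf[i] = ((n & 0x7f) as u8) | 0x80;
        n >>= 7;
    }

    writer.write_all(&buf[i..])
}
```

Under this layout, values below 128 take a single byte, and 300 is written as the two bytes `0x82 0x2c`.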

zaeleus, Aug 20 '22 21:08