
Zstd Support

Open reedacartwright opened this issue 7 years ago • 19 comments

An undergrad in the lab is looking into how modern compression algorithms work with genomic data. Long story short, it looks like there are justifications for using LZ4, ZSTD, and Brotli in different cases.

If we produced a pull request to add support for these libraries to htslib, maybe initially to cram, would it be likely to be accepted?

reedacartwright avatar May 05 '17 20:05 reedacartwright

I have also observed good performance from ZSTD, although the CRAM compressor also appears to take a very modern approach.

See my issue on squash, the "unifying compressor API", and its related compression benchmark:

https://github.com/quixdb/squash-benchmark/issues/35

kyleabeauchamp avatar May 05 '17 20:05 kyleabeauchamp

In particular, I was able to convince the squash benchmark folks to add the genome FASTA data to their compressor benchmark. I then ran the benchmark on chr1 and observed very good compression and speed for zstd at the highest (21) compression level:

https://gist.github.com/kyleabeauchamp/e5f5d79aa153bc85d854a705a25c9166

kyleabeauchamp avatar May 05 '17 20:05 kyleabeauchamp

@kyleabeauchamp My student has been using a modified version of lzbench to look at compression of FASTA, VCF, GTF, etc. I'll have her look at squash as well.

A programmer in the lab has produced a version of htslib that uses zstd instead of zlib, relying on the wrapper provided by zstd to port the code. https://github.com/zmertens/htslib/tree/zstd Ideally, we would like to have something in which zstd and zlib work together.
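
For anyone unfamiliar with the two libraries, here's a minimal sketch of what the swap looks like at the buffer level. This is not the code in that branch, bgzf block handling is omitted, and it's only meant to show the API shape (link with -lz -lzstd):

/* One-shot buffer compression with zlib vs zstd; illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>
#include <zstd.h>

int main(void) {
    const char src[] = "ACGTACGTACGTACGTACGTACGTACGTACGT";
    size_t srclen = sizeof(src) - 1;

    /* zlib path, roughly what bgzf does per 64 KiB block */
    uLongf zlen = compressBound(srclen);
    unsigned char *zbuf = malloc(zlen);
    if (!zbuf || compress2(zbuf, &zlen, (const Bytef *)src, srclen, 6) != Z_OK)
        return 1;

    /* zstd path: same one-shot shape, different entry point */
    size_t cap = ZSTD_compressBound(srclen);
    void *sbuf = malloc(cap);
    if (!sbuf)
        return 1;
    size_t slen = ZSTD_compress(sbuf, cap, src, srclen, 3);
    if (ZSTD_isError(slen))
        return 1;

    printf("zlib: %lu bytes, zstd: %zu bytes\n", (unsigned long)zlen, slen);
    free(zbuf);
    free(sbuf);
    return 0;
}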

reedacartwright avatar May 05 '17 20:05 reedacartwright

Cool. I'm not actually an HTSLib developer :) but I wanted to voice my support for any project that improves the compression ratio and/or read throughput of HTSLib-supported formats (particularly BAM and CRAM).

kyleabeauchamp avatar May 05 '17 20:05 kyleabeauchamp

I've experimented with Zstd and CRAM before. It's a definite win over zlib at some compression levels, both in speed and size. I experimented on this in Staden io_lib for the original Scramble implementation of CRAM (https://github.com/jkbonfield/io_lib). There is an earlier Zstd branch there but also a cram_modules branch which has lots of other experimental plugins.

The problem isn't adding Zstd, which is actually very easy, but putting together a group of changes that jointly justify a change to the specification. We don't want to do this piecemeal with lots of specification variants, but as a significant step forward, as we did between CRAM v2 and v3. I have various other experiments on the go, and in particular a significant improvement for compression of query names (beating pretty much everything bar the excessively slow context-mixing algorithms).

If you're interested in compression and looking at other formats then you may also want to check out https://github.com/IlyaGrebnov/libbsc. On some CRAM data series that's the winner out of the general purpose tools.

LZ4 does have potential use within samtools though, without needing to change any specifications. It's ideal for temporary file output, e.g. when spooling blocks out during a samtools sort.
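
As a rough sketch (not anything that exists in samtools today), spooling a sort buffer to a temp file with LZ4's block API might look like the following; the two-integer block header is just an illustrative framing, and you'd link with -llz4:

#include <stdio.h>
#include <stdlib.h>
#include <lz4.h>

/* Write one LZ4-compressed block, prefixed by its compressed and raw
 * sizes so the reader can allocate and call LZ4_decompress_safe(). */
static int spool_block(FILE *fp, const char *buf, int len) {
    int cap = LZ4_compressBound(len);
    char *out = malloc(cap);
    if (!out) return -1;
    int clen = LZ4_compress_default(buf, out, len, cap);
    if (clen <= 0) { free(out); return -1; }
    if (fwrite(&clen, sizeof clen, 1, fp) != 1 ||
        fwrite(&len,  sizeof len,  1, fp) != 1 ||
        fwrite(out, 1, (size_t)clen, fp) != (size_t)clen) {
        free(out);
        return -1;
    }
    free(out);
    return 0;
}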

jkbonfield avatar May 07 '17 08:05 jkbonfield

Regarding FASTA compression, it's really a special case and not so helpful here. Also note that the higher levels mainly work better by using larger block sizes, harming random access. SAM & BAM sequence encoding is predominantly about how efficient the LZ step is, as a genome-sorted file will have lots of overlap from one sequence to the next. CRAM is (usually) reference based, so the efficiency there comes down to how well the deltas get encoded. Zstd doesn't buy that much there really.

I experimented with column-oriented compression of VCF (similar to how CRAM relates to BAM). It gains a bit, but not as much as I'd hoped for. There are other reasons why that may be good to do, though this would be a larger project as it really means designing a new format from scratch.

jkbonfield avatar May 07 '17 08:05 jkbonfield

Thanks, this is helpful information in terms of the challenges and also what's been tried. I suspected someone did a lot of benchmarking at some point.

kyleabeauchamp avatar May 07 '17 16:05 kyleabeauchamp

I think one could get a lot out of using Zstd over Zlib due to its faster compression and decompression speeds.

Here is a summary of our results for bcf, vcf, sam, gff3, gtf, and fasta. I think it is clear that if your problem is limited by compression/decompression speed, Zstd is more efficient. For the same amount of compression, Zstd is faster, and for the same amount of time, Zstd compresses more. Zstd is also more tunable, offering a wider range of compression levels.

As you say @jkbonfield, enabling random access will probably reduce these advantages, but I poked around in the Zstd source code: each compression level's parameters are adjusted based on the size of the source block, and there are parameter tables for blocks below 16 KB.

[Figure: compression speed vs. compressed size. Note that the y-axis is compressed size / original size * 100.]

We have other metrics based on transmitting a dataset over a 'pipe' that include compression size, compression speed, and decompression speed. In those, zstd and lz4 do well depending on the 'pipe' size because of their fast compression and decompression speed.

reedacartwright avatar May 08 '17 21:05 reedacartwright

The pipe transfer point is an interesting one: if you're only talking about data transfer then you'll be willing to use any format, as it's temporary. However, if the process is BAM -> BAM.raw -> BAM+zstd -> ... -> BAM.raw -> BAM (deflate), then that's potentially a lot of compression and decompression steps. It's better, then, to just use something "good enough" like CRAM directly, without needing to transcode twice for the transmission step.

Some examples from my scramble experiments:

$ awk '{a+=$6} END {print a}' NA12878_S1.1.orig.cram.10k.size 
6.64126e+10
$ awk '{a+=$6} END {print a}' NA12878_S1.1.bam.fqz+zstd.10k.cram.size
6.23181e+10
$ awk '{a+=$6} END {print a}' NA12878_S1.1.bam.fqz+zstd.100k.cram.size
5.96217e+10

Here 10k and 100k refer to 10,000 and 100,000 reads per cram slice, i.e. the granularity of random access. Adding zstd and fqzcomp (quality) shaved off around 3% unless we want to get coarser-grained on random access. It's OK, but by itself isn't anything ground-breaking.

The .cram.size files were outputs from the cram_size tool. Examples of the original and zstd.10k one are:

$ cat NA12878_S1.1.orig.cram.size
Block CORE              , total size           0
Block content_id      11, total size  9291477075 g     RN
Block content_id      12, total size 50497889145     R QS
Block content_id      13, total size    10619884 g     IN
Block content_id      14, total size   316666117 g  rR SC
Block content_id      15, total size   453510486     R BF
Block content_id      16, total size   218338526    r  CF
Block content_id      17, total size   497168802    r  AP
Block content_id      19, total size   128590411    rR MQ
Block content_id      20, total size     8716899    r  NS
Block content_id      21, total size    17488877 g   R MF
Block content_id      22, total size   154859102 g   R TS
Block content_id      23, total size   188220982 g     NP
Block content_id      24, total size   777242598 g     NF
Block content_id      26, total size   219022726    rR FN
Block content_id      27, total size    85217034    r  FC
Block content_id      28, total size   529726388 g   R FP
Block content_id      29, total size     7825109 g     DL
Block content_id      30, total size  1784256524 g     BA
Block content_id      31, total size   127139382 g  rR BS
Block content_id      32, total size   123102176    rR TL
Block content_id 4279619, total size   117877737    rR AMC
Block content_id 5063770, total size         496 g     MDZ
Block content_id 5131587, total size         139 g     NMC
Block content_id 5459267, total size    95402099    rR SMC
Block content_id 5779523, total size    43436123    r  X0C
Block content_id 5779529, total size        1584 g     X0I
Block content_id 5779539, total size     7159543 g     X0S
Block content_id 5779779, total size    63076858 g  rR X1C
Block content_id 5779785, total size          52 g     X1I
Block content_id 5779795, total size     9021181 g     X1S
Block content_id 5783898, total size   279421941 g     XAZ
Block content_id 5784387, total size    53329435 g     XCC
Block content_id 5785411, total size    36359075    r  XGC
Block content_id 5786947, total size   188586224    rR XMC
Block content_id 5787203, total size        4936 g     XNC
Block content_id 5787459, total size    31961173    r  XOC
Block content_id 5788737, total size    49860294    r  XTA

and

$ cat NA12878_S1.1.bam.fqz+zstd.10k.cram.size 
Block CORE              , total size           0
Block content_id      11, total size  9123818505      [Z] RN
Block content_id      12, total size 46964081560      [q] QS
Block content_id      13, total size     6800171      [Z] IN
Block content_id      14, total size   316582561 g   R[Z] SC
Block content_id      15, total size   453510486     R BF
Block content_id      16, total size   218338526    r  CF
Block content_id      17, total size   497168802    r  AP
Block content_id      19, total size   128581699    rR MQ
Block content_id      20, total size     7979514    r [Z] NS
Block content_id      21, total size    17488877 g   R MF
Block content_id      22, total size   105277278     R[Z] TS
Block content_id      23, total size   122722059      [Z] NP
Block content_id      24, total size   786520929      [Z] NF
Block content_id      26, total size   219025326    rR FN
Block content_id      27, total size    85217034    r  FC
Block content_id      28, total size   501211157 g    [Z] FP
Block content_id      29, total size     7630147      [Z] DL
Block content_id      30, total size  1594899103 g    [Z] BA
Block content_id      31, total size   127143928 g  rR[Z] BS
Block content_id      32, total size   127375184    rR TL
Block content_id 4279619, total size   117875360    rR AMC
Block content_id 5063770, total size         491      [Z] MDZ
Block content_id 5131587, total size         148 g    [Z] NMC
Block content_id 5459267, total size    95402117    rR SMC
Block content_id 5779523, total size    26177637    r [Z] X0C
Block content_id 5779529, total size        3998      [Z] X0I
Block content_id 5779539, total size     6603397 g    [Z] X0S
Block content_id 5779779, total size    63077067 g  rR[Z] X1C
Block content_id 5779785, total size          39      [Z] X1I
Block content_id 5779795, total size     9070776 g    [Z] X1S
Block content_id 5783898, total size   228438521      [Z] XAZ
Block content_id 5784387, total size    53329470 g     XCC
Block content_id 5785411, total size    36359075    r  XGC
Block content_id 5786947, total size   188585670    rR XMC
Block content_id 5787203, total size        5976 g    [Z] XNC
Block content_id 5787459, total size    31961173    r  XOC
Block content_id 5788737, total size    49860294    r  XTA

The various g, r and R in the first one refer to the gzip (deflate), rANS order-0 and rANS order-1 codecs. The second example has [q] and [Z], which is how the plugins labelled their codecs: the fqzcomp quality codec and Zstd respectively. I can't recall which compression level I was using though.

Basically most of the space savings are coming from RN (read names), QS (quality scores), BA (literal bases, likely from the unmapped reads at the end of the file) and the XA:Z aux tag.

I already have a better read-name codec that beats zstd anyway, plus qualities aren't compressed well by that scheme either. See https://github.com/jkbonfield/mpeg_misc/blob/master/CE5/RESULTS.md for some tokenisation and compression of the first 10,000 read names (1 cram block) from a variety of test files covering multiple platforms. Libbsc in other tests does better still. E.g. on SRR1238539, for RN and XA:Z I get:

Block content_id      11, total size   529026032 g     RN
Block content_id      11, total size   245988043      [Z] RN
Block content_id      11, total size   217363907      [Z][b] RN

Block content_id 5783898, total size   242569444 g     XAZ
Block content_id 5783898, total size   234564415      [Z] XAZ
Block content_id 5783898, total size   191399854      [Z][b] XAZ

This really leaves auxiliary blocks, and maybe the BA field for unmapped data, as the key source of improvement (for CRAM anyway).

With BAM and BCF it's a totally different scenario that I haven't really investigated. As a general purpose codec though libbsc really is quite impressive and undervalued IMO.

Edit: and some more stats from pure CRAM vs BSC+FQZ on the platinum genomes sample:

-rw-r--r-- 1 jkb team117 66611124096 Dec 13 09:34 NA12878_S1.10k.cram
-rw-r--r-- 1 jkb team117 65641067471 Dec 13 13:07 NA12878_S1.s100k.cram
-rw-r--r-- 1 jkb team117 57767693482 Dec 13 16:24 NA12878_S1.100k.bsc_fqz.cram

That's more significant - 12% boost - but needs large block sizes to work well.

jkbonfield avatar May 09 '17 11:05 jkbonfield

Have you ever experimented with the pre-calculated dictionary abilities of zstd or brotli? I imagine that for some file types, shipping a pre-calculated dictionary along with htslib/bgzip would improve ratios.

reedacartwright avatar May 09 '17 16:05 reedacartwright

No I haven't, nor have I looked into how best to create a dictionary. I expect it wouldn't make much difference except for small block sizes (i.e. when supporting a high degree of random access). In that situation I think it may well help recover a lot of the file growth.

I've thought about this for the pure entropy encoders though. E.g. order-1 rANS works well on qualities but becomes a costly overhead if we shrink to 1000 records per block. In CRAM we could put the table in the container header (Vadim's original suggestion in fact) and have, say, 100 small slices per container, such that we share the header metadata cost between many blocks. Such an approach ought to work well for specifying an external LZ dictionary too.

For a direct replacement of bgzf in BAM/BCF, though, it's less clear how to do this, as the file contents have no block-oriented structure of their own (other than that imposed by the bgzf layer itself).

jkbonfield avatar May 10 '17 08:05 jkbonfield

@jkbonfield I'm reviving this ancient issue rather than making a new duplicate because I think that nothing has fundamentally changed, even though zstd has significantly improved over the last 4 years. The bigger issue is not implementing zstd in code but creating a new file format that existing toolchains and old versions of htslib may not be able to work with.

However, I've been experimenting with a forked version of bgzf that's using zstd instead of zlib / libdeflate, and I believe that there are some very significant gains to be had both for disk usage and compression / decompression speed. The lab that I'm in has a significant number of bgzip-compressed VCFs, and a relatively simple ZSTD swap reduces them to 60-70% of current bgzip-ed size while being much more CPU-efficient to decompress as well.

There's less benefit for .bams, as you measured in this issue, but I think there are options for using dictionaries to further improve .bam compression. Zstd will still not achieve the compression ratios that lzma, or especially .cram, can, but it's so much faster that it would still be a significant improvement over zlib / libdeflate.

If I were to develop a bzstdf layer for accessing block-compressed files using zstd, with metadata frames holding block-size information for the indexer to read, and either a dictionary included for several types of data (.bam, .vcf) or the ability to include a dictionary as a metadata frame to enable much greater compression at small block sizes, would you be interested in accepting it?
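
To make that concrete, here is a rough sketch rather than an agreed format: each data block becomes an independent zstd frame, and block sizes are recorded in a zstd skippable frame (magic numbers 0x184D2A50 through 0x184D2A5F in the zstd frame format) which ordinary decompressors ignore but an indexer can read. The payload layout below is hypothetical, and little-endian output is assumed:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

#define BZSTDF_SKIP_MAGIC 0x184D2A50u  /* first value of the skippable range */

/* Append a skippable frame holding the compressed size of each block
 * (hypothetical metadata layout; assumes a little-endian host). */
static int write_block_size_frame(FILE *fp, const uint32_t *sizes, uint32_t n) {
    uint32_t magic = BZSTDF_SKIP_MAGIC;
    uint32_t payload = n * (uint32_t)sizeof(uint32_t);
    if (fwrite(&magic, 4, 1, fp) != 1) return -1;
    if (fwrite(&payload, 4, 1, fp) != 1) return -1;
    if (fwrite(sizes, sizeof(uint32_t), n, fp) != n) return -1;
    return 0;
}

/* Compress one raw block into its own zstd frame and record its size. */
static int write_data_block(FILE *fp, const void *buf, size_t len,
                            uint32_t *size_out) {
    size_t cap = ZSTD_compressBound(len);
    void *tmp = malloc(cap);
    if (!tmp) return -1;
    size_t clen = ZSTD_compress(tmp, cap, buf, len, 3);
    if (ZSTD_isError(clen) || fwrite(tmp, 1, clen, fp) != clen) {
        free(tmp);
        return -1;
    }
    free(tmp);
    *size_out = (uint32_t)clen;
    return 0;
}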

pettyalex avatar Sep 17 '21 03:09 pettyalex

I personally think it would be eminently sensible, although I'd have to check with others whether it would be accepted. Certainly for new projects I've tried to push for such things (https://github.com/VGP/vgp-tools/issues/1). It ought to be an entirely new tool though, with a new suffix, as it'd be incompatible with bgzip. A choice of block size would help too. The default for bgzf is way too small IMO.

One other thing to consider is benchmarking against libbsc. This is another high-performance tool which may offer better compression ratios, particularly with the newer releases and faster entropy-encoding mode: https://github.com/IlyaGrebnov/libbsc. It may still be too slow for most use cases though; in particular, decompression speed won't match zstd (which is a tough one to beat). Plus zstd has the weight of being a published RFC behind it. Basically zstd is to gzip as bsc is to bzip2.

jkbonfield avatar Sep 17 '21 08:09 jkbonfield

Having a bzstd with a configurable block size would be super helpful for a lot of reasons. Just last week I was thinking it would be nice to have a tabix alternative that worked with zstd compression.

reedacartwright avatar Sep 17 '21 18:09 reedacartwright

Yes, I've been experimenting with a new tool to compress / decompress (bgzstd) and a completely different layer with the same API as bgzf (bzstdf). I still need to finish the metadata blocks. We would also have to update tabix to read offset information from zstd metadata blocks, rather than the "extra data" field in the gzip header.

Using a dictionary would get around most of the overhead incurred by choosing a smaller block size, so I really do believe it would be possible to have the best of all worlds: a higher compression ratio, faster compression & decompression, and still extremely fast random I/O enabled by the smaller block size.

pettyalex avatar Sep 17 '21 18:09 pettyalex

Agreed. How would you envisage the dictionary being used? There is a Zstd command (and I assume an associated function) to compute a dictionary from a file, so we could build the dictionary from, say, the first few MB of file data and then apply it to all blocks. Maybe it would be good if the format permitted periodic rebuilding of the dictionary, with a mechanism in the index that records where dictionary blocks are found.
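
The associated functions do exist in libzstd (zdict.h). A rough sketch of that train-then-apply flow, with error handling and I/O stripped down (the function name here is just illustrative):

#include <zstd.h>
#include <zdict.h>

/* Train a dictionary from concatenated sample blocks (e.g. the first
 * few MB of the file), then compress one later block with it. */
size_t train_and_compress(void *dst, size_t dst_cap,
                          const void *block, size_t block_len,
                          const void *samples, const size_t *sample_sizes,
                          unsigned nb_samples,
                          void *dict_buf, size_t dict_cap) {
    size_t dict_len = ZDICT_trainFromBuffer(dict_buf, dict_cap,
                                            samples, sample_sizes, nb_samples);
    if (ZDICT_isError(dict_len)) return dict_len;

    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    size_t clen = ZSTD_compress_usingDict(cctx, dst, dst_cap,
                                          block, block_len,
                                          dict_buf, dict_len, 3);
    ZSTD_freeCCtx(cctx);
    return clen;
}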

jkbonfield avatar Sep 20 '21 08:09 jkbonfield

I was actually thinking that, as part of the specification that adds Zstd compression, an appropriate dictionary could be included for each commonly block-compressed data type, i.e. BAM and VCF. There would then be a default dictionary available out of the box, which could potentially be overridden by a dictionary encoded in a metadata frame.

I need to test this to get real measurements, but it would make compression faster because there wouldn't have to be a training step, and it would make output files smaller because the dictionary would live in the bzstd binary rather than in the input or output files.

A model for this is the Brotli compression format, which has seen significant benefits from having a dictionary as part of its spec that ends up being built into the binary: https://datatracker.ietf.org/doc/html/rfc7932#appendix-A
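
A rough sketch of how that could look with libzstd's prebuilt-dictionary API; the dictionary bytes here are a placeholder for something trained offline on representative VCF or BAM blocks and generated into a header:

#include <zstd.h>

static const unsigned char vcf_default_dict[] = { /* placeholder bytes */ 0 };

/* Compress one block with a dictionary compiled into the binary.  In a
 * real tool the CDict would be created once at startup and reused. */
size_t compress_with_builtin_dict(void *dst, size_t dst_cap,
                                  const void *src, size_t src_len) {
    ZSTD_CDict *cdict = ZSTD_createCDict(vcf_default_dict,
                                         sizeof(vcf_default_dict), 3);
    if (!cdict) return (size_t)-1;
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    size_t clen = ZSTD_compress_usingCDict(cctx, dst, dst_cap,
                                           src, src_len, cdict);
    ZSTD_freeCCtx(cctx);
    ZSTD_freeCDict(cdict);
    return clen;
}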

pettyalex avatar Sep 20 '21 20:09 pettyalex

Definitely sounds like an interesting research project.

reedacartwright avatar Sep 20 '21 20:09 reedacartwright

This is certainly worth looking into, especially as most users should now be on Linux distributions that include zstd packages. We'll need to think carefully about exactly how we integrate it, though. Some points to consider include:

  • How would it integrate with the existing BGZF / htsFile infrastructure?
  • What block sizes to use, and how would blocking work? Zstd seems to be inherently block-based, so presumably the concatenation trick used for BGZF wouldn't be needed.
  • How would it work with the various indexes?
  • Do we want to embed the index in the compressed file itself? And if so, how?
  • Would it be possible to mark record boundaries in some way, so they can be discovered without having to uncompress the file? One of the annoyances of BGZF is that there's no way to do that, which makes splitting files difficult without some other form of index.

daviesrob avatar Nov 09 '21 11:11 daviesrob