Add zstd support as a compression method

Open zamazan4ik opened this issue 2 years ago • 3 comments

Summary

Add zstd codec support.

Desired Behaviour

CouchDB already supports multiple compression codecs (by the way, which exactly? I cannot find any documentation about it), so we would just like the option to use zstd as another compression codec.

Additional context

We want to use zstd because it seems to be one of the most advanced compression algorithms available and offers a very good trade-off between CPU cost and compression ratio.

zamazan4ik • Jan 21 '23 22:01

That's a good idea. We've even discussed it 7 years or so ago :-)

https://lists.apache.org/thread/kvvjodld2ly0t9rrllfd4d27pwf43hy5

My comment from 2016 was:

> There has been a surprising resurgence of compression research lately, with things like Brotli from Google and zstd from Facebook (http://facebook.github.io/zstd/). zstd has an interesting "training" mode where it can do a pass over small documents and learn a common dictionary. CouchDB already passes over data during compaction; would that be a good time to train a compression dictionary?

At the time it wasn't clear how zstd would fare, but it seems to have held up pretty well and has even made its way into the Linux kernel.

That being said, if we just do the naive compression of doc bodies like we do now, it might not be worth it, as there is a real long-term backward-compatibility cost in having to support another compression scheme essentially forever. That is already the situation with snappy. One motivating factor for zstd, though, would be the ability to apply per-btree or per-db compression using a dictionary. CouchDB's compaction provides a nice place to learn the dictionary and update it during each compaction cycle.

> CouchDB already supports multiple compression codecs (by the way, which exactly?

The current compression methods include snappy and the built-in Erlang term compression. See the docs for more details. The built-in Erlang compression is essentially deflate with 10 configurable levels.
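To give a feel for what the deflate levels trade off, here is a rough sketch in Python rather than Erlang. It is not CouchDB code: the sample document and the `deflate_sizes` helper are made up for illustration, and Python's `zlib` is used only because it exposes the same 0 to 9 deflate levels.

```python
import json
import zlib

# Hypothetical sample document, repeated so there is something to compress.
doc = {"_id": "user:1", "type": "user", "name": "Alice", "roles": ["admin", "dev"]}
body = json.dumps([doc] * 50).encode("utf-8")


def deflate_sizes(data: bytes) -> dict[int, int]:
    """Return the compressed size of data at each deflate level, 0 through 9."""
    return {level: len(zlib.compress(data, level)) for level in range(10)}


sizes = deflate_sizes(body)
for level, size in sizes.items():
    print(f"level {level}: {size} bytes (raw: {len(body)} bytes)")
```

Level 0 stores the data without compressing it, so it comes out slightly larger than the input; higher levels spend more CPU looking for matches.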

nickva • Jan 22 '23 04:01

zstd support was merged into OTP 28 (https://github.com/erlang/otp/pull/9316), and it has dictionary support.

If it fares well there for a bit and we'd really like it earlier, say on OTP 26, we could bring in that NIF as-is until OTP 28 becomes our minimum version.

nickva • Feb 10 '25 18:02

I like the idea of building a dictionary as we compact. It would be useful to see how much of a difference that makes, and particularly whether it varies much between databases (e.g., a dictionary trained on JSON itself might be all that we need, and that could be static across all databases).
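One cheap way to get a first impression of the dictionary effect on small documents, sketched in Python under some loud assumptions: it uses deflate's preset-dictionary feature (`zdict` in Python's `zlib`) as a stand-in for zstd dictionaries, the "training" step is just a concatenation of sample docs rather than zstd's real trainer, and the documents and helper names are invented. Real numbers would need actual zstd and actual database contents.

```python
import json
import zlib


def compress(data: bytes, zdict: bytes = b"") -> bytes:
    """Deflate data, optionally primed with a preset dictionary."""
    c = zlib.compressobj(level=6, zdict=zdict) if zdict else zlib.compressobj(level=6)
    return c.compress(data) + c.flush()


def decompress(blob: bytes, zdict: bytes = b"") -> bytes:
    """Inflate blob; the same dictionary used to compress must be supplied."""
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob) + d.flush()


def make_doc(i: int) -> bytes:
    """Hypothetical small JSON document, similar in shape to its siblings."""
    return json.dumps({
        "_id": f"order:{i}",
        "type": "order",
        "status": "shipped",
        "customer_id": f"customer:{i % 7}",
        "total_cents": 1000 + i,
    }).encode("utf-8")


# Docs "seen during compaction" become a naive dictionary (plain concatenation).
samples = [make_doc(i) for i in range(20)]
zdict = b"".join(samples)

# Docs written later are compressed one by one, with and without the dictionary.
new_docs = [make_doc(i) for i in range(100, 120)]
plain = sum(len(compress(d)) for d in new_docs)
primed = sum(len(compress(d, zdict)) for d in new_docs)

print(f"without dictionary: {plain} bytes, with dictionary: {primed} bytes")
```

Running the same comparison with a dictionary built from one database against documents from another would be a quick way to probe the "static across all databases" question.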

We'd need to ensure that the dictionary is stored earlier in the .couch/.view file than any term that is compressed with it, which might get interesting. We'd also need it protected with a write barrier, like we do for database headers.
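A toy sketch of that ordering constraint, in Python and with everything invented for illustration: the record layout, the `DICT`/`TERM` tags and the helper names are hypothetical and bear no relation to CouchDB's real on-disk format, deflate's preset dictionary again stands in for zstd, and `fsync` is assumed as the write barrier.

```python
import os
import struct
import tempfile
import zlib

DICT, TERM = 1, 2           # hypothetical record tags
HEADER = struct.Struct(">BQI")  # tag, referenced dictionary offset, payload size


def append(f, tag: int, ref: int, payload: bytes) -> int:
    """Append one record to the end of the file and return its offset."""
    f.seek(0, os.SEEK_END)
    offset = f.tell()
    f.write(HEADER.pack(tag, ref, len(payload)) + payload)
    return offset


def barrier(f) -> None:
    """Force everything written so far to disk before continuing."""
    f.flush()
    os.fsync(f.fileno())


def read(f, offset: int):
    """Read the record stored at offset."""
    f.seek(offset)
    tag, ref, size = HEADER.unpack(f.read(HEADER.size))
    return tag, ref, f.read(size)


def write_term(f, dict_offset: int, zdict: bytes, data: bytes) -> int:
    """Compress data with zdict and record where that dictionary lives."""
    c = zlib.compressobj(zdict=zdict)
    return append(f, TERM, dict_offset, c.compress(data) + c.flush())


def read_term(f, offset: int) -> bytes:
    """Load a term, fetching its dictionary from earlier in the file."""
    tag, dict_offset, blob = read(f, offset)
    if tag != TERM or dict_offset >= offset:
        raise ValueError("term must reference a dictionary stored before it")
    _, _, zdict = read(f, dict_offset)
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()


original = b'{"_id": "order:1", "type": "order", "status": "shipped"}'
with tempfile.TemporaryFile() as f:
    zdict = b'{"_id": "", "type": "order", "status": "shipped"}'
    dict_offset = append(f, DICT, 0, zdict)
    barrier(f)  # dictionary is durable before any term that depends on it
    term_offset = write_term(f, dict_offset, zdict, original)
    barrier(f)
    restored = read_term(f, term_offset)
```

The point of the first `barrier` call is that a crash between the two writes can leave an unused dictionary behind, but never a term whose dictionary is missing.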

rnewson • Feb 18 '25 09:02