Compression cache of numeric docvalues
Description
When benchmarking recently against some OLAP engines (no indexes, no stored fields, only column data), the results showed that they use only 50-70% of the storage of NumericDocValues, with comparable performance, which was surprising. I looked into their implementation, and it turns out they simply use BitShuffle and LZ4 to compress data blocks on the write side, and a global cache on the read side to hold decompressed data.
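For reference, a minimal sketch of the write-side idea, not what these engines actually ship: a byte-level shuffle (a simplification of BitShuffle) groups same-significance bytes of a block of longs together so a general-purpose compressor finds longer runs. `java.util.zip.Deflater` stands in for LZ4 here, and the class name and block layout are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.Deflater;

/** Write-side sketch: shuffle bytes of a block of longs, then compress the shuffled bytes. */
public class BlockCompressorSketch {

  /** Byte-level shuffle: output all least-significant bytes first, then the next byte, etc. */
  static byte[] byteShuffle(long[] block) {
    byte[] shuffled = new byte[block.length * Long.BYTES];
    for (int b = 0; b < Long.BYTES; b++) {
      for (int i = 0; i < block.length; i++) {
        shuffled[b * block.length + i] = (byte) (block[i] >>> (b * 8));
      }
    }
    return shuffled;
  }

  /** Compress one block of doc values; Deflater is only a stand-in for LZ4. */
  static byte[] compressBlock(long[] block) {
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(byteShuffle(block));
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }

  public static void main(String[] args) {
    long[] block = new long[1024];
    Arrays.fill(block, 42L); // similar values compress very well after the shuffle
    System.out.println("raw bytes: " + block.length * Long.BYTES
        + ", compressed bytes: " + compressBlock(block).length);
  }
}
```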
So in Lucene we have uncompressed data (via MMap) both on disk and in memory, while they have compressed data on disk and decompressed data in memory, which sounds quite reasonable to me. I believe things like a global cache can easily be done in a service (like ES) through a custom codec, but I still wonder if we can do something in our default codec?
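To make the read-side idea concrete, here is a minimal sketch of a process-wide cache of decompressed blocks, assuming a simple LRU built on `LinkedHashMap`. The key format and the loader function are hypothetical, not Lucene or ES APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Read-side sketch: a process-wide LRU of decompressed blocks, keyed by something like
 *  "segment/field/blockId". None of these names are Lucene APIs. */
final class DecompressedBlockCache {

  private final Map<String, long[]> lru;

  DecompressedBlockCache(int maxBlocks) {
    // accessOrder=true makes LinkedHashMap iterate in access order, i.e. behave as an LRU
    this.lru = new LinkedHashMap<String, long[]>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, long[]> eldest) {
        return size() > maxBlocks; // evict the least recently used block when over capacity
      }
    };
  }

  /** Return the decompressed block, invoking the loader (which would read and decompress
   *  the on-disk block) only on a cache miss. */
  synchronized long[] get(String blockKey, Function<String, long[]> decompressFromDisk) {
    return lru.computeIfAbsent(blockKey, decompressFromDisk);
  }
}
```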
IMO: just use a filesystem with this feature such as zfs.
Thanks for the feedback!
I agree that a transparent compression filesystem is pretty straightforward and helpful. But I suspect it is hard for users to know when Lucene takes charge of compression (like the term dictionary of SortedSetDocValues) and when it should be delegated to the filesystem. So what I was wondering about was the "default behavior".
To be honest, many of our users are moving away to save storage costs. We are implementing our own custom codec, but it would be great if Lucene could be improved as well, though I understand it is not easy to introduce this in Lucene :)
The advantage of letting a filesystem such as zfs handle this (it was designed to do exactly that) is that compression is integrated in the correct place and operating system caches work as expected.
It is best to let the OS handle the caching; it will do a better job. Lucene caching the doc values is, to me, just a step backwards towards the "fieldcache", which did not work well and caused a lot of operational pain.
OLAP engines split the format into a codec and a compression scheme, both configurable. For example, you can:
- Use the `ForUtil` codec and `LZ4` compression on a normal filesystem, with the cache managed by the engine.
- Use the `ForUtil` codec and `None` compression on a compressing filesystem, with the cache managed by the operating system.
I guess the problem here is that Lucene wraps everything in its codec, so some structures get compressed twice on a compressing filesystem, while others take up much more space on a normal filesystem. And it seems difficult for users to realize how much a compressed filesystem affects Lucene numeric doc values. This is, perhaps, leaving too much responsibility on the user side.
I still feel like we have something to do here, but I don't really know how to do it correctly. Let's keep things as they are :)
Yeah, I’ve been thinking about this. Elasticsearch now supports a time_series index mode with DELTA + FOR encoding on doc values. In time series or logging scenarios, storage cost usually matters more than query performance.
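As a rough illustration of why DELTA + FOR saves so much on sorted timestamps (this is only a sketch, not Elasticsearch's or Lucene's actual encoding): delta-encode consecutive values, then bit-pack the deltas with just enough bits for the largest delta in the block.

```java
/** Sketch of DELTA + FOR on a block of sorted timestamps: store a base value, then
 *  per-value deltas bit-packed with just enough bits for the largest delta. */
final class DeltaForSketch {

  /** Deltas between consecutive values; all non-negative when the input is sorted. */
  static long[] deltas(long[] sortedTimestamps) {
    long[] d = new long[sortedTimestamps.length - 1];
    for (int i = 1; i < sortedTimestamps.length; i++) {
      d[i - 1] = sortedTimestamps[i] - sortedTimestamps[i - 1];
    }
    return d;
  }

  /** Frame-of-reference width: bits needed to represent the largest delta in the block. */
  static int bitsPerValue(long[] deltas) {
    long max = 0;
    for (long d : deltas) max = Math.max(max, d);
    return max == 0 ? 1 : 64 - Long.numberOfLeadingZeros(max);
  }

  public static void main(String[] args) {
    long[] ts = {1700000000000L, 1700000000250L, 1700000000500L, 1700000001000L};
    // 4 raw longs take 256 bits; 3 deltas at 9 bits each plus one 64-bit base take ~91 bits
    System.out.println("bits per delta: " + bitsPerValue(deltas(ts)));
  }
}
```

The same trick falls apart if the values are not sorted, since deltas can then be large or negative, which is why index sorting comes up in the replies below.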
@easyice Something like DELTA+FOR shouldn't require any cache, right? To me that is a different problem with other challenges: the index would need to be sorted on e.g. the timestamp field for the delta compression to work effectively (I think?)
@rmuir You are right, it needs to be sorted on the timestamp field. In addition to enabling delta-compression on the timestamp field, index sorting brings another benefit: when sorting by timestamp and the query includes a condition on this field, the doc IDs after query filtering tend to be more contiguous (i.e., not completely randomly distributed). As a result, this mitigates the performance impact of random access when bulk decoding doc values.