
Compression cache of numeric docvalues

Open gf2121 opened this issue 6 months ago • 7 comments

Description

When benchmarking recently against some OLAP engines (no indexes, no stored fields, only column data), the results showed that they occupy only 50-70% of the storage of NumericDocValues, with comparable performance, which was surprising. I looked into their implementation and it turns out they simply apply BitShuffle and LZ4 to compress data blocks on the write side, and use a global cache on the read side to hold decompressed data.
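For illustration, a minimal sketch of what that write/read path could look like (this is not any engine's actual code: it uses a byte-level shuffle rather than true bit-level BitShuffle, and assumes the lz4-java library for the compression step; the names are hypothetical):

```java
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

final class ShuffleLz4Sketch {

  /** Transpose the bytes of each long so that byte 0 of every value comes first,
   *  then byte 1, etc. Similar bytes end up adjacent, which helps LZ4 find matches. */
  static byte[] shuffle(long[] values) {
    byte[] out = new byte[values.length * Long.BYTES];
    for (int b = 0; b < Long.BYTES; b++) {
      for (int i = 0; i < values.length; i++) {
        out[b * values.length + i] = (byte) (values[i] >>> (8 * b));
      }
    }
    return out;
  }

  /** Inverse of {@link #shuffle}. */
  static long[] unshuffle(byte[] bytes, int count) {
    long[] values = new long[count];
    for (int b = 0; b < Long.BYTES; b++) {
      for (int i = 0; i < count; i++) {
        values[i] |= (bytes[b * count + i] & 0xFFL) << (8 * b);
      }
    }
    return values;
  }

  /** Write side: shuffle a block of values, then LZ4-compress it. */
  static byte[] compressBlock(long[] block) {
    LZ4Compressor compressor = LZ4Factory.fastestInstance().fastCompressor();
    return compressor.compress(shuffle(block));
  }

  /** Read side: decompress, then undo the shuffle. */
  static long[] decompressBlock(byte[] compressed, int count) {
    LZ4FastDecompressor decompressor = LZ4Factory.fastestInstance().fastDecompressor();
    byte[] shuffled = decompressor.decompress(compressed, count * Long.BYTES);
    return unshuffle(shuffled, count);
  }
}
```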

So in Lucene we have non-compressed data (MMap) both on disk and in memory, while they have compressed data on disk and decompressed data in memory, which sounds quite reasonable to me. I believe things like a global cache can easily be done in a service (like ES) through a custom codec, but I still wonder if we can do something in our default codec?
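For the read side, the "global cache" part could be as simple as a bounded LRU map of decompressed blocks, e.g. something along these lines (purely a sketch; the key shape and the sizing by block count are made up, a real implementation would track bytes and concurrency properly):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** A tiny LRU cache of decompressed doc-value blocks, keyed by (segment, field, block). */
final class DecompressedBlockCache {

  record BlockKey(Object segmentId, String field, long blockIndex) {}

  private final int maxBlocks;
  private final Map<BlockKey, long[]> cache;

  DecompressedBlockCache(int maxBlocks) {
    this.maxBlocks = maxBlocks;
    // access-order LinkedHashMap gives simple LRU eviction
    this.cache = new LinkedHashMap<>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<BlockKey, long[]> eldest) {
        return size() > DecompressedBlockCache.this.maxBlocks;
      }
    };
  }

  /** Return the cached decompressed block, or decompress and cache it. */
  synchronized long[] getOrDecompress(BlockKey key, Supplier<long[]> decompress) {
    return cache.computeIfAbsent(key, k -> decompress.get());
  }
}
```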

gf2121 avatar Jun 17 '25 16:06 gf2121

IMO: just use a filesystem with this feature such as zfs.

rmuir avatar Jun 17 '25 19:06 rmuir

Thanks for the feedback!

I agree that a transparent compression filesystem is pretty straightforward and helpful. But I suspect it is hard for users to know when Lucene takes charge of compression (like the term dictionary of SortedSetDocValues) and when it should be delegated to the filesystem. So what I was wondering about was the "default behavior".

To be honest, many of our users are moving away to save storage costs. We are implementing our own custom codec, but it would be great if Lucene could be improved as well, though I understand it is not easy to introduce this in Lucene :)

gf2121 avatar Jun 18 '25 05:06 gf2121

The advantage of letting a filesystem such as zfs (which was designed to do exactly this) handle the compression is that it is integrated in the correct place and operating system caches work as expected.

It is best to let the OS handle the caching; it will do a better job. Lucene caching the doc values seems to me like a step backwards to the "fieldcache", which did not work well and caused a lot of operational pain.

rmuir avatar Jun 18 '25 18:06 rmuir

OLAP engines split their format into a codec and a compression, both configurable (a rough sketch of that split follows the list). For example, you can:

  • Use a ForUtil codec and LZ4 compression on a normal filesystem, with the cache managed by the engine.
  • Use a ForUtil codec and no compression on a compressing filesystem, with the cache managed by the operating system.
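To illustrate what that split could look like as an API (hypothetical names, not Lucene's actual classes):

```java
/** Hypothetical split of per-value encoding ("codec") from block compression. */
interface BlockEncoder {
  byte[] encode(long[] block);              // e.g. a FOR/bit-packing style encoder
  long[] decode(byte[] encoded, int count);
}

interface BlockCompressor {
  byte[] compress(byte[] data);             // e.g. LZ4, or a pass-through on zfs
  byte[] decompress(byte[] data, int originalLength);
}

final class ColumnBlockWriter {
  private final BlockEncoder encoder;
  private final BlockCompressor compressor;

  ColumnBlockWriter(BlockEncoder encoder, BlockCompressor compressor) {
    this.encoder = encoder;
    this.compressor = compressor;
  }

  /** Encode, then (optionally) compress one block of values. */
  byte[] writeBlock(long[] block) {
    return compressor.compress(encoder.encode(block));
  }
}
```

With a pass-through BlockCompressor you get the "let the filesystem and OS handle it" setup, and with an LZ4 one plus an engine-managed cache you get the other.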

I guess the problem here is that Lucene wraps everything in its Codec, so some structures get compressed twice on a compressing filesystem while others take much more space on a normal one. And it seems difficult for users to realize how much a compressing filesystem affects Lucene numeric doc values. This is, perhaps, leaving too much responsibility on the user side.

I still feel like we have something to do here, but I don't really know how to do it correctly. Let's keep things as they are :)

gf2121 avatar Jun 19 '25 07:06 gf2121

Yeah, I’ve been thinking about this. Elasticsearch now supports a time_series index mode with DELTA + FOR encoding on doc values. In time series or logging scenarios, storage cost usually matters more than query performance.
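To make the idea concrete, a toy sketch of DELTA + FOR on a sorted block of timestamps (not Elasticsearch's actual implementation, just the principle: store the first value, then bit-pack the deltas with just enough bits for the largest gap in the block):

```java
final class DeltaForSketch {

  /** Deltas between consecutive values of an already-sorted block. */
  static long[] deltas(long[] sortedTimestamps) {
    long[] deltas = new long[sortedTimestamps.length - 1];
    for (int i = 1; i < sortedTimestamps.length; i++) {
      deltas[i - 1] = sortedTimestamps[i] - sortedTimestamps[i - 1];
    }
    return deltas;
  }

  /** Bits needed per delta: enough for the largest gap in the block. */
  static int bitsPerDelta(long[] deltas) {
    long max = 0;
    for (long d : deltas) {
      max = Math.max(max, d);
    }
    return max == 0 ? 1 : 64 - Long.numberOfLeadingZeros(max);
  }

  public static void main(String[] args) {
    // Dense millisecond timestamps: raw values need 64 bits, deltas need very few.
    long[] timestamps = {1718841600000L, 1718841601000L, 1718841602000L, 1718841604000L};
    System.out.println("bits per delta: " + bitsPerDelta(deltas(timestamps))); // 11 instead of 64
  }
}
```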

easyice avatar Jun 20 '25 00:06 easyice

@easyice Something like DELTA+FOR shouldn't require any cache, right? To me that is a different problem with other challenges: the index would need to be sorted on the timestamp field, for example, for the delta-compression to work effectively (I think?)

rmuir avatar Jun 20 '25 03:06 rmuir

@rmuir You are right, it needs to be sorted on the timestamp field. In addition to enabling delta-compression on the timestamp field, index sorting brings another benefit: when sorting by timestamp and the query includes a condition on this field, the doc IDs after query filtering tend to be more contiguous (i.e., not completely randomly distributed). As a result, this mitigates the performance impact of random access when bulk decoding doc values.

easyice avatar Jun 20 '25 04:06 easyice