LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.

Open jpountz opened this issue 4 years ago • 6 comments

This moves doc values to an approach that is more similar to postings, where values are grouped in blocks of 128 values that are compressed together. Decoding a single value requires decoding the entire block that contains the value.
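As a rough illustration of the trade-off described above, here is a minimal sketch of block-wise encoding, assuming a simple frame-of-reference scheme (subtract the block minimum, then bit-pack the deltas). The class `BlockPackedSketch` and all of its members are hypothetical names chosen for this example; this is not the format implemented in the PR and not a Lucene API.

```java
import java.util.Arrays;

/** Hypothetical sketch of block-wise value encoding; not Lucene's actual format. */
public class BlockPackedSketch {
    /** Block size taken from the description above. */
    static final int BLOCK_SIZE = 128;

    /** One encoded block: the minimum value, bits per delta, and the packed deltas. */
    static final class Block {
        final long min;
        final int bitsPerValue;
        final long[] packed;

        Block(long min, int bitsPerValue, long[] packed) {
            this.min = min;
            this.bitsPerValue = bitsPerValue;
            this.packed = packed;
        }
    }

    /** Encodes exactly BLOCK_SIZE values as (min, bitsPerValue, bit-packed deltas). */
    static Block encode(long[] values) {
        if (values.length != BLOCK_SIZE) {
            throw new IllegalArgumentException("expected " + BLOCK_SIZE + " values");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        // max - min may overflow; it is then interpreted as an unsigned 64-bit range.
        int bpv = 64 - Long.numberOfLeadingZeros(max - min);
        long[] packed = new long[(BLOCK_SIZE * bpv + 63) / 64];
        for (int i = 0; bpv > 0 && i < BLOCK_SIZE; i++) {
            long delta = values[i] - min;
            int bitPos = i * bpv;
            int word = bitPos >>> 6;
            int shift = bitPos & 63;
            packed[word] |= delta << shift;
            if (shift + bpv > 64) {
                // The delta straddles two words: spill the high bits into the next one.
                packed[word + 1] |= delta >>> (64 - shift);
            }
        }
        return new Block(min, bpv, packed);
    }

    /** Decodes every value of a block. */
    static long[] decode(Block block) {
        long[] out = new long[BLOCK_SIZE];
        int bpv = block.bitsPerValue;
        if (bpv == 0) {
            Arrays.fill(out, block.min); // all values in the block are equal
            return out;
        }
        long mask = bpv == 64 ? -1L : (1L << bpv) - 1;
        for (int i = 0; i < BLOCK_SIZE; i++) {
            int bitPos = i * bpv;
            int word = bitPos >>> 6;
            int shift = bitPos & 63;
            long delta = block.packed[word] >>> shift;
            if (shift + bpv > 64) {
                delta |= block.packed[word + 1] << (64 - shift);
            }
            out[i] = block.min + (delta & mask);
        }
        return out;
    }

    /** Reads a single value: the whole containing block is decoded first. */
    static long get(Block[] blocks, int index) {
        long[] decoded = decode(blocks[index / BLOCK_SIZE]);
        return decoded[index % BLOCK_SIZE];
    }

    public static void main(String[] args) {
        long[] values = new long[BLOCK_SIZE];
        for (int i = 0; i < BLOCK_SIZE; i++) {
            values[i] = 1000 + 3L * i;
        }
        Block block = encode(values);
        System.out.println("bits per value: " + block.bitsPerValue);
        System.out.println("value at 42: " + get(new Block[] {block}, 42));
    }
}
```

In this sketch `get` pays the cost of decoding 128 values to return one, which is the cost the description refers to. A real reader would presumably keep the most recently decoded block around so that sequential access decodes each block only once.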

jpountz avatar Jul 27 '21 16:07 jpountz

This is really interesting/exciting!

I'm working through this PR now but I notice you've used a slightly different approach to the FOR encoding (compared to what's done in the postings). Is this intentional for some reason, or is it more to get something out quickly for benchmarking (results were interesting by the way!)? Is there a reason you chose not to use the existing ForUtil directly?

gsmiller avatar Jul 28 '21 16:07 gsmiller

Indeed I wanted to get something out quickly for benchmarking where I could easily play with different block sizes, while ForUtil is very rigid (hardcoded block size of 128 and explicitly rejects numbers of bits per value > 32).
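To make the 32-bit limitation concrete, here is a small sketch of how many bits per value a block needs once its minimum is subtracted. `BitsPerValueSketch` and its methods are hypothetical helpers written for this example; they are not part of ForUtil or any other Lucene class.

```java
/** Hypothetical helper; not part of ForUtil or any Lucene API. */
public class BitsPerValueSketch {
    /**
     * Bits needed to store any value in [min, max] as an offset from min.
     * The difference is treated as an unsigned 64-bit number, so the full
     * long range yields 64.
     */
    static int bitsRequired(long min, long max) {
        return 64 - Long.numberOfLeadingZeros(max - min);
    }

    /** True if a packer limited to 32 bits per value could store this range. */
    static boolean fitsIn32Bits(long min, long max) {
        return bitsRequired(min, max) <= 32;
    }
}
```

Doc IDs and term frequencies in postings are ints, so they never need more than 32 bits, whereas numeric doc values are longs: a block whose values span a range of 2^40, for example, needs 41 bits per value in this sketch.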

jpountz avatar Jul 28 '21 17:07 jpountz

> and explicitly rejects numbers of bits per value > 32

Ah right, of course this would be an issue here. Thanks for clarifying!

gsmiller avatar Jul 28 '21 17:07 gsmiller

@jpountz Could you consider using LZ4 or Zstd to compress the blocks directly? When the index is sorted by time series ID, compressing the blocks with LZ4 or Zstd might give a large compression ratio.

weizijun avatar Aug 02 '21 07:08 weizijun

I suspect that general-purpose compression algorithms like LZ4 or Zstd would not be good fits for this, but it could indeed be interesting to see if we can reuse ideas from these compression algorithms e.g. to be able to detect cycles in the data.

For now I'm focusing on not making queries too much slower with this change so that it has a chance of making it to the default codec. I don't plan on adding more fancy compression schemes, which tend to make things slower. I'd rather look into things like that in a follow-up.

jpountz avatar Aug 03 '21 06:08 jpountz

I see this JIRA is closed; please close this PR as well.

janhoy avatar Oct 01 '21 09:10 janhoy