Add a GCD-aware fastfield compression format.

Open fulmicoton opened this issue 3 years ago • 0 comments

In #1725, datetime fastfields will start being stored as Unix timestamps in microseconds.

Obviously the low bits will strongly hurt compression. In #1725 we will also introduce a precision, which will truncate all of the timestamp values.

For instance, if the precision is seconds, all of the fastfield values supplied to the fastfield writer will be divisible by 1,000,000. We want a codec that leverages this property by detecting the GCD rapidly, saving the GCD as metadata, and then dividing the values by the GCD before relying on another codec for compression.
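A minimal sketch of that encode/decode round trip, for illustration only. The names `GcdEncoded`, `euclid_gcd`, `gcd_encode` and `gcd_decode` are hypothetical and are not tantivy APIs; the sketch computes an exact GCD over all values and leaves the downstream codec out entirely.

```rust
/// Hypothetical container: the GCD kept as metadata plus the reduced values
/// that would be handed to the next codec (bitpacking, for example).
#[derive(Debug, PartialEq)]
pub struct GcdEncoded {
    pub gcd: u64,
    pub quotients: Vec<u64>,
}

/// Euclid's algorithm. Note that euclid_gcd(0, x) == x.
pub fn euclid_gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

/// Divide every value by the GCD of the whole column.
pub fn gcd_encode(values: &[u64]) -> GcdEncoded {
    let g = values.iter().fold(0u64, |acc, &v| euclid_gcd(acc, v));
    // An empty or all-zero column folds to 0; fall back to 1 to avoid dividing by zero.
    let g = g.max(1);
    GcdEncoded {
        gcd: g,
        quotients: values.iter().map(|&v| v / g).collect(),
    }
}

/// Multiply back by the stored GCD to recover the original values.
pub fn gcd_decode(encoded: &GcdEncoded) -> Vec<u64> {
    encoded.quotients.iter().map(|&q| q * encoded.gcd).collect()
}
```

With microsecond timestamps truncated to seconds, `gcd_encode` would store `gcd = 1_000_000` and quotients that are plain second-resolution timestamps.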

Outside of timestamps, it is not entirely uncommon to face values with a common GCD (sensor values, pinball scores :), amounts of money in IDR, etc.). Checking for the presence of a GCD can be done fast and does not need to be perfect... e.g. compute the GCD over a sample of N values and, if that GCD is not 1, check the candidate against all of the values with https://crates.io/crates/libdivide. If the check fails, assume GCD=1.
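The sample-then-verify heuristic could look roughly like the sketch below. Two assumptions are made here: the "sample" is simply the first `sample_size` values (the issue does not say how to sample), and the verification pass uses the plain `%` operator rather than libdivide, which would only make the same divisibility check faster. `detect_gcd` and `gcd_u64` are hypothetical names.

```rust
/// Euclid's algorithm. Note that gcd_u64(0, x) == x.
pub fn gcd_u64(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

/// Returns a divisor shared by all values, or 1 when the cheap check fails.
/// The result is not guaranteed to be the true GCD of the column.
pub fn detect_gcd(values: &[u64], sample_size: usize) -> u64 {
    // Step 1: candidate GCD from a small sample (assumption: the first N values).
    let candidate = values
        .iter()
        .take(sample_size)
        .fold(0u64, |acc, &v| gcd_u64(acc, v));
    // 0 means the sample was empty or all zeros; 1 means nothing to gain.
    if candidate <= 1 {
        return 1;
    }
    // Step 2: verify the candidate against every value.
    if values.iter().all(|&v| v % candidate == 0) {
        candidate
    } else {
        // Check failed: give up and assume GCD = 1, as suggested in the issue.
        1
    }
}
```

The fallback is deliberately lossy: when the sample suggests 2,000,000 but a later value is only divisible by 1,000,000, this sketch returns 1 instead of searching for the smaller common divisor.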

The benefit is large (for bitpacking, log2(1e6) is about 20 bits).
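To make the arithmetic concrete, here is a small sketch assuming the raw values are bit-packed directly, with no other transform applied first. `bits_needed` is a hypothetical helper, and the timestamp used in the comments is an arbitrary mid-2022 value.

```rust
/// Number of bits needed to bit-pack any value in 0..=max_value.
pub fn bits_needed(max_value: u64) -> u32 {
    u64::BITS - max_value.leading_zeros()
}

/// Bits saved per value when every value is divided by `gcd` before bitpacking.
pub fn bits_saved(max_value: u64, gcd: u64) -> u32 {
    bits_needed(max_value) - bits_needed(max_value / gcd)
}

// Example: a timestamp of 1_657_000_000_000_000 microseconds needs 51 bits.
// Divided by 1_000_000 it becomes 1_657_000_000 seconds, which needs 31 bits,
// so bits_saved(1_657_000_000_000_000, 1_000_000) is 20.
```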

@Pseitz @evanxg852000 can you discuss together to understand what the target is?

fulmicoton avatar Jul 06 '22 03:07 fulmicoton

Closed via https://github.com/quickwit-oss/tantivy/pull/1418

PSeitz avatar Aug 24 '22 08:08 PSeitz