tantivy Handling null values in fastfield for better use in aggregation

Given a field my_field with a fastfield and a document with no value for my_field, tantivy will insert a default value in the fastfield column store. For an integer/float, the default value is 0.

When doing a mean aggregation, this could lead to unwanted results from a user's point of view. Elasticsearch handles this by averaging only over the non-null values. You can also provide a missing value, which is then used for documents that don't have a value, so that the average covers all documents.
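The two behaviours described above can be sketched as follows. This is a minimal illustration, not tantivy or Elasticsearch code: the function `mean` and its `missing` parameter are hypothetical names, and the column is modelled as a plain slice of `Option<f64>`.

```rust
/// Mean over a column where `None` marks a document without a value.
/// Without `missing`, null documents are skipped. With `missing`, the given
/// value is substituted for every null, so all documents are counted.
/// (Hypothetical helper for illustration only.)
fn mean(values: &[Option<f64>], missing: Option<f64>) -> Option<f64> {
    let mut sum = 0.0;
    let mut count = 0u64;
    for value in values {
        // `or` falls back to the user-supplied missing value, if any.
        if let Some(v) = value.or(missing) {
            sum += v;
            count += 1;
        }
    }
    if count == 0 {
        None
    } else {
        Some(sum / count as f64)
    }
}
```

With the values 10, null, 20, skipping nulls averages over two documents, while a missing value of 0 averages over three. The second result is what a default of 0 in the fastfield produces today.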

Apr 14 '22 08:04 fmassot

This ticket needs more details fleshed out.

What changes in the schema definition?
What happens when a user inserts a doc with missing values?

Apr 14 '22 08:04 fulmicoton

I think we should change the fast_field API from

fn get(&self, doc: DocId) -> Item;

to

fn get(&self, doc: DocId) -> Option<Item>;
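
As a rough sketch of what the proposed signature gives callers, assuming a hypothetical trait `NullableFastField` and an in-memory `VecColumn` (neither is part of tantivy), a stored 0 and a missing value become distinguishable:

```rust
type DocId = u32;

/// Hypothetical trait mirroring the proposed signature.
trait NullableFastField<Item> {
    fn get(&self, doc: DocId) -> Option<Item>;
}

/// Toy in-memory column used only for illustration.
struct VecColumn {
    values: Vec<Option<u64>>,
}

impl NullableFastField<u64> for VecColumn {
    fn get(&self, doc: DocId) -> Option<u64> {
        // Out-of-range doc ids and null entries both yield `None`.
        self.values.get(doc as usize).copied().flatten()
    }
}
```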

Apr 14 '22 09:04 PSeitz

There's a conceptual issue: we don't know which values are null values, since val_if_missing is not stored.

Apr 14 '22 09:04 PSeitz

Issue Outline

The parameter val_if_missing is only available for IntFastFieldWriter. The issue does not exist for MultiValuedFastFieldWriter fields. The reason is that IntFastFieldWriter is like an array indexed by docid, while MultiValuedFastFieldWriter has an indirection: you first get a Range over the values for a docid, and that range is empty for null values.
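
The difference between the two layouts can be sketched like this. The structs and builder functions below are simplified stand-ins invented for illustration, not the actual IntFastFieldWriter or MultiValuedFastFieldWriter.

```rust
/// Single-valued layout: one slot per doc, indexed by doc id.
struct SingleValueColumn {
    values: Vec<u64>,
}

/// Multi-valued layout: `offsets[doc]..offsets[doc + 1]` is the range of
/// `values` that belongs to `doc`.
struct MultiValueColumn {
    offsets: Vec<usize>,
    values: Vec<u64>,
}

/// Every doc needs a slot, so nulls are filled with `val_if_missing`.
fn build_single(docs: &[Option<u64>], val_if_missing: u64) -> SingleValueColumn {
    SingleValueColumn {
        values: docs.iter().map(|v| v.unwrap_or(val_if_missing)).collect(),
    }
}

/// A doc without values simply gets an empty range.
fn build_multi(docs: &[Vec<u64>]) -> MultiValueColumn {
    let mut offsets = vec![0];
    let mut values = Vec::new();
    for doc_values in docs {
        values.extend_from_slice(doc_values);
        offsets.push(values.len());
    }
    MultiValueColumn { offsets, values }
}

impl MultiValueColumn {
    fn get(&self, doc: usize) -> &[u64] {
        &self.values[self.offsets[doc]..self.offsets[doc + 1]]
    }
}
```

In the single-valued column, a null doc and a doc that really contains val_if_missing end up with the same slot content. In the multi-valued column, the null doc is recognizable by its empty range.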

Since the user may not know which values are valid for val_if_missing, we should drop this parameter. It also limits compatibility between multi and single values for automatic conversion.

Potential Solutions

During Collection

Bitvec: During collection, we can carry a bitvec to mark values as missing, since it's impossible to know valid values upfront.
Multivalued: We could always use MultiValuedFastFieldWriter for collection and decide during serialization whether a field should be single or multivalued. This would alter the API for both creation and access. It could allow better compression. It may also affect ergonomics, since the user could choose not to set any cardinality. Related issue: https://github.com/quickwit-oss/tantivy/issues/1337
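
The bitvec option could look roughly like the sketch below. `NullableWriter` is a hypothetical type, the bitvec is a plain `Vec<u64>` of 64-bit words, and the placeholder 0 written for nulls is an arbitrary choice.

```rust
/// Hypothetical collector that records nulls in a bitvec next to the values.
struct NullableWriter {
    values: Vec<u64>,
    /// Bit `doc % 64` of word `doc / 64` is set when `doc` has no value.
    null_bits: Vec<u64>,
}

impl NullableWriter {
    fn new() -> Self {
        NullableWriter { values: Vec::new(), null_bits: Vec::new() }
    }

    /// Appends the value (or null) for the next doc id.
    fn add(&mut self, value: Option<u64>) {
        let doc = self.values.len();
        if doc % 64 == 0 {
            // Start a new 64-bit word of the bitvec.
            self.null_bits.push(0);
        }
        match value {
            Some(v) => self.values.push(v),
            None => {
                // The slot still needs some content; the bit marks it as null.
                self.values.push(0);
                self.null_bits[doc / 64] |= 1u64 << (doc % 64);
            }
        }
    }

    fn get(&self, doc: usize) -> Option<u64> {
        if doc >= self.values.len() {
            return None;
        }
        let is_null = ((self.null_bits[doc / 64] >> (doc % 64)) & 1) == 1;
        if is_null { None } else { Some(self.values[doc]) }
    }
}
```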

On Serialization

On serialization, we have two possibilities when serializing into a single value fast field:

Serialize a bitvec for the null values: This will cost 1 bit per value. The actual values stored for nulls may be important for compression. When using a bitvec, we could choose whatever values are optimal for the compression.
Compute an unused value for val_if_missing: Choosing a good unused val_if_missing that works well with the compression algorithms can be tricky. The value should be stored in the metadata.

For both solutions, the idea would be to choose a value that works well enough for most cases and optimize for some selected use cases.
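
One possible way to compute an unused val_if_missing is sketched below. The heuristic (prefer a gap inside the existing value range, otherwise extend the range by one) is an assumption made for illustration, based on the idea that a sentinel inside the range does not widen it for the codec. The thread does not specify how the value should be chosen, and `pick_val_if_missing` is a hypothetical function.

```rust
/// Picks a sentinel that does not occur in `values`, or `None` if this
/// heuristic finds no candidate. (Hypothetical helper, assumed heuristic.)
fn pick_val_if_missing(values: &[u64]) -> Option<u64> {
    let mut sorted = values.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    let (min, max) = match (sorted.first(), sorted.last()) {
        (Some(&min), Some(&max)) => (min, max),
        // No values at all: any sentinel is unused.
        _ => return Some(0),
    };
    // A gap inside [min, max] keeps the value range unchanged.
    for pair in sorted.windows(2) {
        if pair[1] - pair[0] > 1 {
            return Some(pair[0] + 1);
        }
    }
    // Otherwise extend the range by one, avoiding overflow and underflow.
    max.checked_add(1).or_else(|| min.checked_sub(1))
}
```

The chosen value would then be written to the field's metadata, so that readers can map it back to a null.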

Impact

Fixing this issue will probably be a breaking change. Current single value fast fields don't store val_if_missing.

Reading fast fields may be affected, since multivalue fast fields could be converted to single value fast fields.

API

The fast field API should be changed to return an Option, from

    fn get(&self, doc: DocId) -> Item;

to

    fn get(&self, doc: DocId) -> Option<Item>;

May 09 '22 15:05 PSeitz

This will cause a regression on some aggregations (about 5% to 35% in the benchmarks below), due to the higher complexity of Option<T> handling.

Option<T>
test aggregation::tests::bench::bench_aggregation_average_f64                                                            ... bench:   8,031,760 ns/iter (+/- 559,795)
test aggregation::tests::bench::bench_aggregation_average_u64                                                            ... bench:   6,808,943 ns/iter (+/- 224,061)
test aggregation::tests::bench::bench_aggregation_average_u64_and_f64                                                    ... bench:  12,140,050 ns/iter (+/- 215,010)
test aggregation::tests::bench::bench_aggregation_histogram_only                                                         ... bench:  14,272,004 ns/iter (+/- 1,183,308)
test aggregation::tests::bench::bench_aggregation_histogram_only_hard_bounds                                             ... bench:  12,934,024 ns/iter (+/- 519,398)
test aggregation::tests::bench::bench_aggregation_histogram_with_avg                                                     ... bench:  36,996,567 ns/iter (+/- 4,856,764)
test aggregation::tests::bench::bench_aggregation_range_only                                                             ... bench:  10,969,961 ns/iter (+/- 626,658)
test aggregation::tests::bench::bench_aggregation_stats_f64                                                              ... bench:   8,801,985 ns/iter (+/- 217,994)
test aggregation::tests::bench::bench_aggregation_sub_tree                                                               ... bench:  16,883,301 ns/iter (+/- 878,801)


T
test aggregation::tests::bench::bench_aggregation_average_f64                                                            ... bench:   6,234,289 ns/iter (+/- 166,424)
test aggregation::tests::bench::bench_aggregation_average_u64                                                            ... bench:   5,088,771 ns/iter (+/- 41,264)
test aggregation::tests::bench::bench_aggregation_average_u64_and_f64                                                    ... bench:   9,195,964 ns/iter (+/- 66,449)
test aggregation::tests::bench::bench_aggregation_histogram_only                                                         ... bench:  13,627,255 ns/iter (+/- 364,308)
test aggregation::tests::bench::bench_aggregation_histogram_only_hard_bounds                                             ... bench:  12,166,709 ns/iter (+/- 756,541)
test aggregation::tests::bench::bench_aggregation_histogram_with_avg                                                     ... bench:  33,687,641 ns/iter (+/- 2,535,456)
test aggregation::tests::bench::bench_aggregation_range_only                                                             ... bench:   8,829,932 ns/iter (+/- 707,086)
test aggregation::tests::bench::bench_aggregation_stats_f64                                                              ... bench:   6,503,152 ns/iter (+/- 984,881)
test aggregation::tests::bench::bench_aggregation_sub_tree                                                               ... bench:  12,998,556 ns/iter (+/- 1,228,321)

Sep 27 '22 03:09 PSeitz
