Handling null values in fastfield for better use in aggregation
Given a field my_field with a fastfield and a document with no value for my_field, tantivy will insert a default value in the fastfield column store. For an integer/float, the default value is 0.
When doing a mean aggregation, this could lead to unwanted results from a user's point of view.
Elasticsearch handles this by averaging only the non-null values. You can also provide a missing value, which is then used for documents that don't have a value, so that the average covers all documents.
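To make the two behaviors concrete, here is a minimal sketch in plain Rust. It is independent of tantivy and Elasticsearch: the function name average and its missing parameter are made up for illustration, and each document is modelled as an Option<f64>.

```rust
/// Averages optional values. Hypothetical helper, not a tantivy API.
/// With `missing == None`, documents without a value are skipped.
/// With `missing == Some(m)`, documents without a value count as `m`.
fn average(values: &[Option<f64>], missing: Option<f64>) -> Option<f64> {
    let mut sum = 0.0;
    let mut count = 0u64;
    for &value in values {
        // Fall back to the `missing` value, if one was provided.
        if let Some(v) = value.or(missing) {
            sum += v;
            count += 1;
        }
    }
    // No contributing document means there is no meaningful average.
    if count == 0 {
        None
    } else {
        Some(sum / count as f64)
    }
}
```

With values 10, null and 20, the first mode averages over two documents and the second (with a missing value of 0) over three, which is the difference a default of 0 silently introduces today.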
This ticket needs more details fleshed out:
- what changes in the schema definition?
- what happens when a user inserts a doc with missing values?
I think we should change the fast_field API from
fn get(&self, doc: DocId) -> Item;
to
fn get(&self, doc: DocId) -> Option<Item>;
There's a conceptual issue: we don't know which values are null values, since val_if_missing is not stored.
Issue Outline
The parameter val_if_missing is only available for IntFastFieldWriter. The issue does not exist for MultiValuedFastFieldWriter fields.
The reason is that IntFastFieldWriter is like an array where the index is the docid, while MultiValuedFastFieldWriter has an indirection: you first get a Range over the values for a docid, and that range is empty for null values.
Since the user may not know valid values for val_if_missing, we should drop this parameter. It also limits compatibility between multi and single values for automatic conversion.
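The layout difference described above can be sketched as follows. The two structs are simplified stand-ins invented for illustration, not tantivy's actual writers or readers; they only show why a missing value is ambiguous in the dense layout and unambiguous with the range indirection.

```rust
use std::ops::Range;

/// Hypothetical single-valued column: the doc id indexes directly into `values`,
/// so every doc has a slot and a missing value needs some placeholder.
struct SingleValueColumn {
    values: Vec<u64>,
}

impl SingleValueColumn {
    fn get(&self, doc: u32) -> u64 {
        self.values[doc as usize]
    }
}

/// Hypothetical multi-valued column: `offsets[doc]..offsets[doc + 1]` is the
/// range of `values` belonging to `doc`. An empty range means "no value".
struct MultiValueColumn {
    offsets: Vec<usize>,
    values: Vec<u64>,
}

impl MultiValueColumn {
    fn range(&self, doc: u32) -> Range<usize> {
        let doc = doc as usize;
        self.offsets[doc]..self.offsets[doc + 1]
    }

    fn get(&self, doc: u32) -> &[u64] {
        &self.values[self.range(doc)]
    }
}
```

For three documents where the second has no value, the dense column has to store something like [5, 0, 7], while the multi-valued column stores offsets [0, 1, 1, 2] over values [5, 7] and returns an empty slice for the second document.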
Potential Solutions
During Collection
- Bitvec: During collection, we can carry a bitvec to mark values as missing, since it's impossible to know valid values upfront.
- Multivalued: We could always use MultiValuedFastFieldWriter for collection and decide during serialization whether a field should be single or multivalued. This would alter the API, both on creation and on access. It could allow better compression. It may also affect ergonomics, since the user could decide not to set any cardinality. Related issue: https://github.com/quickwit-oss/tantivy/issues/1337
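A rough sketch of the bitvec option, assuming one presence bit per document packed into u64 words. NullableColumnWriter and its methods are hypothetical names and do not correspond to existing tantivy types.

```rust
/// Hypothetical writer that tracks presence in a bitset next to the values.
#[derive(Default)]
struct NullableColumnWriter {
    /// One slot per doc; holds a placeholder when the doc has no value.
    values: Vec<u64>,
    /// Presence bitset, one bit per doc, 64 docs per word.
    present: Vec<u64>,
}

impl NullableColumnWriter {
    /// Records the value (or absence of a value) for the next doc id.
    fn add_doc(&mut self, value: Option<u64>) {
        let doc = self.values.len();
        if doc % 64 == 0 {
            // Start a new word of presence bits, all initially unset.
            self.present.push(0);
        }
        if value.is_some() {
            self.present[doc / 64] |= 1u64 << (doc % 64);
        }
        // The placeholder is arbitrary: the bitset, not the value, marks nulls.
        self.values.push(value.unwrap_or(0));
    }

    /// Returns `None` for docs without a value or outside the column.
    fn get(&self, doc: u32) -> Option<u64> {
        let doc = doc as usize;
        if doc >= self.values.len() {
            return None;
        }
        let is_present = (self.present[doc / 64] >> (doc % 64)) & 1 == 1;
        is_present.then(|| self.values[doc])
    }
}
```

Because presence is tracked separately, a genuine 0 and a missing value stay distinguishable, and no valid sentinel has to be known upfront.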
On Serialization
On serialization we have two possibilities when serializing into a single value fast field:
- Serialize a bitvec for the null values. This will cost 1 bit per value. The actual values stored for nulls may be important for compression; when using a bitvec, we could choose whatever values are optimal for the compression.
- Compute an unused value for val_if_missing. Choosing a good unused val_if_missing can be tricky if it is to work well with the compression algorithms. The value should be stored in the metadata.
For both solutions, the idea would be to choose a value that works well enough for most cases and to optimize for some selected use cases.
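As one possible heuristic for the second option, the sketch below picks a value that no document uses, preferring a gap inside the existing [min, max] range so that the value range (and thus a bit-packed width) does not grow. Both the function pick_unused_value and this preference are assumptions made for illustration; a real choice would depend on the codec in use.

```rust
/// Picks a value that does not occur in `values`. Hypothetical helper.
/// Returns `None` only if no u64 is free, which cannot happen in practice.
fn pick_unused_value(values: &[u64]) -> Option<u64> {
    let mut sorted: Vec<u64> = values.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    let (min, max) = match (sorted.first(), sorted.last()) {
        (Some(&min), Some(&max)) => (min, max),
        // Empty column: every value is unused.
        _ => return Some(0),
    };
    // Prefer a gap between two neighbouring values: the range stays the same.
    for pair in sorted.windows(2) {
        if pair[1] - pair[0] > 1 {
            return Some(pair[0] + 1);
        }
    }
    // No gap: extend the range by one, above if possible, otherwise below.
    max.checked_add(1).or_else(|| min.checked_sub(1))
}
```

The chosen value would then be written to the field's metadata, so that a reader can map it back to a missing value.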
Impact
Fixing this issue will probably be a breaking change. Current single value fast fields don't store val_if_missing.
Reading fast fields may be affected, since multivalue fast fields could be converted to single value fast fields.
API
The fast field API should be changed to return Option<Item>, from
fn get(&self, doc: DocId) -> Item;
to
fn get(&self, doc: DocId) -> Option<Item>;
This will cause a small regression on some aggregations, due to the higher complexity of Option<T> handling.
Benchmark results with Option<T>:
test aggregation::tests::bench::bench_aggregation_average_f64 ... bench: 8,031,760 ns/iter (+/- 559,795)
test aggregation::tests::bench::bench_aggregation_average_u64 ... bench: 6,808,943 ns/iter (+/- 224,061)
test aggregation::tests::bench::bench_aggregation_average_u64_and_f64 ... bench: 12,140,050 ns/iter (+/- 215,010)
test aggregation::tests::bench::bench_aggregation_histogram_only ... bench: 14,272,004 ns/iter (+/- 1,183,308)
test aggregation::tests::bench::bench_aggregation_histogram_only_hard_bounds ... bench: 12,934,024 ns/iter (+/- 519,398)
test aggregation::tests::bench::bench_aggregation_histogram_with_avg ... bench: 36,996,567 ns/iter (+/- 4,856,764)
test aggregation::tests::bench::bench_aggregation_range_only ... bench: 10,969,961 ns/iter (+/- 626,658)
test aggregation::tests::bench::bench_aggregation_stats_f64 ... bench: 8,801,985 ns/iter (+/- 217,994)
test aggregation::tests::bench::bench_aggregation_sub_tree ... bench: 16,883,301 ns/iter (+/- 878,801)
Benchmark results with T:
test aggregation::tests::bench::bench_aggregation_average_f64 ... bench: 6,234,289 ns/iter (+/- 166,424)
test aggregation::tests::bench::bench_aggregation_average_u64 ... bench: 5,088,771 ns/iter (+/- 41,264)
test aggregation::tests::bench::bench_aggregation_average_u64_and_f64 ... bench: 9,195,964 ns/iter (+/- 66,449)
test aggregation::tests::bench::bench_aggregation_histogram_only ... bench: 13,627,255 ns/iter (+/- 364,308)
test aggregation::tests::bench::bench_aggregation_histogram_only_hard_bounds ... bench: 12,166,709 ns/iter (+/- 756,541)
test aggregation::tests::bench::bench_aggregation_histogram_with_avg ... bench: 33,687,641 ns/iter (+/- 2,535,456)
test aggregation::tests::bench::bench_aggregation_range_only ... bench: 8,829,932 ns/iter (+/- 707,086)
test aggregation::tests::bench::bench_aggregation_stats_f64 ... bench: 6,503,152 ns/iter (+/- 984,881)
test aggregation::tests::bench::bench_aggregation_sub_tree ... bench: 12,998,556 ns/iter (+/- 1,228,321)
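One plausible source of this overhead is the extra per-document check. The sketch below contrasts the two loop shapes on plain slices; it is a simplified illustration with made-up function names, not the code that was benchmarked above.

```rust
/// Sum over a dense column: every slot holds a value, no per-doc check.
fn sum_dense(values: &[u64]) -> u64 {
    values.iter().sum()
}

/// Sum and count over optional values: each doc is checked for presence,
/// and the count has to be tracked because it may differ from the doc count.
fn sum_optional(values: &[Option<u64>]) -> (u64, u64) {
    let mut sum = 0u64;
    let mut count = 0u64;
    for value in values.iter().flatten() {
        sum += value;
        count += 1;
    }
    (sum, count)
}
```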