Make it possible to have high perf, non dictionary encoded Str/Bytes field

Open fulmicoton opened this issue 2 years ago • 1 comments

In Tantivy 0.20, we introduced the new columnar format.

Before that, the bytes fast field was single-valued only. One column stored the start offsets, and a second column stored the bytes.

Doing a lookup was then as simple as

let start_offset = start_offset_col.get(doc);
let end_offset = start_offset_col.get(doc + 1); // data is simply concatenated.
let data: &[u8] = data_col.get_slice(start_offset..end_offset);
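To make the old layout concrete, here is a minimal in-memory sketch of it. `OffsetBytesColumn`, `from_values` and `get` are hypothetical names used for illustration, not tantivy's actual types; the sketch assumes one trailing offset is stored so that `doc + 1` is always valid.

```rust
/// Hypothetical in-memory model of the offset-based layout (not tantivy's real types).
struct OffsetBytesColumn {
    /// start_offsets[doc] is where the payload of `doc` begins.
    /// One extra trailing entry marks the end of the last payload.
    start_offsets: Vec<u64>,
    /// All payloads, simply concatenated in doc order.
    data: Vec<u8>,
}

impl OffsetBytesColumn {
    fn from_values(values: &[&[u8]]) -> Self {
        let mut start_offsets = Vec::with_capacity(values.len() + 1);
        let mut data = Vec::new();
        for value in values {
            start_offsets.push(data.len() as u64);
            data.extend_from_slice(value);
        }
        // Trailing offset so that get(last_doc) can read start_offsets[doc + 1].
        start_offsets.push(data.len() as u64);
        OffsetBytesColumn { start_offsets, data }
    }

    /// Two offset reads and one slice: no dictionary involved.
    fn get(&self, doc: usize) -> &[u8] {
        let start = self.start_offsets[doc] as usize;
        let end = self.start_offsets[doc + 1] as usize;
        &self.data[start..end]
    }
}
```

Empty payloads need no special case here: two consecutive equal offsets yield an empty slice.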

Since Tantivy 0.20, bytes and string fast fields are dictionary encoded. We have a term ordinal column, of any cardinality, and a dictionary that stores the term_ord -> bytes payload mapping in an SSTable.
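For comparison, the dictionary-encoded layout can be sketched as follows. This is an illustrative model only: `DictBytesColumn` and its methods are hypothetical names, a sorted `Vec` stands in for the SSTable, and the column is single-valued for simplicity.

```rust
use std::collections::BTreeSet;

/// Hypothetical model of a dictionary-encoded column (not tantivy's real types).
struct DictBytesColumn {
    /// Sorted unique values; the index of a value is its term ordinal.
    /// A real implementation would store this in an SSTable.
    dictionary: Vec<Vec<u8>>,
    /// One term ordinal per doc.
    term_ords: Vec<u32>,
}

impl DictBytesColumn {
    fn from_values(values: &[&[u8]]) -> Self {
        // Deduplicate and sort the values to build the dictionary.
        let unique: BTreeSet<&[u8]> = values.iter().copied().collect();
        let dictionary: Vec<Vec<u8>> = unique.into_iter().map(|v| v.to_vec()).collect();
        // Map every doc to the ordinal of its value.
        let term_ords = values
            .iter()
            .map(|&v| {
                dictionary
                    .binary_search_by(|entry| entry.as_slice().cmp(v))
                    .expect("every value was inserted into the dictionary") as u32
            })
            .collect();
        DictBytesColumn { dictionary, term_ords }
    }

    /// Cheap: callers can compare or group by ordinal without touching the bytes.
    fn term_ord(&self, doc: usize) -> u32 {
        self.term_ords[doc]
    }

    fn ord_to_bytes(&self, ord: u32) -> &[u8] {
        &self.dictionary[ord as usize]
    }

    /// Full lookup: one ordinal read plus one dictionary lookup.
    fn get(&self, doc: usize) -> &[u8] {
        self.ord_to_bytes(self.term_ord(doc))
    }
}
```

When every value is unique, the dictionary in this model holds as many entries as there are docs, so the ordinal column is pure overhead.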

That new approach works well when strings are redundant, but it is counterproductive when they are not: every lookup pays for the extra term ordinal indirection without any deduplication benefit in return.

We probably need to reintroduce a non-dictionary encoding and let the user configure which fast field encoding they need.

The access interface will also have to change, as there is a benefit to exposing the term ordinals for dictionary-encoded columns.

(Whoever picks up this ticket, please write a design first.)

Jun 12 '23 08:06 fulmicoton

A potential solution that avoids having the user choose between dictionary and non-dictionary encoding based on cardinality might be to do something like Parquet: the column chunk writer falls back to writing plain data pages once the dictionary page grows beyond some threshold (x MB), so that only a small portion of the data for that chunk is dictionary encoded. https://parquet.apache.org/docs/file-format/data-pages/encodings/#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8
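A rough sketch of that fallback idea, to make the discussion concrete. Everything here is hypothetical (`FallbackWriter`, `Encoded`, `max_dict_bytes`), and it is simplified to work per value rather than per page as Parquet does; it is not a proposal for tantivy's actual on-disk format.

```rust
use std::collections::HashMap;

/// How a single value ended up being encoded (hypothetical).
#[derive(Debug, PartialEq)]
enum Encoded {
    /// Reference to an entry in the dictionary.
    DictId(u32),
    /// Raw bytes, written after the fallback was triggered.
    Plain(Vec<u8>),
}

/// Hypothetical writer that starts dictionary encoded and falls back to plain.
struct FallbackWriter {
    /// Threshold on the total size of the dictionary payloads, in bytes.
    max_dict_bytes: usize,
    dict_bytes: usize,
    dict: HashMap<Vec<u8>, u32>,
    fallen_back: bool,
    out: Vec<Encoded>,
}

impl FallbackWriter {
    fn new(max_dict_bytes: usize) -> Self {
        FallbackWriter {
            max_dict_bytes,
            dict_bytes: 0,
            dict: HashMap::new(),
            fallen_back: false,
            out: Vec::new(),
        }
    }

    fn push(&mut self, value: &[u8]) {
        if !self.fallen_back {
            // Known value: emit its existing id.
            if let Some(&id) = self.dict.get(value) {
                self.out.push(Encoded::DictId(id));
                return;
            }
            // New value that still fits under the threshold: grow the dictionary.
            if self.dict_bytes + value.len() <= self.max_dict_bytes {
                let id = self.dict.len() as u32;
                self.dict.insert(value.to_vec(), id);
                self.dict_bytes += value.len();
                self.out.push(Encoded::DictId(id));
                return;
            }
            // Dictionary would grow too large: everything from here on is plain.
            self.fallen_back = true;
        }
        self.out.push(Encoded::Plain(value.to_vec()));
    }
}
```

With a low-cardinality column the writer never leaves dictionary mode; with mostly unique values it switches to plain early, so the user would not have to pick an encoding up front. A reader would then have to handle both representations within one column.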

Jul 04 '23 08:07 shaeqahmed