quickwit
make tokenizer emit at least one empty token on empty strings
Description
always emit at least one token when indexing/querying with an empty field
How was this PR tested?
tested manually. todo: add integration test
On SSD:
Average search latency is 0.996x that of the reference (lower is better).
Ref run id: 4045, ref commit: 105aa7d146531358aa62e9b7058029bae622a2eb
On GCS:
Average search latency is 0.925x that of the reference (lower is better).
Ref run id: 4087, ref commit: 826f10f0d7ba1a448795890fe296941b84872c57
Why do we need this?
Today {"field": ""} and {} produce the same tokens with most tokenizers (all except raw), namely no token at all. So it's not possible to search only for documents where the field is an empty string. This change makes that possible.
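To make the difference concrete, here is a minimal sketch of the behavior described above. It does not use Quickwit's or tantivy's actual tokenizer API: `tokenize`, `tokenize_with_empty_token` and `index_tokens` are hypothetical helpers, and the whitespace-splitting tokenizer is only a stand-in. It also assumes that any input yielding no tokens (including whitespace-only input) gets the empty token, which the PR does not state.

```rust
/// Hypothetical stand-in for a simple tokenizer: split on whitespace and
/// lowercase each piece. This is NOT Quickwit's or tantivy's real API.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Same as `tokenize`, but emits a single empty token when the input
/// yields no token at all (assumption: this also covers whitespace-only input).
fn tokenize_with_empty_token(text: &str) -> Vec<String> {
    let mut tokens = tokenize(text);
    if tokens.is_empty() {
        tokens.push(String::new());
    }
    tokens
}

/// Tokens produced for an optional field value.
/// `None` models `{}` (field absent); `Some("")` models `{"field": ""}`.
fn index_tokens(value: Option<&str>, tokenizer: fn(&str) -> Vec<String>) -> Vec<String> {
    match value {
        Some(text) => tokenizer(text),
        None => Vec::new(),
    }
}
```

With `tokenize`, `index_tokens(Some(""), tokenize)` and `index_tokens(None, tokenize)` are both empty, so the two documents are indistinguishable. With `tokenize_with_empty_token`, the empty-string document carries one empty token that a query for the empty string can match, while the document without the field still has none.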
Tokenization is already quite expensive during indexing. I think this change may add some non-negligible overhead there.