make tokenizer emit at least one empty token on empty strings

Open trinity-1686a opened this issue 1 year ago • 4 comments

Description

Always emit at least one token when indexing or querying an empty field.

How was this PR tested?

Tested manually. TODO: add an integration test.

trinity-1686a, Oct 30 '24 22:10

On SSD:

Average search latency is 0.996x that of the reference (lower is better).
Ref run id: 4045, ref commit: 105aa7d146531358aa62e9b7058029bae622a2eb

On GCS:

Average search latency is 0.925x that of the reference (lower is better).
Ref run id: 4087, ref commit: 826f10f0d7ba1a448795890fe296941b84872c57

github-actions[bot], Oct 30 '24 23:10

Why do we need this?

PSeitz, Oct 31 '24 02:10

> Why do we need this?

Today, {"field": ""} and {} produce the same tokens with most tokenizers (all except raw): no token at all. As a result, it's not possible to search for empty strings only. This change makes that possible.
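A minimal sketch of the difference, for illustration only. It is not the code from this PR and does not use tantivy's tokenizer API. `simple_tokenize`, `tokens_today`, and `tokens_proposed` are hypothetical names, and the rule "emit one empty token whenever the tokenizer yields nothing" is an assumption based on the description above.

```rust
/// Toy whitespace tokenizer standing in for a real one (hypothetical helper).
/// Like most tokenizers, it yields no tokens for an empty string.
fn simple_tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Current behavior: a missing field (`None`) and an empty string (`Some("")`)
/// both end up with no tokens, so they cannot be told apart at search time.
fn tokens_today(value: Option<&str>) -> Vec<String> {
    match value {
        Some(text) => simple_tokenize(text),
        None => Vec::new(),
    }
}

/// Proposed behavior (assumed): when the field is present but the tokenizer
/// produces nothing, emit a single empty token so the value becomes searchable.
fn tokens_proposed(value: Option<&str>) -> Vec<String> {
    match value {
        Some(text) => {
            let tokens = simple_tokenize(text);
            if tokens.is_empty() {
                vec![String::new()]
            } else {
                tokens
            }
        }
        // A missing field still produces no tokens at all.
        None => Vec::new(),
    }
}
```

In this sketch, `tokens_today(Some(""))` and `tokens_today(None)` are both empty, while `tokens_proposed(Some(""))` contains one empty token and `tokens_proposed(None)` stays empty, which is what would let a query match empty strings only.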

trinity-1686a, Oct 31 '24 09:10

> > Why do we need this?
>
> Today {"field": ""} and {} get the same tokens for most tokenizers (except raw), which is no token at all. It's not possible to search for empty strings only. This allows doing just that

Tokenization is already quite expensive during indexing; I think this change may add some non-negligible overhead there.

PSeitz, Oct 31 '24 12:10