lance Inverted indices should be used to speed up string filters.

If a string column has an FTS index, then we should have enough information to speed up a variety of string-based filters. Here is a listing (currently very partial, as I don't know what's possible):

[ ] Equality queries
[ ] Range queries?
[ ] #3416

Jan 24 '25 21:01 westonpace

Is there any worry that tokenization could mess with this? I think in general it only casts a wider net, through:

Lowercasing
Stemming (running, run -> same token)
ASCII folding (café, cafe -> same token)
Stop word removal -> fewer words to match on.

It should be fine, but worth being aware of these transformations.
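To make the "wider net" concrete, here is a minimal sketch of such a tokenizer in plain Python. It is not Lance's actual tokenizer: the stop-word list, the `toy_stem` suffix rule, and all function names are assumptions made up for illustration.

```python
import re
import unicodedata

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of"}


def ascii_fold(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token):
    # Crude "-ing" stripping; a real stemmer is far more careful.
    if token.endswith("ing") and len(token) > 5:
        stem = token[:-3]
        if stem[-1] == stem[-2]:
            stem = stem[:-1]  # "runn" -> "run"
        return stem
    return token


def tokenize(text):
    # Lowercase, fold to ASCII, split into words, drop stop words, stem.
    folded = ascii_fold(text.lower())
    words = re.findall(r"[a-z0-9]+", folded)
    return [toy_stem(w) for w in words if w not in STOP_WORDS]
```

With these assumed rules, `tokenize("Running")` and `tokenize("run")` both give `["run"]`, and `tokenize("Café")` matches `tokenize("cafe")`, so an index built on these tokens cannot tell the original strings apart by itself.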

Jan 24 '25 23:01 wjones127

Hmm, yeah, it would be a problem if contains('run', 'running') returned true. Maybe a specialized index then. A GIN index like label list could work.

Jan 25 '25 06:01 westonpace

> Hmm, yeah, it would be a problem if contains('run', 'running') returned true. Maybe a specialized index then. A GIN index like label list could work.

I was thinking you could still use the FTS index, but would have a "refine" step where you take the results and do the exact contains test after. Not optimal in all cases, but could potentially work.
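A rough sketch of that two-step idea, in plain Python rather than against the real index: the in-memory `build_index`, the `normalize` rule, and the sample rows are all hypothetical stand-ins, not Lance APIs.

```python
from collections import defaultdict


def normalize(word):
    # Hypothetical stand-in for the FTS tokenizer: lowercase + crude "-ing" stemming.
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if word[-1] == word[-2]:
            word = word[:-1]  # "runn" -> "run"
    return word


def build_index(rows):
    # Toy inverted index: normalized token -> set of row ids.
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        for word in text.split():
            index[normalize(word)].add(row_id)
    return index


def contains_with_refine(rows, index, needle):
    # Step 1: the index returns candidate rows (possibly too many).
    candidates = index.get(normalize(needle), set())
    # Step 2 (refine): re-check the exact, case-sensitive substring test.
    return sorted(r for r in candidates if needle in rows[r])


ROWS = ["I went for a run", "She is running late", "Running shoes", "no match here"]
INDEX = build_index(ROWS)
```

Here the index alone would return rows 0, 1, and 2 for the needle `"running"`, while `contains_with_refine(ROWS, INDEX, "running")` keeps only row 1. Note that the candidate set is only a superset when the needle lines up with whole tokens; a substring such as "run" inside "prune" would be missed by this toy index.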

Jan 27 '25 22:01 wjones127