Inverted indices should be used to speed up string filters.

Open westonpace opened this issue 1 year ago • 3 comments

If a string column has an FTS index, then we should have enough information to speed up a variety of string-based filters. Here is a listing (currently very partial, as I don't know what's possible):

  • [ ] Equality queries
  • [ ] Range queries?
  • [ ] #3416

westonpace avatar Jan 24 '25 21:01 westonpace

Is there any worry that tokenization could mess with this? I think in general it only casts a wider net, through:

  1. Lowercasing
  2. Stemming (running, run -> same token)
  3. ASCII folding (café, cafe -> same token)
  4. Stop word removal -> fewer words to match on.

It should be fine, but it's worth being aware of these transformations.
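
To make the "wider net" concrete, here is a minimal Python sketch of the four transformations. It is a toy and not Lance's actual tokenizer: `toy_tokens`, `naive_stem`, and the stop word list are hypothetical names invented for illustration, and the stemmer only handles the "-ing" case from the example above.

```python
import re
import unicodedata

# Illustrative subset only; a real tokenizer ships a much longer list.
STOP_WORDS = {"the", "a", "an", "is", "of"}

def ascii_fold(text):
    """Strip combining marks so that 'café' becomes 'cafe'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def naive_stem(token):
    """Toy stemmer: drop a trailing 'ing' and undouble the final consonant."""
    if token.endswith("ing") and len(token) > 5:
        stem = token[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    return token

def toy_tokens(text):
    """Lowercase, ASCII-fold, split on non-alphanumerics, drop stop words, stem."""
    words = re.findall(r"[a-z0-9]+", ascii_fold(text).lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

print(toy_tokens("Running"), toy_tokens("run"))  # both map to the same token
print(toy_tokens("Café"), toy_tokens("cafe"))    # both map to the same token
print(toy_tokens("The cafe"))                    # the stop word is dropped
```

Distinct strings collapse to the same tokens, so a token lookup can return rows that an exact string filter would reject.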

wjones127 avatar Jan 24 '25 23:01 wjones127

Hmm, yeah, it would be a problem if contains('run', 'running') returned true. Maybe a specialized index then. A GIN index like label list could work.

westonpace avatar Jan 25 '25 06:01 westonpace

> Hmm, yeah, it would be a problem if contains('run', 'running') returned true. Maybe a specialized index then. A GIN index like label list could work.

I was thinking you could still use the FTS index, but would have a "refine" step where you take the results and do the exact contains test after. Not optimal in all cases, but could potentially work.
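
A rough sketch of that refine idea in Python, applied to an equality filter. Everything here is hypothetical (`toy_tokens`, `build_index`, `candidates_for`, `filter_equals`) and uses a plain in-memory dict rather than any real Lance API; the tokenizer is deliberately lossy so that the candidate set is a superset of the true matches.

```python
import re
from collections import defaultdict

def toy_tokens(text):
    """Lossy normalization (assumed): lowercase, split, crude '-ning' stripping."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t[:-4] if t.endswith("ning") and len(t) > 6 else t for t in tokens]

def build_index(rows):
    """Map each token to the set of row ids whose value contains it."""
    index = defaultdict(set)
    for row_id, value in enumerate(rows):
        for token in toy_tokens(value):
            index[token].add(row_id)
    return index

def candidates_for(rows, index, value):
    """Step 1: use the index to get a superset of the matching row ids."""
    tokens = toy_tokens(value)
    if not tokens:
        # Nothing to look up (e.g. empty string), so fall back to a full scan.
        return set(range(len(rows)))
    return set.intersection(*[index.get(t, set()) for t in tokens])

def filter_equals(rows, index, value):
    """Step 2 (refine): run the exact comparison on the candidates only."""
    return sorted(i for i in candidates_for(rows, index, value) if rows[i] == value)

rows = ["running", "run", "Run", "walk"]
index = build_index(rows)
print(candidates_for(rows, index, "run"))  # rows 0, 1 and 2 are candidates
print(filter_equals(rows, index, "run"))   # only row 1 survives the exact test
```

The index only prunes the rows to check; correctness comes from the exact comparison. A substring filter that starts or ends in the middle of a token would not be covered by this candidate step and would need different handling.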

wjones127 avatar Jan 27 '25 22:01 wjones127