lance icon indicating copy to clipboard operation
lance copied to clipboard

feat: add support for ngram indices

Open westonpace opened this issue 9 months ago • 1 comments

Ngram indices are indices that can speed up various string filters. To start with they will be able to speed up contains(col, 'substr') filters. They work by creating a bitmap for each ngram (short sequence of characters) in a value. For example, consider an index of 1-grams. This would create a bitmap for each letter of the alphabet. Then, at query time, we can use this to narrow down which strings could potentially satisfy the query.

This is the first scalar index that requires a "recheck" step. It doesn't tell us exactly which rows satisfy the query. It only narrows down the list. Other indices that might behave like this are bloom filters and zone maps. This means that we need to still apply the filter on the results of the index search. A good portion of this PR is adding support for this concept into the scanner.

westonpace avatar Feb 21 '25 00:02 westonpace

Codecov Report

Attention: Patch coverage is 55.65611% with 392 lines in your changes missing coverage. Please review.

Project coverage is 78.46%. Comparing base (b185a27) to head (3d43c6c).

Files with missing lines Patch % Lines
rust/lance-index/src/scalar/ngram.rs 34.44% 227 Missing and 9 partials :warning:
rust/lance-index/src/scalar/expression.rs 37.03% 65 Missing and 3 partials :warning:
rust/lance-index/src/scalar/inverted/index.rs 15.00% 17 Missing :warning:
rust/lance-index/src/scalar.rs 69.23% 11 Missing and 1 partial :warning:
rust/lance/src/index/scalar.rs 71.87% 6 Missing and 3 partials :warning:
rust/lance-index/src/scalar/btree.rs 38.46% 6 Missing and 2 partials :warning:
rust/lance-index/src/scalar/label_list.rs 52.94% 7 Missing and 1 partial :warning:
rust/lance/src/dataset/scanner.rs 94.52% 0 Missing and 8 partials :warning:
rust/lance-core/src/utils/mask.rs 0.00% 7 Missing :warning:
rust/lance/src/io/exec/scalar_index.rs 40.00% 3 Missing and 3 partials :warning:
... and 6 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3468      +/-   ##
==========================================
- Coverage   78.68%   78.46%   -0.22%     
==========================================
  Files         251      252       +1     
  Lines       93030    93792     +762     
  Branches    93030    93792     +762     
==========================================
+ Hits        73201    73597     +396     
- Misses      16853    17200     +347     
- Partials     2976     2995      +19     
Flag Coverage Δ
unittests 78.46% <55.65%> (-0.22%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Feb 21 '25 01:02 codecov-commenter

This looks good to me. I have done some local testing with a 10M row dataset and the impact is significant, at least for selective searches. The index is slow to build and could benefit from parallelization but that doesn't need to hold things up for now.

wkalt avatar Feb 25 '25 16:02 wkalt