lance
lance copied to clipboard
feat: add support for ngram indices
Ngram indices are indices that can speed up various string filters. To start with they will be able to speed up contains(col, 'substr') filters. They work by creating a bitmap for each ngram (short sequence of characters) in a value. For example, consider an index of 1-grams. This would create a bitmap for each letter of the alphabet. Then, at query time, we can use this to narrow down which strings could potentially satisfy the query.
This is the first scalar index that requires a "recheck" step. It doesn't tell us exactly which rows satisfy the query. It only narrows down the list. Other indices that might behave like this are bloom filters and zone maps. This means that we need to still apply the filter on the results of the index search. A good portion of this PR is adding support for this concept into the scanner.
Codecov Report
Attention: Patch coverage is 55.65611% with 392 lines in your changes missing coverage. Please review.
Project coverage is 78.46%. Comparing base (
b185a27) to head (3d43c6c).
Additional details and impacted files
@@ Coverage Diff @@
## main #3468 +/- ##
==========================================
- Coverage 78.68% 78.46% -0.22%
==========================================
Files 251 252 +1
Lines 93030 93792 +762
Branches 93030 93792 +762
==========================================
+ Hits 73201 73597 +396
- Misses 16853 17200 +347
- Partials 2976 2995 +19
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 78.46% <55.65%> (-0.22%) |
:arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
This looks good to me. I have done some local testing with a 10M row dataset and the impact is significant, at least for selective searches. The index is slow to build and could benefit from parallelization but that doesn't need to hold things up for now.