use sqlite's fts5 for tf/idf index predicates

Open fgregg opened this issue 4 years ago • 3 comments

https://sqlite.org/fts5.html

fgregg avatar Mar 19 '21 17:03 fgregg

if i want to roll my own scoring, see https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/

fgregg avatar Mar 19 '21 17:03 fgregg

got a spike going here: https://github.com/dedupeio/dedupe/tree/sqlite_index_predicate

this uses fts5, which comes with bm25 as its default scorer. unfortunately, bm25 is not a normalized score (it has no fixed range such as [0, 1]), so we can't have threshold-defined canopies.
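
to make the problem concrete, here is a minimal sketch (not the code on the spike branch) that prints fts5's built-in bm25 scores. the table name `docs`, the column `body`, and the sample rows are made up for illustration, and it assumes python's sqlite3 module was built with fts5 enabled.

```python
import sqlite3

# Assumption: the underlying SQLite library was compiled with FTS5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [
        ("acme hardware store",),
        ("acme acme acme",),
        ("joes hardware",),
        ("corner bakery",),
        ("main street deli",),
    ],
)

# bm25() is FTS5's built-in ranking function. It returns negated scores,
# so sorting ascending puts the best matches first.
rows = conn.execute(
    "SELECT body, bm25(docs) AS score FROM docs "
    "WHERE docs MATCH ? ORDER BY score",
    ("acme",),
).fetchall()

for body, score in rows:
    print(f"{score:.4f}  {body}")
```

the printed scores are negative and their magnitude depends on the corpus, so a fixed cutoff like "score >= 0.5" does not mean the same thing from one dataset to the next.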

so, we'll need to use a custom scorer. fts4 exposes "matchinfo" which makes it pretty easy to do that (a few examples from peewee).
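
a rough sketch of what a matchinfo-based scorer could look like from python. this is not the peewee code and not a real tf/idf cosine score: `idf_coverage` is a hypothetical scorer that returns the idf-weighted share of query terms found in a row, which is enough to show a score bounded in [0, 1]. the table, rows, and the 0.75 threshold are assumptions for illustration, and it assumes sqlite was built with fts4.

```python
import math
import sqlite3
import struct


def idf_coverage(blob):
    """Hypothetical scorer: idf-weighted share of query terms present in a row."""
    # matchinfo() returns a blob of 32-bit unsigned ints in native byte order.
    ints = struct.unpack(f"={len(blob) // 4}I", blob)
    # Format 'pcnx': phrase count, column count, row count, then 3 ints
    # per (phrase, column): hits in this row, hits in all rows, rows with hits.
    n_phrases, n_cols, n_rows = ints[0], ints[1], ints[2]
    matched = total = 0.0
    for phrase in range(n_phrases):
        hits_here = 0
        docs_with_hits = 0
        for col in range(n_cols):
            base = 3 + 3 * (phrase * n_cols + col)
            hits_here += ints[base]
            docs_with_hits = max(docs_with_hits, ints[base + 2])
        # Rarer terms get more weight; max(..., 1) avoids dividing by zero.
        weight = math.log(1.0 + n_rows / max(docs_with_hits, 1))
        total += weight
        if hits_here:
            matched += weight
    return matched / total if total else 0.0


conn = sqlite3.connect(":memory:")
conn.create_function("idf_coverage", 1, idf_coverage)
conn.execute("CREATE VIRTUAL TABLE docs USING fts4(body)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [
        ("acme hardware store",),
        ("acme acme acme",),
        ("joes hardware",),
        ("corner bakery",),
        ("main street deli",),
    ],
)

rows = conn.execute(
    "SELECT body, idf_coverage(matchinfo(docs, 'pcnx')) AS score "
    "FROM docs WHERE docs MATCH ? ORDER BY score DESC",
    ("acme OR hardware",),
).fetchall()

# Because the score is bounded, a fixed threshold can define a canopy.
THRESHOLD = 0.75  # hypothetical cutoff
canopy = [body for body, score in rows if score >= THRESHOLD]
```

with this toy data only the row containing both query terms clears the threshold. a real predicate would probably want term frequencies and document length in the score as well, which matchinfo can also supply.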

It's also possible to write custom scorers for fts5, but i couldn't find any third-party examples. Here's the bm25 "auxiliary function", which could serve as a prototype.

fgregg avatar Mar 21 '21 21:03 fgregg

fts5 matchinfo implementation: https://github.com/sqlite/sqlite/blob/master/ext/fts5/fts5_test_mi.c

fgregg avatar Mar 22 '21 13:03 fgregg