tantivy
tantivy copied to clipboard
Levenshtein distance score with custom score function (POC for #998)
Hi, I've tried to implement the ability to use a custom score function (Fn(u8) -> f32) for fuzzy queries, which take the DFA distance as the input and return the score for the term. This is a proof of concept for #998, and I am opening this as a draft so that people can more easily see the diff and comment on it. If the solution is good, it could perhaps be merged into @lerouxrgd's branch first, and then #998 could be merged, or whatever people see fit.
Codecov Report
:exclamation: No coverage uploaded for pull request base (
main@8e7fe06). Click here to learn what that means. The diff coverage isn/a.
@@ Coverage Diff @@
## main #1244 +/- ##
=======================================
Coverage ? 94.10%
=======================================
Files ? 206
Lines ? 35072
Branches ? 0
=======================================
Hits ? 33003
Misses ? 2069
Partials ? 0
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update 8e7fe06...47f8723. Read the comment docs.
I'll run rustfmt on this in a moment
Another question: is there any way to factor in the length of the suffix of the term (when using the query as the prefix) to the score? If possible it seems like something that should be exposed.
Another question: is there any way to factor in the length of the suffix of the term (when using the query as the prefix) to the score? If possible it seems like something that should be exposed.
Great question... Factoring in the length of the overall term should be relatively easy.
Factoring in the length of the suffix is harder, but we can probably cook something if we dig deep in the levenshtein automata crate. I suggest we use the length - querylength as a first approximation. That should probably do the job.
One case where that will give us different result is
-A: target: elephant, query: eleh vs -B: target: elephant, query: elepj
In this case, B will get a better score. But A could be the deletion of a character bringing us to the actual same suffix size, same levenshtein distance.
I suggest we use the length - querylength as a first approximation. That should probably do the job.
This sounds good to me. How can we expose this to the user in such a way that they can factor it in?
@Restioson a funciton is good I think.
Fn(lev_distance: u8, approximate_suffix_length: u32) -> f32
We can cache the result for lev_distance = 0,1,2,3,>=3 and approximate_suffix_length = 0..10 or something like that.
As far as caching goes, would it maybe be cleaner to let the user do their own caching? They could always wrap it with a memoisation function if they wanted to, perhaps? I'm just wondering if caching might be extraneous in a situation of just reading off a predefined const value table.
As far as caching goes, would it maybe be cleaner to let the user do their own caching? They could always wrap it with a memoisation function if they wanted to, perhaps? I'm just wondering if caching might be extraneous in a situation of just reading off a predefined const value table.
I agree that caching should be left to the user, I imagine in most cases anyway the calculations are going to be pretty simple and easily optimised by the compiler.