tantivy icon indicating copy to clipboard operation
tantivy copied to clipboard

Levenshtein distance score with custom score function (POC for #998)

Open Restioson opened this issue 3 years ago • 8 comments

Hi, I've tried to implement the ability to use a custom score function (Fn(u8) -> f32) for fuzzy queries, which take the DFA distance as the input and return the score for the term. This is a proof of concept for #998, and I am opening this as a draft so that people can more easily see the diff and comment on it. If the solution is good, it could perhaps be merged into @lerouxrgd's branch first, and then #998 could be merged, or whatever people see fit.

Restioson avatar Jan 01 '22 13:01 Restioson

Codecov Report

:exclamation: No coverage uploaded for pull request base (main@8e7fe06). Click here to learn what that means. The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1244   +/-   ##
=======================================
  Coverage        ?   94.10%           
=======================================
  Files           ?      206           
  Lines           ?    35072           
  Branches        ?        0           
=======================================
  Hits            ?    33003           
  Misses          ?     2069           
  Partials        ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 8e7fe06...47f8723. Read the comment docs.

codecov-commenter avatar Jan 03 '22 07:01 codecov-commenter

I'll run rustfmt on this in a moment

Restioson avatar Jan 03 '22 08:01 Restioson

Another question: is there any way to factor in the length of the suffix of the term (when using the query as the prefix) to the score? If possible it seems like something that should be exposed.

Restioson avatar Jan 03 '22 13:01 Restioson

Another question: is there any way to factor in the length of the suffix of the term (when using the query as the prefix) to the score? If possible it seems like something that should be exposed.

Great question... Factoring in the length of the overall term should be relatively easy.

Factoring in the length of the suffix is harder, but we can probably cook something if we dig deep in the levenshtein automata crate. I suggest we use the length - querylength as a first approximation. That should probably do the job.

One case where that will give us different result is

-A: target: elephant, query: eleh vs -B: target: elephant, query: elepj

In this case, B will get a better score. But A could be the deletion of a character bringing us to the actual same suffix size, same levenshtein distance.

fulmicoton avatar Jan 03 '22 22:01 fulmicoton

I suggest we use the length - querylength as a first approximation. That should probably do the job.

This sounds good to me. How can we expose this to the user in such a way that they can factor it in?

Restioson avatar Jan 04 '22 08:01 Restioson

@Restioson a funciton is good I think.

Fn(lev_distance: u8, approximate_suffix_length: u32) -> f32

We can cache the result for lev_distance = 0,1,2,3,>=3 and approximate_suffix_length = 0..10 or something like that.

fulmicoton avatar Jan 12 '22 02:01 fulmicoton

As far as caching goes, would it maybe be cleaner to let the user do their own caching? They could always wrap it with a memoisation function if they wanted to, perhaps? I'm just wondering if caching might be extraneous in a situation of just reading off a predefined const value table.

Restioson avatar Jan 13 '22 10:01 Restioson

As far as caching goes, would it maybe be cleaner to let the user do their own caching? They could always wrap it with a memoisation function if they wanted to, perhaps? I'm just wondering if caching might be extraneous in a situation of just reading off a predefined const value table.

I agree that caching should be left to the user, I imagine in most cases anyway the calculations are going to be pretty simple and easily optimised by the compiler.

ChillFish8 avatar Jun 18 '22 14:06 ChillFish8