texdoc icon indicating copy to clipboard operation
texdoc copied to clipboard

Better threshold for fuzzy search?

Open kberry opened this issue 3 years ago • 3 comments

texdoc foobar finds float.pdf and other float and hep-float docs.

I see there are three characters in common, but it seems like too fuzzy of a match. But maybe I think that only because I'm a human (at least, I think I am; sometimes it seems a fuzzy question :). Just thought I'd mention ...

kberry avatar Jul 25 '22 21:07 kberry

Yes, I agree with you that "foobar" and "float" looks quite different for me as well, but to calculate the Levenshtein distance, it is just three (1. replace "o" with "l," 2. delete "b," 3. replace "r" with "t").

The distance of three is not so significant for long strings, but it is already too large for short strings, like in this case.

One of the solutions for this is using normalized Levenshtein distance. Let $q$ be a query and $c$ be a candidate match for $q$. There are several ways to normalize the distance, but one of the simplest ones can be calculated as follows:

$$d = \frac{\operatorname{Levenshtein}(q, c)}{\max(\operatorname{len}(q), \operatorname{len}(c))}$$

We can use a value for these types of Levenshtein distance variants for the threshold.

We have to consider only a small thing: we already have a config item of fuzzy_level which initially was expected to have a value for Levenshtein distance itself and can only take an integer value. We might want to maintain backward compatibility for this.

wtsnjp avatar Jul 26 '22 04:07 wtsnjp

We need an excellent calculation formula to convert an integer value of fuzzy_level to the threshold for a normalized Levenshtein distance. We want to make fuzzy_level roughly work the same for the keywords whose lengths are around six.

wtsnjp avatar Feb 19 '23 02:02 wtsnjp

Perhaps just using fuzzy_level / 10 as the threshold is OK.

wtsnjp avatar Feb 25 '23 08:02 wtsnjp