texdoc Better threshold for fuzzy search?

texdoc foobar finds float.pdf and other float and hep-float docs.

I see there are three characters in common, but it seems like too fuzzy of a match. But maybe I think that only because I'm a human (at least, I think I am; sometimes it seems a fuzzy question :). Just thought I'd mention ...

Jul 25 '22 21:07 kberry

Yes, I agree that "foobar" and "float" look quite different, but the Levenshtein distance between them is just three: (1) replace the first "o" with "l", (2) delete "b", (3) replace "r" with "t".

A distance of three is not very significant for long strings, but it is already too large for short ones like these: here it amounts to half the length of the six-character query.

One solution is to use a normalized Levenshtein distance. Let $q$ be a query and $c$ be a candidate match for $q$. There are several ways to normalize the distance, but one of the simplest can be calculated as follows:

$$d = \frac{\operatorname{Levenshtein}(q, c)}{\max(\operatorname{len}(q), \operatorname{len}(c))}$$

We can then use the value of such a normalized variant, rather than the raw distance, as the threshold.
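
To make the formula concrete, here is a minimal Python sketch. It is for illustration only and is not texdoc's implementation; the function names `levenshtein` and `normalized_distance` are hypothetical, and returning 0.0 for two empty strings is my own assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and replacements."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # replace ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]


def normalized_distance(q: str, c: str) -> float:
    """Levenshtein(q, c) divided by the length of the longer string."""
    longest = max(len(q), len(c))
    if longest == 0:
        return 0.0  # assumption: two empty strings count as identical
    return levenshtein(q, c) / longest


print(levenshtein("foobar", "float"))          # 3
print(normalized_distance("foobar", "float"))  # 0.5
```

For the pair from this report, the raw distance of 3 becomes 3 / 6 = 0.5, so half of the longer string would have to change.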

There is one small thing to consider: we already have a config item, fuzzy_level, which was originally meant to hold the Levenshtein distance itself and can only take an integer value. We might want to maintain backward compatibility for it.

Jul 26 '22 04:07 wtsnjp

We need a good formula to convert an integer fuzzy_level into a threshold for the normalized Levenshtein distance. Ideally, fuzzy_level should keep working roughly as before for keywords about six characters long.

Feb 19 '23 02:02 wtsnjp

Perhaps just using fuzzy_level / 10 as the threshold is OK.
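
To get a feel for this mapping, here is a small Python sketch (illustrative only, not texdoc code). The helper name `max_allowed_edits` is hypothetical, and accepting a candidate when the normalized distance is less than or equal to the threshold is my assumption. It computes the largest raw distance that still passes a threshold of fuzzy_level / 10 for a given string length.

```python
def max_allowed_edits(fuzzy_level: int, length: int) -> int:
    """Largest raw distance d with d / length <= fuzzy_level / 10.

    Assumes `length` is max(len(query), len(candidate)) and that the
    comparison against the threshold is inclusive. Integer arithmetic
    avoids floating-point rounding at the boundary.
    """
    return (fuzzy_level * length) // 10


# Show what each fuzzy_level would allow for six-character strings.
for level in range(1, 11):
    print(level, max_allowed_edits(level, 6))
```

Under these assumptions, a fuzzy_level of 3 admits only one edit for six-character strings and a fuzzy_level of 5 admits three, so the mapping is somewhat stricter than the raw distance at that length.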

Feb 25 '23 08:02 wtsnjp