fuzzysearch icon indicating copy to clipboard operation
fuzzysearch copied to clipboard

RankFind does not use Levenshtein distance to build results

Open colelawrence opened this issue 8 years ago • 6 comments

RankFind still uses plain Find to determine initial results, then measures Levenshtein distance afterwards. Maybe we can have another RankFindLevenshtein that uses Levenshtein distance to determine results.

colelawrence avatar Sep 08 '15 06:09 colelawrence

I actually played around with something like that for RankFind at one point. What I found was that unless you set a high threshold, most longer targets would never match.

Say you search for "aaa" in "aaabbbbbbbbbbbbbbbbbbb" and your threshold is 15 it wouldn't match. Perhaps that's exactly what you wanted, but I think it might have to be a bit smarter and take length into consideration as well or something... This is basically when I went with the current solution of RankFind :smiley:

Another approach would be to create a RankFindFunc(s, t string, fn func(t, s string) bool) where you can define your own criteria.

lithammer avatar Sep 09 '15 11:09 lithammer

Then you'd have something like this

const threshold = 15

func predicate(s, t string) bool {
    distance := LevenshteinDistance(s, t)  // 19
    return distance < threshold
}

fuzzy.RankFindFunc("aaa", "aaabbbbbbbbbbbbbbbbbbb", predicate)  // false

lithammer avatar Sep 09 '15 11:09 lithammer

Interesting, thank you for the explanations!

On Wed, Sep 9, 2015, 6:08 AM Peter Renström [email protected] wrote:

Then you'd have something like this

const threshold = 15 func predicate(s, t string) bool { distance := LevenshteinDistance(s, t) return distance < threshold } RankFindFunc("aaa", "aaabbbbbbbbbbbbbbbbbbb", predicate)

— Reply to this email directly or view it on GitHub https://github.com/renstrom/fuzzysearch/issues/10#issuecomment-138877551 .

colelawrence avatar Sep 09 '15 14:09 colelawrence

Yeah, so I'm not sure the distance alone is enough to determine a match. Having "a" match "ab" but not "abc" is confusing, it should give at least as many hits as a plain old substring search.

Did you have a good use-case for this? Perhaps I'm just missing something here :smile:

lithammer avatar Sep 09 '15 15:09 lithammer

Yeah, so I use it to create a search engine for my GoCourseSort project, and the reason I use Levenshtein instead of simple match, is so a word in the search query matches a keyword from the database of strings.

I think it is vaguely similar to the way BigTables work (don't quote me).

Imagine book titles: "Gone with the Wind", "Gone Girl", and "The Girls", we create an index of key words with references:

gonegirl := &Book{
  title: "Gone Girl",
}
gonewiththewind := &Book{
  title: "Gone with the Wind",
}
thegirls := &Book{
  title: "The Girls",
}
indexByKeywords := map[string][]*Book {
  "girls": { thegirls },
  "girl": { gonegirl },
  "gone": { gonegirl, gonewiththewind },
  "with": { gonewiththewind },
  "the": { gonewiththewind, thegirls },
  "wind": { gonewiththewind },
}

Now you enter the search term "the girls"

Then I'm Levenshtein ranking "the" against the keys of indexByKeywords, and "girls" against the keys of indexByKeywords, then using a ranking formula based on Levenshtein distance, index of word in search, and index of word in title, for each list of references I get back per search term.

This is important because in my search engine, I don't want order of words to be important in any way, but I need "CS" to match "CSC" and "131" to match "121", because I'm using it for a college course catalog.

colelawrence avatar Sep 10 '15 19:09 colelawrence

You might want to look at my implementation and the fuzzy find

elazarl avatar Jan 07 '16 18:01 elazarl