marisa-trie icon indicating copy to clipboard operation
marisa-trie copied to clipboard

Fuzzy Matching

Open bkj opened this issue 9 years ago • 5 comments
trafficstars

I see that this data structure supports prefix lookups -- does it also support fuzzy lookups (i.e. all records within Levenshtein distance). If that's not supported in this package / this data structure, do you know of any other packages that would let me do in-memory fuzzy searching?

~ Ben

bkj avatar Jan 20 '16 01:01 bkj

In theory you can do fuzzy matching with tries via branching search, but a more efficient way might be to use a Levenshtein automata. I've never used it in Python, so I can't recommend a specific package, but there're some, e.g. leven.

superbobry avatar Jan 23 '16 21:01 superbobry

Ah thanks. I figured that Levenshtein automata would work for this, but I couldn't find a python implementation. The one that you linked computes Levenshtein distance but according to the docs, Levenshtein automata are still to be implemented.

There is a node module (https://www.npmjs.com/package/node-levenshtein-automata) that implements Levenshtein automata, but I haven't had a chance to take a look at it yet.

I may look into implementing a fuzzy (branching) search in marisa-trie in the near future. I imagine there might be some interest in this, since I haven't been able to find much about in-memory fuzzy searching (esp. in Python).

~ Ben

bkj avatar Jan 23 '16 22:01 bkj

JFYI there's another blog post with Python code for building and using the automata.

I may look into implementing a fuzzy (branching) search in marisa-trie in the near future. I imagine there might be some interest in this, since I haven't been able to find much about in-memory fuzzy searching (esp. in Python).

Feel free to submit a PR. I'm not sure if anyone uses tries for insertions/deletions (as in Leveshtein distance), though, so if you come up with a simpler mismatch-only solution, it would be useful as well.

superbobry avatar Jan 23 '16 23:01 superbobry

Hey,

I think it won't be possible to use marisa-trie efficiently for that, as it lacks an ability to do step-wise walking (see https://code.google.com/p/marisa-trie/issues/detail?id=9 and https://github.com/kmike/marisa-trie/issues/20). https://github.com/kmike/DAWG or https://github.com/kmike/DAWG-Python may be a better fit.

kmike avatar Jan 24 '16 09:01 kmike

Here's a fast Levenstein search. It uses ternary search trees. https://github.com/mattandahalfew/Levenshtein_search

fgregg avatar Sep 18 '17 19:09 fgregg