EditTransducer icon indicating copy to clipboard operation
EditTransducer copied to clipboard

Timing of Levenshtein Automaton

Open funderburkjim opened this issue 7 years ago • 1 comments

Hi! I carried through a test of the Levenshtein Automaton matching for a lexicon of about 200k Sanskrit words. The work is in this repository, specifically in the ejf07 directory.

The automaton usage works fine, and is elegant. However, the efficiency was less than I had hoped.

Does the 5+ sec per closest match surprise you? Is there some optimization step that would drastically reduce this?

funderburkjim avatar Aug 17 '18 20:08 funderburkjim

When the lexicon is reasonably large I am not shocked that matching could take up to a few seconds. A few things that might help:

  • Use a smaller lexicon (obviously)
  • Consider using closest_match rather than closest_matches so as to avoid the pruning step
  • If certain types of edit operations don't make sense for your application, you can disable them; for instance if you believe the relationship between the query and the lexicon entry never involves a deletion, you can remove the deletion arcs in the edit factors
  • You can put an upper bound on the distance between the input and the output, which will drop this entire operation into a lower complexity class. I've made a sketchy example of how you might implement this here: https://gist.github.com/kylebgorman/9bce642c1b986e96ff6ce3a518769c74

I hope at some point to write that method up.

On Fri, Aug 17, 2018 at 4:44 PM funderburkjim [email protected] wrote:

Hi! I carried through a test of the Levenshtein Automaton matching for a lexicon of about 200k Sanskrit words. The work is in this repository https://github.com/funderburkjim/pynini-learn, specifically in the ejf07 directory.

The automaton usage works fine, and is elegant. However, the efficiency was less than I had hope.

Does the 5+ sec per closest match surprise you? Is there some optimization step that would drastically reduce this?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/kylebgorman/EditTransducer/issues/1, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJuOe54SaLL7pLk6RB31htg4_GFo08Hks5uRyshgaJpZM4WCK8x .

kylebgorman avatar Aug 17 '18 21:08 kylebgorman