LibLangly
LibLangly copied to clipboard
FuzzyEquals should be using a better edit-distance algorithm
Currently, FuzzyEquals
makes use of the Levenshtein edit-distance algorithm. This counts substitutions, insertions, and deletions, but not transpositions. FuzzyEquals()
should take into consideration all four edits.
Damerau-Levenshtein edit-distance is probably the best compromise between accuracy and performance for general purpose use (other edit-distance algorithms should still be implemented). However there's two pitfalls when finding a good implementation for this. As a result, lifting an existing implementation isn't likely.
- The overwhelming majority of implementations actually implement Optimal String Alignment, which doesn't return the same results. OSA is not the same algorithm, although it is very similar.
- The few that implement DL implement it incorrectly. I've emailed the authors about this, and have received no response within a week, so I'm not particularly hopeful about that getting fixed.
So pretty par-for-the-norm with Comp. Sci. stuff.
These may be useful as far as a good implementation goes.
Hyyrö, H. A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit Distances. Nord. J. Comput. 2002.
Balhaf, K. Alsmirat, M. et al. Accelerating Levenshtein and Damerau edit distance algorithms using GPU with unified memory. ICICS. 2017.
Damerau-Levenshtein request is here Stringier/Metrics#2
I mistakenly marked this as a good issue for contributors. Implementation of edit-distance algorithms is a good first issue. However, choosing which one provides the default FuzzyEquals behavior is my decision, and will be based on analysis of several alternatives overall performance curves and capabilities.