stdlib
stdlib copied to clipboard
RFC: Edit distances for strings
Checklist
Please ensure the following tasks are completed before filing an issue.
- [x] Read and understood the Code of Conduct.
- [x] Searched for existing issues and pull requests.
- [x] If this is a general question, searched the FAQ for an existing answer.
- [x] If this is a feature request, the issue name begins with
RFC:.
Description
Description of the issue (or feature request).
This RFC proposes the inclusion of edit distance functions to compare two strings, for example the Levenstein distance.
Related Issues
Does this issue (or feature request) have any related issues?
No.
Questions
Any questions for reviewers?
What would be a good place inside of stdlib for string distance functions?
For string distance functions, in the @stdlib/string namespace. I believe one of the initial reasons for not including Levenstein, Hamming, and others, was whether we needed to expose generic interfaces. For example, Hamming could be used to compute the "distance" between two arrays. I was not sure how general we need to be.
@Planeshifter We should probably turn this into a tracking issue for implementing various metrics (Hamming, Levenshtein, etc), and cross-link to the existing issues.
I looked at Levenshtein distances for strings this morning. From the above discussion I am not sure if this is still wanted?
Anyway, some notes:
For fun, my implementation of Wikipedia's naive recursive algorithm, in Standard ML:
fun min_of_3(a, b, c) = let
val min_1 = Int.min(a, b)
in
Int.min(min_1, c)
end
fun lev(a, []) = length a
| lev([], b) = length b
| lev(x::xs, y::ys) =
(*if x=y then*)
if Char.compare(x, y) = EQUAL then
lev(xs, ys)
else
1 + min_of_3 ( lev(xs, y::ys), lev(x::xs, ys), lev(xs, ys) )
fun timeLevStr s1 s2 = let
val tmr = Timer.startRealTimer()
val d = levstr s1 s2
val t = Timer.checkRealTimer tmr
in
(d, t)
end
Very very very slow. This takes about 5 minutes.
- timeLevStr "alphabetically" "betamaxrecording"; val it = (14,TIME {usec=306635216}) : int * Time.time
Some links:
- https://xlinux.nist.gov/dads/HTML/Levenshtein.html -- links to this java implementation:
- https://people.cs.pitt.edu/~kirk/cs1501/Pruhs/Spring2006/assignments/editdistance/Levenshtein%20Distance.htm -- I have this Java running and the same lev("alphabetically", "betamaxrecording") takes no noticeable time.
- Perl CPAN: https://metacpan.org/pod/Text::Levenshtein
Do we want to do this?
@rgizz Thanks for following up here. We have an initial implementation of Levenshtein as @stdlib/string/base/distances/levenschtein; however, that implementation is currently sub-optimal and needs to be refactored. I have notes, somewhere, which I need to dig up which outline what changes need to be made; however, the changes are non-trivial.
One additional hurdle is that the refactoring needs to address handling of grapheme clusters, as linked to in the following tracking issue: https://github.com/stdlib-js/stdlib/issues/1062.