stdlib icon indicating copy to clipboard operation
stdlib copied to clipboard

RFC: Edit distances for strings

Open Planeshifter opened this issue 7 years ago • 4 comments

Checklist

Please ensure the following tasks are completed before filing an issue.

  • [x] Read and understood the Code of Conduct.
  • [x] Searched for existing issues and pull requests.
  • [x] If this is a general question, searched the FAQ for an existing answer.
  • [x] If this is a feature request, the issue name begins with RFC:.

Description

Description of the issue (or feature request).

This RFC proposes the inclusion of edit distance functions to compare two strings, for example the Levenstein distance.

Related Issues

Does this issue (or feature request) have any related issues?

No.

Questions

Any questions for reviewers?

What would be a good place inside of stdlib for string distance functions?

Planeshifter avatar Feb 17 '18 22:02 Planeshifter

For string distance functions, in the @stdlib/string namespace. I believe one of the initial reasons for not including Levenstein, Hamming, and others, was whether we needed to expose generic interfaces. For example, Hamming could be used to compute the "distance" between two arrays. I was not sure how general we need to be.

kgryte avatar Feb 17 '18 23:02 kgryte

@Planeshifter We should probably turn this into a tracking issue for implementing various metrics (Hamming, Levenshtein, etc), and cross-link to the existing issues.

kgryte avatar Jul 15 '23 00:07 kgryte

I looked at Levenshtein distances for strings this morning. From the above discussion I am not sure if this is still wanted?

Anyway, some notes:

For fun, my implementation of Wikipedia's naive recursive algorithm, in Standard ML:

fun min_of_3(a, b, c) = let
    val min_1 = Int.min(a, b)
    in
        Int.min(min_1, c) 
    end

fun lev(a, []) = length a
|   lev([], b) = length b
|   lev(x::xs, y::ys) = 
    (*if x=y then*)
    if Char.compare(x, y) = EQUAL then 
        lev(xs, ys) 
    else
        1 + min_of_3 ( lev(xs, y::ys), lev(x::xs, ys), lev(xs, ys) )

fun timeLevStr s1 s2 = let
    val tmr = Timer.startRealTimer()
    val d   = levstr s1 s2
    val t   = Timer.checkRealTimer tmr
    in
        (d, t)
    end

Very very very slow. This takes about 5 minutes.

  • timeLevStr "alphabetically" "betamaxrecording"; val it = (14,TIME {usec=306635216}) : int * Time.time

Some links:

  • https://xlinux.nist.gov/dads/HTML/Levenshtein.html -- links to this java implementation:
  • https://people.cs.pitt.edu/~kirk/cs1501/Pruhs/Spring2006/assignments/editdistance/Levenshtein%20Distance.htm -- I have this Java running and the same lev("alphabetically", "betamaxrecording") takes no noticeable time.
  • Perl CPAN: https://metacpan.org/pod/Text::Levenshtein

Do we want to do this?

rgizz avatar Nov 17 '23 17:11 rgizz

@rgizz Thanks for following up here. We have an initial implementation of Levenshtein as @stdlib/string/base/distances/levenschtein; however, that implementation is currently sub-optimal and needs to be refactored. I have notes, somewhere, which I need to dig up which outline what changes need to be made; however, the changes are non-trivial.

One additional hurdle is that the refactoring needs to address handling of grapheme clusters, as linked to in the following tracking issue: https://github.com/stdlib-js/stdlib/issues/1062.

kgryte avatar Nov 17 '23 19:11 kgryte