DiscountedLevenshtein can be less than Levenshtein....?

Open chrislit opened this issue 5 years ago • 4 comments

from abydos.distance import *
lev = Levenshtein()
dlev = DiscountedLevenshtein()
lev.dist('cat', 'hat') < dlev.dist('cat', 'hat')

Is this correct, though?

chrislit avatar Jul 03 '19 20:07 chrislit

Also, this alignment seems sub-optimal. (I think the l in Neil should be matched with an l in Niall.)

cmp.alignment('Niall', 'Neil')
(2.526064024369237, 'N-iall', 'Neil--')

chrislit avatar Jul 03 '19 20:07 chrislit

Fixed the alignment issue in b04ca90b.

chrislit avatar Aug 05 '19 19:08 chrislit

This is a result of the normalizing term in combination with the discounting function. It's worth re-examining this issue to determine if the supplied discounting functions are good, but it's not a bug.

chrislit avatar Jan 07 '20 20:01 chrislit

Do you know of any code example of using abydos for matching two Python string lists by calculating minimal distances?

longRefList = ["Name 0001", "Name 0002", ... "Name 9999"]
mylist = ["Name 2345", "xdsdfj ABCD", "Name x23f"] 
# ... whatever code to calculate, 
# for each item in list 2, the distance & position of closest item in list 1 
# ... to output something like this:
matchOutput = [
    {"dist":0, "position":2344}, 
    {"dist":0.999, "position": 8831}, 
    {"dist":0.5, "position":230}
]

I am particularly interested in using ReesLevenshtein distance, but I wonder how slow this could be. Do you know if anybody has tried using abydos to merge pandas dataframes by minimal-distance matching between two columns?

Thanks a lot in advance for your advice. @abubelinha

abubelinha avatar Feb 26 '22 11:02 abubelinha
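
For the list-matching question above, here is a minimal brute-force sketch rather than a recommended or official approach. It is pure Python so it runs without abydos installed: `normalized_levenshtein` and `closest_matches` are hypothetical names introduced for illustration, the distance is plain Levenshtein divided by the longer string's length (a stand-in, not ReesLevenshtein), and the reference list is generated sample data standing in for the elided one in the question.

```python
from typing import Callable, Dict, List, Sequence, Union


def normalized_levenshtein(a: str, b: str) -> float:
    """Stand-in measure: Levenshtein distance / length of the longer string."""
    if a == b:
        return 0.0
    if not a or not b:
        return 1.0
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(a), len(b))


def closest_matches(
    queries: Sequence[str],
    references: Sequence[str],
    dist: Callable[[str, str], float] = normalized_levenshtein,
) -> List[Dict[str, Union[int, float]]]:
    """For each query, return the distance and position of its nearest reference.

    Brute force: every query is compared with every reference. Ties go to the
    lowest position. Raises ValueError if `references` is empty.
    """
    results = []
    for query in queries:
        best_pos, best_dist = min(
            ((pos, dist(query, ref)) for pos, ref in enumerate(references)),
            key=lambda pair: pair[1],
        )
        results.append({"dist": best_dist, "position": best_pos})
    return results


# Illustrative sample data (assumed; the list in the question is elided).
long_ref_list = ["Name {:04d}".format(n) for n in range(1, 10000)]
my_list = ["Name 2345", "xdsdfj ABCD", "Name x23f"]

match_output = closest_matches(my_list, long_ref_list)
```

Because `dist` is just a callable taking two strings, an abydos measure could presumably be plugged in as `closest_matches(my_list, long_ref_list, dist=ReesLevenshtein().dist)`, assuming that class exposes the same `.dist` method as the measures used earlier in this thread. The cost is one distance call per (query, reference) pair, so it grows with the product of the two list lengths; for large dataframes some form of blocking or candidate filtering before the distance step is likely needed.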