fuzzywuzzy [Feature Suggestion] Token permutation ratio

[Feature Suggestion] Token permutation ratio

Open boechat107 opened this issue 5 years ago • 2 comments

My motivation to suggest this feature comes from this example:

In : fuzz.token_sort_ratio('nfs-e Ndamero', 'numero nfs-e')
Out: 56

At least in my context, it is pretty clear that both strings are very very similar, but Python sorted function sorts tokens in an "undesired" way:

# Using lower() just simplify the analysis.
In : sorted('nfs-e Ndamero'.lower().split())
Out: ['ndamero', 'nfs-e']

In : sorted('numero nfs-e'.split())
Out: ['nfs-e', 'numero']

It might be interesting to add a function that calculates the match for all possible permutations of tokens:

from itertools import permutations, product

def fuzzy_token_permutation_ratio(text1: str, text2: str) -> int:
    def mk_permutations(text):
        return (' '.join(comb) for comb in permutations(text.lower().split()))
    t1_combs = mk_permutations(text1)
    t2_combs = mk_permutations(text2)
    return max(fuzz.ratio(t1, t2) for t1, t2 in product(t1_combs, t2_combs))

In : fuzzy_token_permutation_ratio('nfs-e Ndamero', 'nfs-e numero')
Out: 88

Does it make sense for someone else?

PS.: fuzzywuzzy is a great library!

Oct 06 '19 19:10 boechat107

I posted a similar ticket for this same issue #272.

Contrary to your solution I would propose to more smartly sort the tokens - e.g. by counting matching ngrams between the tokens of both strings - before applying fuzz.ratio.

I fear that calculating the full ratio for each possible permutation will explode computational time, especially when performing this function on a large set of examples, e.g. in database matching.

Apr 02 '20 11:04 shbunder

Uhum, I see. I think you do have a point, @shbunder.

Apr 02 '20 11:04 boechat107

fuzzywuzzy fuzzywuzzy copied to clipboard

[Feature Suggestion] Token permutation ratio

fuzzywuzzy
fuzzywuzzy copied to clipboard