fuzzywuzzy
[Feature Suggestion] Token permutation ratio
My motivation for suggesting this feature comes from this example:
In : fuzz.token_sort_ratio('nfs-e Ndamero', 'numero nfs-e')
Out: 56
At least in my context, it is pretty clear that both strings are very similar, but Python's sorted
function orders the tokens in an "undesired" way:
# Using lower() just to simplify the analysis.
In : sorted('nfs-e Ndamero'.lower().split())
Out: ['ndamero', 'nfs-e']
In : sorted('numero nfs-e'.split())
Out: ['nfs-e', 'numero']
It might be interesting to add a function that calculates the match for all possible permutations of tokens:
from itertools import permutations, product

from fuzzywuzzy import fuzz

def fuzzy_token_permutation_ratio(text1: str, text2: str) -> int:
    def mk_permutations(text):
        return (' '.join(comb) for comb in permutations(text.lower().split()))

    t1_combs = mk_permutations(text1)
    t2_combs = mk_permutations(text2)
    return max(fuzz.ratio(t1, t2) for t1, t2 in product(t1_combs, t2_combs))
In : fuzzy_token_permutation_ratio('nfs-e Ndamero', 'nfs-e numero')
Out: 88
Does this make sense to anyone else?
PS: fuzzywuzzy is a great library!
I posted a similar ticket for this same issue: #272.
Contrary to your solution, I would propose sorting the tokens more intelligently - e.g. by counting matching n-grams between the tokens of both strings - before applying fuzz.ratio.
I fear that calculating the full ratio for every possible permutation will make the computational time explode - two six-token strings already require 6! × 6! = 518,400 ratio computations - especially when running this function over a large set of examples, e.g. in database matching.
Uhum, I see. I think you do have a point, @shbunder.
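
For illustration only, here is a minimal sketch of what the n-gram-based token alignment suggested above could look like; the function name ngram_aligned_ratio, the helper _bigrams, the greedy pairing, and the choice of character bigrams are all assumptions rather than part of the original proposal:

from fuzzywuzzy import fuzz

def _bigrams(token):
    # Character bigrams of a single token.
    return {token[i:i + 2] for i in range(len(token) - 1)}

def ngram_aligned_ratio(text1: str, text2: str) -> int:
    tokens1 = text1.lower().split()
    tokens2 = text2.lower().split()
    # Keep the first string's token order and greedily reorder the second
    # string's tokens so each one lines up with the token it shares the
    # most bigrams with, then apply fuzz.ratio once.
    remaining = list(tokens2)
    aligned = []
    for t1 in tokens1:
        if not remaining:
            break
        best = max(remaining, key=lambda t2: len(_bigrams(t1) & _bigrams(t2)))
        aligned.append(best)
        remaining.remove(best)
    aligned.extend(remaining)  # Any leftover tokens keep their original order.
    return fuzz.ratio(' '.join(tokens1), ' '.join(aligned))

In : ngram_aligned_ratio('nfs-e Ndamero', 'numero nfs-e')
Out: 88

This agrees with the 88 from the permutation version above, but it only needs one bigram comparison per token pair and a single fuzz.ratio call, instead of a factorial number of ratio computations.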