fuzzywuzzy icon indicating copy to clipboard operation
fuzzywuzzy copied to clipboard

How to get fuzzy index?

Open mridu-enigma opened this issue 4 years ago • 3 comments

I am using fuzzywuzzy to look for key-phrase like terms in corpuses.

FWIW, when there's a tie I'd like the tie to be broken by earliest match, so: is there a way to get the fuzzy-index of a match? I tried all functions in fuzz and process (using dir() to discover funcs like QWRatio, etc.)

For instance, I want some mechanism that ranks fuzz.partial_ratio('alex', 'alexa not') higher than fuzz.partial_ratio('alex', 'not alexa'), but for fuzzy matches (that's a simplistic example). How can I achieve this?

mridu-enigma avatar Dec 13 '20 04:12 mridu-enigma

I haven't thought about it deeply, but I think it's an ambiguous problem definition. You probably need a minimum threshold that you would consider a match before being able to implement. If you did that, a naive solution would be to just iterate the string and return the first index with a fuzzy-threshold matching your search key. There might be other more optimized solutions that would scale to longer strings. Hope this helps.

acslater00 avatar Dec 13 '20 06:12 acslater00

@mridu-enigma I don't understand it.

You want index of first token match or index where max similarity (density) is observed?

Say you are searching for o ordin in strings like co-ordinate, what output do you seek? 1 or 3?

@acslater00 I haven't grokked the entire code, perhaps you could shed some light of correct behavior.

s-c-p avatar Dec 14 '20 11:12 s-c-p

fuzz.partial_ratio searches for the best alignment of the shorter string to the longer string. It does not matter which way you insert them in as long as they do not have a similar length (for similar lengths the results can differ) In the code this is achieved by

    if len(s1) <= len(s2):
        shorter = s1
        longer = s2
    else:
        shorter = s2
        longer = s1

Since you mention that you match a key against a phrase I assume that the key always has to be shorter than the phrase, so you might be able to implement this the following way:

def your_scorer(s1, s2):
  if len(s1) > len(s2):
    return 0
  return fuzz.partial_ratio(s1, s2)

maxbachmann avatar Dec 16 '20 15:12 maxbachmann