TFIDF min_similarity not applied

Open philkoch opened this issue 1 year ago • 4 comments

When using the TFIDF model, the min_similarity parameter does not seem to be applied to the results.

Minimal example that reproduces the problem (polyfuzz 0.4.0):

from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

if __name__ == "__main__":
    token_list = [
        "Stoltenbergs",
        "Ansage",
        "Putin",
        "Nato",
        "Drohungen",
        "Russlands",
        "Nato",
        "Unterstützung",
        "Ukraine",
        "Stoltenberg",
        "Putin",
        "Nato",
    ]

    matcher = TFIDF(n_gram_range=(3, 3), min_similarity=0.9)
    model = PolyFuzz(matcher)
    model.match(token_list)
    model.group()
    matches = model.get_matches()
    print(matches)

Running the code generates the output shown further down. If I understand the documentation for min_similarity correctly, rows 4 and 7 should have a Similarity score of 0:

"The minimum similarity between strings, otherwise return 0 similarity"

I would expect every row with a Similarity below 0.9 to have a Similarity of 0 and a To value of None.

Output:

             From             To  Similarity          Group
0    Stoltenbergs    Stoltenberg       0.932   Stoltenbergs
1          Ansage           None       0.000           None
2           Putin          Putin       1.000          Putin
3            Nato           Nato       1.000           Nato
4       Drohungen  Unterstützung       0.091  Unterstützung
5       Russlands           None       0.000           None
6            Nato           Nato       1.000           Nato
7   Unterstützung      Drohungen       0.091      Drohungen
8         Ukraine           None       0.000           None
9     Stoltenberg   Stoltenbergs       0.932   Stoltenbergs
10          Putin          Putin       1.000          Putin
11           Nato           Nato       1.000           Nato

If I'm using the library incorrectly, how can I get only results with a similarity higher than 0.9?
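For reference, the behaviour I expected can be approximated by post-processing the result table. This is only a sketch of a workaround, not a fix inside PolyFuzz. It assumes that get_matches() returns a pandas DataFrame with the From, To and Similarity columns shown in the output above. The helper names filter_matches and blank_weak_matches are made up for illustration, and the sample rows are copied by hand from that output.

```python
import pandas as pd


def filter_matches(matches: pd.DataFrame, min_similarity: float) -> pd.DataFrame:
    """Keep only the rows whose Similarity reaches the threshold."""
    keep = matches["Similarity"] >= min_similarity
    return matches[keep].reset_index(drop=True)


def blank_weak_matches(matches: pd.DataFrame, min_similarity: float) -> pd.DataFrame:
    """Keep every row, but reset weak matches to To=None and Similarity=0."""
    result = matches.copy()  # leave the caller's DataFrame untouched
    below = result["Similarity"] < min_similarity
    result.loc[below, "To"] = None
    result.loc[below, "Similarity"] = 0.0
    return result


# Hypothetical stand-in for model.get_matches(): four rows from the output above.
matches = pd.DataFrame(
    {
        "From": ["Stoltenbergs", "Ansage", "Putin", "Drohungen"],
        "To": ["Stoltenberg", None, "Putin", "Unterstützung"],
        "Similarity": [0.932, 0.0, 1.0, 0.091],
    }
)

strong = filter_matches(matches, 0.9)
blanked = blank_weak_matches(matches, 0.9)
print(strong)
print(blanked)
```

The first helper answers the "only results above the threshold" question by dropping rows. The second keeps the table the same length and rewrites the weak rows, which is how I read the documentation. Neither helper touches the Group column, so grouping would still need to be handled separately.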

philkoch, Oct 26 '22 16:10