PolyFuzz
TFIDF min_similarity not applied
When using the TFIDF model, the min_similarity parameter does not seem to be applied to the results.
Minimal Example that reproduces the problem (polyfuzz 0.4.0):
from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

if __name__ == "__main__":
    token_list = [
        "Stoltenbergs",
        "Ansage",
        "Putin",
        "Nato",
        "Drohungen",
        "Russlands",
        "Nato",
        "Unterstützung",
        "Ukraine",
        "Stoltenberg",
        "Putin",
        "Nato",
    ]
    matcher = TFIDF(n_gram_range=(3, 3), min_similarity=0.9)
    model = PolyFuzz(matcher)
    model.match(token_list)
    model.group()
    matches = model.get_matches()
    print(matches)
Running the code produces the output below, but rows 4 and 7 should have a Similarity score of 0, if I understand the documentation for min_similarity correctly:
The minimum similarity between strings, otherwise return 0 similarity
I would expect rows with a Similarity below 0.9 to have a Similarity of 0 and a To value of None.
Output:
    From           To             Similarity  Group
0   Stoltenbergs   Stoltenberg    0.932       Stoltenbergs
1   Ansage         None           0.000       None
2   Putin          Putin          1.000       Putin
3   Nato           Nato           1.000       Nato
4   Drohungen      Unterstützung  0.091       Unterstützung
5   Russlands      None           0.000       None
6   Nato           Nato           1.000       Nato
7   Unterstützung  Drohungen      0.091       Drohungen
8   Ukraine        None           0.000       None
9   Stoltenberg    Stoltenbergs   0.932       Stoltenbergs
10  Putin          Putin          1.000       Putin
11  Nato           Nato           1.000       Nato
In case I'm using the library wrong, how can I get only the results with a similarity higher than 0.9?
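As a stopgap, the threshold could presumably be applied after the fact to the DataFrame returned by get_matches(). This is only a minimal sketch, assuming the From, To and Similarity column names shown in the output above. apply_min_similarity is a hypothetical helper, not part of PolyFuzz, and the sample rows are copied from that output so the snippet runs with pandas alone:

```python
import pandas as pd


def apply_min_similarity(matches: pd.DataFrame, min_similarity: float) -> pd.DataFrame:
    """Hypothetical helper (not a PolyFuzz API): blank out weak matches.

    Rows whose Similarity is below the threshold get To=None and
    Similarity=0.0, which is the behaviour I expected from min_similarity.
    """
    result = matches.copy()  # leave the caller's DataFrame untouched
    below = result["Similarity"] < min_similarity
    result.loc[below, "To"] = None
    result.loc[below, "Similarity"] = 0.0
    return result


# A few rows copied from the output above (Group column omitted for brevity).
matches = pd.DataFrame(
    {
        "From": ["Stoltenbergs", "Ansage", "Putin", "Drohungen"],
        "To": ["Stoltenberg", None, "Putin", "Unterstützung"],
        "Similarity": [0.932, 0.0, 1.0, 0.091],
    }
)

filtered = apply_min_similarity(matches, 0.9)

# Alternatively, keep only the rows that pass the threshold.
kept = filtered[filtered["Similarity"] >= 0.9]
print(kept)
```

With the real library, the DataFrame from model.get_matches() would take the place of the hand-built sample. The Group column is not touched here and would need the same treatment if grouping is used.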