Deduplication threshold changes the order of the response tuples

Open josemarcosrf opened this issue 2 years ago • 0 comments

I've noticed the following behavior of the .extract_keywords function:

When using a deduplication threshold (dedupLim) lower than 1, the response tuples are of the form (word, score). e.g.:

('non-profit', 0.18087033619667015)
('social', 0.21178928326651927)
('media', 0.21178928326651927)
('handle', 0.28189161752425324)

However, when dedupLim is equal to or greater than 1, the order is reversed to (score, word):

(0.18087033619667015, 'non-profit')
(0.21178928326651927, 'social')
(0.21178928326651927, 'media')
(0.28189161752425324, 'handle')

Below is the sample code that produces the above outputs (as written, with dedupLim=1, it prints the second form; a value below 1 prints the first):

import yake

text = 'I handle social media for a non-profit. Should I start going to social media networking events? Are there any good ones in the bay area?'

kw_extractor = yake.KeywordExtractor(lan="en", n=1, dedupLim=1, top=4, features=None)
keywords = kw_extractor.extract_keywords(text)
for kw in keywords:
    print(kw)

The issue seems to stem from the difference between these two lines, which appear to build the result tuples in opposite orders: yake.py#L71 and yake.py#L85
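
Until this is fixed, a caller-side workaround is possible. The following is only a minimal sketch, assuming the sole difference between the two code paths is the tuple order; normalize_keywords is a hypothetical helper name and not part of yake. It checks which element is the string and always returns (word, score):

```python
def normalize_keywords(keywords):
    """Return (word, score) pairs regardless of the input tuple order.

    Hypothetical helper, not part of yake: accepts both the
    (word, score) and the (score, word) forms shown above.
    """
    normalized = []
    for first, second in keywords:
        if isinstance(first, str):
            # Already in (word, score) order.
            normalized.append((first, float(second)))
        else:
            # (score, word) order: swap the elements.
            normalized.append((second, float(first)))
    return normalized


# The two forms reported above, truncated to two entries each.
below_one = [('non-profit', 0.18087033619667015), ('social', 0.21178928326651927)]
one_or_more = [(0.18087033619667015, 'non-profit'), (0.21178928326651927, 'social')]

# Both inputs normalize to the same (word, score) list.
assert normalize_keywords(below_one) == normalize_keywords(one_or_more)
```

In the sample code, this would be applied as normalize_keywords(kw_extractor.extract_keywords(text)), so downstream code no longer depends on the dedupLim value.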

Happy to submit a PR to fix it if that would be of any help.

josemarcosrf, Apr 14 '22 12:04