
No scores when candidates parameter is added

Open AroundtheGlobe opened this issue 1 year ago • 2 comments

No scores are returned when you pass the candidates parameter to KeyBERT's extract_keywords():

from keybert import KeyBERT

doc = """
         Kos. Griekenland staat bekend om de prachtige eilanden waar je terecht kan voor zonovergoten vakanties.
      """
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, candidates=['Griekenland', 'Kos'])

This shows the following warning:

\venv\lib\site-packages\sklearn\feature_extraction\text.py:1369: UserWarning: Upper case characters found in vocabulary while 'lowercase' is True. These entries will not be matched with any documents
  warnings.warn(

and the keywords variable is returned empty.
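The mechanism behind the warning can be reproduced with scikit-learn alone. The sketch below is a minimal illustration, not KeyBERT's actual code path: it assumes the candidates end up as the `vocabulary` of a `CountVectorizer`, as in the snippet further down. The helper name `count_matches` and the shortened document are made up for this example.

```python
import warnings

from sklearn.feature_extraction.text import CountVectorizer

# Shortened version of the document from the report.
docs = ["Kos. Griekenland staat bekend om de prachtige eilanden."]
candidates = ["Griekenland", "Kos"]


def count_matches(lowercase):
    """Return how many times the candidates are found in docs.

    Hypothetical helper for illustration only.
    """
    with warnings.catch_warnings():
        # Silence the "Upper case characters found in vocabulary" warning.
        warnings.simplefilter("ignore", UserWarning)
        vectorizer = CountVectorizer(vocabulary=candidates, lowercase=lowercase)
        return int(vectorizer.fit_transform(docs).sum())


# With lowercase=True (the default) the text is lowercased before matching,
# so capitalized vocabulary entries can never match a token.
matches_default = count_matches(lowercase=True)

# With lowercase=False the original casing is kept and both entries match.
matches_cased = count_matches(lowercase=False)

print(matches_default, matches_cased)
```

If the candidates never match a token, there is nothing to score, which would be consistent with the empty result described above.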

Without the candidates parameter, it does return results with scores:

keywords = kw_model.extract_keywords(doc)

Result: [('griekenland', 0.5619), ('zonovergoten', 0.5024), ('bekend', 0.4398), ('prachtige', 0.4118), ('terecht', 0.4039)]

When I change the candidate words to lowercase, or when I add lowercase=False to the CountVectorizer, it returns the words with scores as expected:

keywords = kw_model.extract_keywords(doc, candidates=['griekenland', 'kos'])
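If the original capitalization matters downstream, one possible workaround is to lowercase the candidates before the call and map the results back afterwards. This is only a sketch: `lowercase_candidates` and `restore_case` are hypothetical helpers, not part of KeyBERT, and the sample scores are placeholders rather than real output.

```python
def lowercase_candidates(candidates):
    """Map each lowercased candidate to its original spelling."""
    return {candidate.lower(): candidate for candidate in candidates}


def restore_case(keywords, original_by_lower):
    """Replace lowercased keywords with their original spelling, keeping scores."""
    return [(original_by_lower.get(word, word), score) for word, score in keywords]


original_by_lower = lowercase_candidates(["Griekenland", "Kos"])

# Lowercased list to pass as the candidates argument.
lowered = list(original_by_lower)

# Placeholder result in the (keyword, score) shape shown in this report;
# the scores are invented for illustration.
sample_keywords = [("griekenland", 0.5), ("kos", 0.25)]
restored = restore_case(sample_keywords, original_by_lower)
print(lowered, restored)
```

With KeyBERT this would be used as `restore_case(kw_model.extract_keywords(doc, candidates=lowered), original_by_lower)`. Note that two candidates differing only in case would collapse into one entry.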

In version 0.6.0 of KeyBERT, capitalized candidate words were not an issue.

count = CountVectorizer(
    ngram_range=keyphrase_ngram_range,
    stop_words=stop_words,
    min_df=min_df,
    vocabulary=candidates,
    lowercase=False,  # added: keep the original casing of the candidates
).fit(docs)

Strangely enough, it does work in one of the virtual environments I've been using for a while, but I can't get it to work in newly installed environments, even when I install the same package versions. I suspected the bug was in one of the installed packages, but that does not seem to be the case.

AroundtheGlobe · Dec 14 '22 14:12