KeyPhrases Not Printing top 10.

Open bthapa94 opened this issue 1 year ago • 5 comments

Hello,

I am trying to print the top 10 key phrases from DatasetA['Description'], a column with 4k text entries. However, when I print the keyphrases I get a list of all 3-6-gram phrases, in no specific order. How do I ensure that only the top 10 are printed? Furthermore, how can I print only dissimilar phrases (diversity)? Thoughts?

from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

doc = DatasetA['Description']
kw_model = KeyBERT('distilbert-base-nli-mean-tokens')
keywords = kw_model.extract_keywords(doc)

kw_model.extract_keywords(docs=doc, vectorizer=KeyphraseCountVectorizer())
#model.extract_keyphrases(doc, keyphrase_ngram_range=(3, 6), stop_words=None, use_mmr=True, top_n=10)
keyphrases = kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 6), stop_words='english', use_maxsum=True, top_n=10)
for keyphrase in keyphrases:
    print(keyphrase)

bthapa94 • Mar 06 '23 23:03

To diversify the output, use use_mmr=True together with diversity=0.5 or higher. Furthermore, the model should return top_n keyphrases as long as the document contains at least top_n candidate keyphrases. If it does not, fewer will be returned.

MaartenGr • Mar 07 '23 07:03
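
For readers unfamiliar with what the diversity setting trades off, here is a minimal pure-Python sketch of maximal marginal relevance (MMR) selection. It is an illustration of the general idea, not KeyBERT's actual implementation: the function name mmr_select, the candidate phrases, and all similarity values are made up for the example. In KeyBERT itself the behaviour is requested through the use_mmr, diversity, and top_n arguments mentioned above.

```python
def mmr_select(doc_sim, cand_sim, top_n=10, diversity=0.5):
    """Pick up to top_n candidate indices, balancing relevance and novelty.

    doc_sim[i]     -- similarity of candidate i to the document (assumed given)
    cand_sim[i][j] -- similarity between candidates i and j (assumed given)
    """
    if not doc_sim:
        return []
    remaining = list(range(len(doc_sim)))
    # Start with the single most relevant candidate.
    first = max(remaining, key=lambda i: doc_sim[i])
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < top_n:
        def score(i):
            # Penalise candidates that resemble something already chosen.
            redundancy = max(cand_sim[i][j] for j in selected)
            return (1 - diversity) * doc_sim[i] - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical candidates and similarity scores, for illustration only.
phrases = ["machine learning model", "machine learning models", "customer support ticket"]
doc_sim = [0.90, 0.88, 0.60]
cand_sim = [
    [1.00, 0.95, 0.20],
    [0.95, 1.00, 0.25],
    [0.20, 0.25, 1.00],
]

low = mmr_select(doc_sim, cand_sim, top_n=2, diversity=0.0)
high = mmr_select(doc_sim, cand_sim, top_n=2, diversity=0.7)
print([phrases[i] for i in low])   # the two near-duplicate phrases
print([phrases[i] for i in high])  # the top phrase plus the dissimilar one
```

With diversity=0.0 the two near-identical phrases win on relevance alone; with diversity=0.7 the second slot goes to the dissimilar phrase instead. Asking for more phrases than there are candidates simply returns every candidate, which matches the note above that fewer than top_n may come back.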

Please see the output below. It is printing almost everything, without any sorting. Thoughts?

[Image: Screenshot 2023-03-07 145102]

bthapa94 • Mar 07 '23 19:03

Based on the warning in your output, did you make sure that you are using the most recent version of KeyBERT? The most current version is v0.7.

MaartenGr • Mar 09 '23 12:03

So, if you use .tolist(), it will print the top 10 for every row, whereas ' '.join will yield the top 10 for the entire document.

text = ' '.join(DatasetA['Description'])
vs.
DatasetA['Description'].tolist()

Another question: how do I gather the bottom 10? Do you recommend setting diversity to 1, or close to 1?

bthapa94 • Mar 09 '23 22:03
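
The difference between passing a list of rows and passing one joined string comes down to the shape of the result: one ranked list per row versus a single ranked list. The sketch below illustrates only that shape. The toy_extract function is a hypothetical stand-in that ranks words by raw count; it is not KeyBERT and does not reproduce its scoring, and the sample rows are invented.

```python
from collections import Counter


def toy_extract(docs, top_n=10):
    """Hypothetical stand-in for a keyword extractor (ranks words by count).

    A single string yields one ranked list; a list of strings yields one
    ranked list per entry, mirroring the behaviour described in the thread.
    """
    def top_terms(text):
        return Counter(text.lower().split()).most_common(top_n)

    if isinstance(docs, str):
        return top_terms(docs)
    return [top_terms(d) for d in docs]


# Invented rows standing in for a text column.
rows = ["printer driver crash", "printer offline again", "driver update failed"]

per_row = toy_extract(rows, top_n=2)            # one result list per row
whole = toy_extract(" ".join(rows), top_n=2)    # one result list overall

for result in per_row:
    print(result)
print(whole)
```

With 4k rows, looping over the per-row result prints 4k separate top-10 lists, which can look like "almost everything, without sorting" even though each individual list is ranked.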

You cannot get the bottom 10, as only the top keywords are returned. With diversity=1 there is a chance that lower-ranked words move up, but there is no guarantee that you will get all of the bottom 10. Most likely you will still get many high-scoring keywords, since that is the typical use case for keyword extraction. If you do want the bottom 10, those are typically stop words like "the", "and", "I", etc.

MaartenGr • Mar 10 '23 06:03