KeyBERT KeyPhrases Not Printing top 10.

Hello,

I am trying to print the top 10 keyphrases from DatasetA['Description'], a column with 4k text entries. However, when I print the keyphrases I get a list of all 3- to 6-gram phrases, in no specific order. How do I ensure that only the top 10 are printed? Furthermore, how can I print only dissimilar phrases (diversity)? Thoughts?

from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

doc = DatasetA['Description']
kw_model = KeyBERT('distilbert-base-nli-mean-tokens')
keywords = kw_model.extract_keywords(doc)

kw_model.extract_keywords(docs=doc, vectorizer=KeyphraseCountVectorizer())
#kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 6), stop_words=None, use_mmr=True, top_n=10)
keyphrases = kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 6), stop_words='english', use_maxsum=True, top_n=10)
for keyphrase in keyphrases:
    print(keyphrase)

Mar 06 '23 23:03 bthapa94

To use diversity, you would have to set use_mmr=True together with diversity=0.5, or something higher, to diversify the output. Furthermore, the model should return top_n keyphrases if the document contains at least top_n candidate keyphrases. If it does not, fewer will be returned.
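As a rough sketch of what that call could look like: the helper name diverse_keyphrases and the default diversity=0.7 are made up for illustration, while the keyword arguments are the ones already used in this thread. It takes the KeyBERT model as a parameter so the snippet does not load a model by itself.

```python
def diverse_keyphrases(kw_model, text, top_n=10, diversity=0.7):
    """Return up to top_n (phrase, score) pairs for one document.

    kw_model is expected to be a keybert.KeyBERT instance.
    The helper name and the default diversity are illustrative only.
    """
    return kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(3, 6),  # 3- to 6-gram phrases, as in the question
        stop_words="english",
        use_mmr=True,                  # Maximal Marginal Relevance
        diversity=diversity,           # higher values diversify the output more
        top_n=top_n,
    )
```

It would be called as diverse_keyphrases(KeyBERT(), text), where text is a single string. Note that use_mmr replaces use_maxsum here, since diversity is used together with use_mmr=True.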

Mar 07 '23 07:03 MaartenGr

Please see the output below. It is printing almost everything without sorting...thoughts?

Mar 07 '23 19:03 bthapa94

Based on your warning, did you make sure that you are using the most recent version of KeyBERT? The most current version is v0.7.

Mar 09 '23 12:03 MaartenGr

So, if you use .tolist(), it will print the top 10 for every row, whereas ' '.join will yield the top 10 for the entire document.

text = ' '.join(DatasetA['Description']) vs. DatasetA['Description'].tolist()
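If you prefer to keep the per-row output from .tolist() but still want one overall top 10, a minimal sketch is to merge the per-row results yourself. It assumes the per-row result is a list of lists of (phrase, score) tuples; the function name global_top_n is hypothetical. Each score is relative to its own row, so comparing scores across rows is only a rough heuristic.

```python
def global_top_n(per_doc_keywords, top_n=10):
    """Merge per-row keyphrase lists into one ranked list.

    per_doc_keywords: list of lists of (phrase, score) tuples, one inner
    list per row. Keeps the best score seen for each phrase and returns
    the top_n phrases sorted by that score, highest first.
    """
    best = {}
    for doc_keywords in per_doc_keywords:
        for phrase, score in doc_keywords:
            if score > best.get(phrase, float("-inf")):
                best[phrase] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Toy per-row results standing in for real model output.
per_row = [
    [("slow battery drain issue", 0.71), ("battery drain", 0.64)],
    [("battery drain", 0.80), ("screen flicker on boot", 0.55)],
]
top_two = global_top_n(per_row, top_n=2)
```

Printing each tuple of the returned list then gives a sorted overall top 10 instead of 10 phrases per row.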

Another question: how do I get the bottom 10? Do you recommend setting diversity to 1, or close to 1?

Mar 09 '23 22:03 bthapa94

You cannot get the bottom 10, as only the top keywords are returned. With diversity=1, lower-ranked words may move up, but there is no guarantee that you will get the bottom 10. Most likely you will still get many high-scoring keywords, since that is the typical use case for keyword extraction. If you want the bottom 10, those are typically stop words like "the", "and", "I", etc.
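To illustrate why diversity=1 gives no such guarantee, here is a simplified pure-Python sketch of Maximal Marginal Relevance selection. It is not KeyBERT's actual code: the function names cosine and mmr_select, the toy 2-D vectors, and the scoring rule (1 - diversity) * relevance - diversity * redundancy are assumptions that follow the common form of MMR.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(doc_vec, candidates, top_n=10, diversity=0.5):
    """Pick up to top_n phrases, trading relevance against redundancy.

    candidates maps phrase -> embedding vector (toy values here).
    Returns the chosen phrases in selection order.
    """
    if not candidates:
        return []
    relevance = {p: cosine(v, doc_vec) for p, v in candidates.items()}
    # The first pick is always the most relevant phrase.
    selected = [max(relevance, key=relevance.get)]
    remaining = [p for p in candidates if p != selected[0]]
    while remaining and len(selected) < top_n:
        def mmr_score(p):
            # Redundancy: similarity to the closest already-selected phrase.
            redundancy = max(cosine(candidates[p], candidates[s]) for s in selected)
            return (1 - diversity) * relevance[p] - diversity * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


doc = [1.0, 0.0]
toy = {
    "battery drain": [1.0, 0.1],        # very relevant
    "battery draining": [1.0, 0.2],     # relevant, near-duplicate of the first
    "unrelated phrase": [0.0, 1.0],     # not relevant, but very different
}
picks_relevant = mmr_select(doc, toy, top_n=2, diversity=0.0)
picks_diverse = mmr_select(doc, toy, top_n=2, diversity=1.0)
```

In this sketch the first pick is always the most relevant phrase, and with diversity=1 the later picks are simply the phrases least similar to what is already selected. Those may be low-scoring phrases, but nothing forces them to be the lowest-scoring ones.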

Mar 10 '23 06:03 MaartenGr
