KeyBERT
KeyPhrases Not Printing top 10.
Hello,
I am trying to print the top 10 key phrases from DatasetA['Description'], a column with 4k text entries. However, printing each keyphrase gives me a list of all 3-6-gram phrases, in no specific order. How do I ensure only the top 10 are printed? Furthermore, how can I print only dissimilar phrases (diversity)? Thoughts?
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

doc = DatasetA['Description']
model = KeyBERT('distilbert-base-nli-mean-tokens')
keywords = model.extract_keywords(doc)
model.extract_keywords(docs=doc, vectorizer=KeyphraseCountVectorizer())
#model.extract_keyphrases(doc, keyphrase_ngram_range=(3, 6), stop_words=None, use_mmr=True, top_n=10)
keyphrases = model.extract_keywords(doc, keyphrase_ngram_range=(3, 6), stop_words='english', use_maxsum=True, top_n=10)
for keyphrase in keyphrases:
    print(keyphrase)
To use diversity, you would have to use use_mmr=True together with diversity=0.5 or something higher to diversify the output. Furthermore, the model should return the top_n keyphrases if there are at least top_n keyphrases in the document. If not, fewer will be returned.
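As a minimal sketch of that advice, the helper below passes use_mmr=True and a diversity value to extract_keywords, reusing the n-gram range and stop-word setting from the question. The helper name extract_diverse_keyphrases is hypothetical, and kw_model is assumed to be a KeyBERT instance and text a single string.

```python
def extract_diverse_keyphrases(kw_model, text, top_n=10, diversity=0.5):
    """Ask a KeyBERT-style model for up to `top_n` phrases, diversified with MMR.

    `kw_model` is assumed to expose KeyBERT's `extract_keywords` method.
    """
    return kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(3, 6),  # same n-gram range as in the question
        stop_words="english",
        use_mmr=True,                  # Maximal Marginal Relevance, not use_maxsum
        diversity=diversity,           # higher values favor phrases unlike those already picked
        top_n=top_n,
    )


# Usage with the real library (commented out so the sketch has no hard dependency):
# from keybert import KeyBERT
# kw_model = KeyBERT("distilbert-base-nli-mean-tokens")
# for phrase, score in extract_diverse_keyphrases(kw_model, text, diversity=0.7):
#     print(phrase, score)
```

Each returned item is a (phrase, score) pair, so the loop in the usage comment prints at most top_n lines for a single input string.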
Please see the output below. It is printing almost everything without sorting...thoughts?
[Image: Screenshot 2023-03-07 145102]
Based on your warning, did you make sure that you are using the most recent version of KeyBERT? The most current version is v0.7.
So, if you use .tolist(), it will print the top 10 for every row, whereas ' '.join(...) will yield the top 10 for the entire document.
text = ' '.join(DatasetA['Description']) vs. DatasetA['Description'].tolist()
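The difference comes down to the shape of the input: a single string is treated as one document, while a list of strings is treated as one document per row. The sketch below prepares both shapes from a column of text. The helper names are hypothetical, and skipping non-string rows such as NaN is an added assumption that the one-liner above does not make.

```python
def as_single_document(descriptions):
    """Join all text rows into one string; non-string rows (e.g. NaN) are skipped."""
    return " ".join(d for d in descriptions if isinstance(d, str) and d.strip())


def as_separate_documents(descriptions):
    """Keep one string per row; non-string rows (e.g. NaN) are skipped."""
    return [d for d in descriptions if isinstance(d, str) and d.strip()]


rows = ["red apple pie", float("nan"), "green apple tart"]
print(as_single_document(rows))     # one string  -> one ranked list for the whole column
print(as_separate_documents(rows))  # list of str -> one ranked list per row
```

Passing the first result to extract_keywords would give a single top_n list, while the second would give a list of top_n lists, one per entry.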
Another question: how do I gather the bottom 10? Would you recommend setting diversity to 1, or close to 1?
You cannot get the bottom 10, as only the top words are provided. There is a chance of lower-ranked words moving up with diversity=1, but there is no guarantee that you will get all of the bottom 10. Most likely, you will still get many high-scoring keywords, since that is the typical use case for keyword extraction. If you want the bottom 10, those are typically stop words like "the", "and", "I", etc.
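If the lowest-scoring candidates are still of interest, one possible workaround, which is not from this thread, is to sort a full list of scored phrases in ascending order yourself. It rests on the assumption that a top_n larger than the number of candidate phrases (with plain scoring, no use_maxsum) returns every candidate with its score. The helper name lowest_scoring and the sample scores below are hypothetical.

```python
def lowest_scoring(keyphrases, n=10):
    """Return the n (phrase, score) pairs with the smallest scores, lowest first."""
    return sorted(keyphrases, key=lambda pair: pair[1])[:n]


# Made-up scores standing in for the output of
# extract_keywords(text, top_n=<a number larger than the candidate count>).
scored = [("apple pie recipe", 0.71), ("the and of", 0.05), ("baking time", 0.42)]
print(lowest_scoring(scored, n=2))
```

Sorting explicitly avoids relying on the order in which the library returns its results.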