min_topic_size is not impacting the number of topics in latest version 0.12.0
Hi MaartenGr,
Since I upgraded BERTopic (for the cosine similarity between topics), the parameter min_topic_size no longer has any impact on the number of topics. If I understood it right, min_topic_size is the minimum number of documents that must be part of a topic. However, I tried multiple runs with different document set sizes [10, 100, 1000] and min_topic_size values [1, 5, 10, 50, 100] for each set, but it shows no effect on the number of topics.
Example: with number of documents = 100 and min_topic_size = 1000, the number of topics is 19.
I do not understand where the issue is. Please look into this.
Many Thanks
Could you share your code for training BERTopic? The code would help me identify where the issue might stem from. The parameter you refer to was not changed between versions, I believe, so I would not expect any issues there.
I am performing parameter tuning. In the earlier version, min_topic_size showed an impact, but not anymore, and I have kept all parameters the same between versions except min_topic_size.
```python
from sentence_transformers import SentenceTransformer
from hdbscan import HDBSCAN
from umap import UMAP
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from keyphrase_vectorizers import KeyphraseCountVectorizer

sentence_model = SentenceTransformer("paraphrase-MiniLM-L12-v2")
hdbscan_model = HDBSCAN(min_cluster_size=3, metric='euclidean',
                        cluster_selection_method='eom',
                        prediction_data=True, min_samples=1)
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True, reduce_frequent_words=True)
umap_model = UMAP(n_neighbors=5, n_components=5, min_dist=0.0, metric='cosine')
# assumed: a KeyphraseCountVectorizer, matching the import above
vectorizer_model = KeyphraseCountVectorizer()

topic_model = BERTopic(ctfidf_model=ctfidf_model,
                       embedding_model=sentence_model,
                       top_n_words=10,
                       verbose=True,
                       min_topic_size=1000,
                       vectorizer_model=vectorizer_model,
                       low_memory=True,
                       calculate_probabilities=False,
                       diversity=0.2,
                       hdbscan_model=hdbscan_model,
                       umap_model=umap_model)
print("modeling done")

# abstracts: the list of document strings (100 abstracts here)
topics, probs = topic_model.fit_transform(abstracts)
print("prob calculated")
topic_model.save("topic_model_hdbscan_sent01")

# note: get_topic_info() includes the outlier topic -1 as a row
print("number of topics in the dataset", len(topic_model.get_topic_info()))
```
Here, the number of abstracts is 100.
When you use min_topic_size you are essentially setting the min_cluster_size parameter in HDBSCAN. So if you are using a custom HDBSCAN model, min_topic_size is not used and is replaced by the min_cluster_size parameter in your HDBSCAN model. In other words, they are the same parameter, and the one in the HDBSCAN model overwrites the one you set in BERTopic.
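A minimal sketch of the two cases (the parameter values here are placeholders, not recommendations):

```python
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Case 1: no custom clustering model. min_topic_size is forwarded to
# HDBSCAN's min_cluster_size internally, so it controls topic size.
topic_model = BERTopic(min_topic_size=50)

# Case 2: custom HDBSCAN model. Its min_cluster_size takes precedence,
# so the min_topic_size argument below is overridden and has no effect.
hdbscan_model = HDBSCAN(min_cluster_size=50, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model, min_topic_size=1000)
```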
I agree with you: min_topic_size is replaced by min_cluster_size in the new version. But in the earlier version it had its own impact on the number of topics, because I performed several experiments taking min_cluster_size, min_topic_size, min_samples, n_neighbors, and n_components into consideration to reduce the percentage of outliers :D
By the way, thank you so much for clearing that up. You have seriously developed a fantastic library!!!
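For reference, a minimal sketch of how that outlier percentage can be measured, assuming `topics` is the list returned by `fit_transform`:

```python
# Topic -1 collects the documents HDBSCAN could not assign to any cluster.
n_outliers = sum(1 for t in topics if t == -1)
print(f"outliers: {n_outliers}/{len(topics)} ({100 * n_outliers / len(topics):.1f}%)")
```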