
min_topic_size is not impacting the number of topics in latest version (0.12.0)

Open rubypnchl opened this issue 2 years ago • 4 comments

Hi MaartenGr,

Since upgrading BERTopic (for cosine similarity between topics), the parameter min_topic_size no longer has any impact on the number of topics. If I understand it correctly, min_topic_size is the minimum number of documents required to form a topic. However, I tried multiple runs with different document-set sizes [10, 100, 1000] and min_topic_size values [1, 5, 10, 50, 100] for each set, and it has no effect on the number of topics. For example: with 100 documents and min_topic_size=1000, the number of topics is 19.

I do not understand where the issue is. Please look into this.

Many Thanks

rubypnchl avatar Sep 17 '22 20:09 rubypnchl

Could you share your code for training BERTopic? The code would help me identify where the issue might stem from. The parameter you refer to was not changed between versions I believe so I would not expect any issues there.

MaartenGr avatar Sep 17 '22 20:09 MaartenGr

> Could you share your code for training BERTopic? The code would help me identify where the issue might stem from. The parameter you refer to was not changed between versions I believe so I would not expect any issues there.

I am performing parameter tuning. Earlier, min_topic_size had a visible impact, but it no longer does, and all parameters other than min_topic_size are the same between versions.

```python
from sentence_transformers import SentenceTransformer
from hdbscan import HDBSCAN
from bertopic import BERTopic
from umap import UMAP
from keyphrase_vectorizers import KeyphraseCountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer

sentence_model = SentenceTransformer("paraphrase-MiniLM-L12-v2")
hdbscan_model = HDBSCAN(min_cluster_size=3, metric='euclidean',
                        cluster_selection_method='eom',
                        prediction_data=True, min_samples=1)
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True, reduce_frequent_words=True)
umap_model = UMAP(n_neighbors=5, n_components=5, min_dist=0.0, metric='cosine')
vectorizer_model = KeyphraseCountVectorizer()  # instantiation was missing in the original snippet

topic_model = BERTopic(ctfidf_model=ctfidf_model, embedding_model=sentence_model,
                       top_n_words=10, verbose=True, min_topic_size=1000,
                       vectorizer_model=vectorizer_model, low_memory=True,
                       calculate_probabilities=False, diversity=0.2,
                       hdbscan_model=hdbscan_model, umap_model=umap_model)
print("modeling done")

topics, probs = topic_model.fit_transform(abstracts)
print("prob calculated")
topic_model.save("topic_model_hdbscan_sent01")

print("number of topics in the dataset", len(topic_model.get_topic_info()))
```

Here, the number of abstracts is 100.

rubypnchl avatar Sep 17 '22 20:09 rubypnchl

When you use min_topic_size, you are essentially setting the min_cluster_size parameter in HDBSCAN. So if you are using a custom HDBSCAN model, min_topic_size is ignored in favor of the min_cluster_size parameter of your HDBSCAN model. In other words, they are the same parameter, and the one in the HDBSCAN model overrides the one you set in BERTopic.
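To make the override concrete, here is a minimal, simplified sketch of this selection logic (not the actual BERTopic source; `choose_cluster_model` and `FakeHDBSCAN` are illustrative names): min_topic_size only flows into min_cluster_size when no custom clustering model is supplied.

```python
class FakeHDBSCAN:
    """Stand-in for hdbscan.HDBSCAN, holding only min_cluster_size."""
    def __init__(self, min_cluster_size=10):
        self.min_cluster_size = min_cluster_size

def choose_cluster_model(min_topic_size=10, hdbscan_model=None):
    # A user-supplied model wins; min_topic_size is silently ignored then.
    if hdbscan_model is not None:
        return hdbscan_model
    # Otherwise min_topic_size becomes the default model's min_cluster_size.
    return FakeHDBSCAN(min_cluster_size=min_topic_size)

# No custom model: min_topic_size=1000 flows into min_cluster_size.
assert choose_cluster_model(min_topic_size=1000).min_cluster_size == 1000

# Custom model (as in the snippet above): min_topic_size=1000 is ignored,
# and the model's own min_cluster_size=3 is what actually takes effect.
custom = FakeHDBSCAN(min_cluster_size=3)
assert choose_cluster_model(min_topic_size=1000, hdbscan_model=custom).min_cluster_size == 3
```

So in the snippet above, the HDBSCAN model's min_cluster_size=3 is what actually controls the topic granularity, which is why changing min_topic_size has no effect.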

MaartenGr avatar Sep 18 '22 02:09 MaartenGr

> When you use min_topic_size you are essentially setting the min_cluster_size parameter in HDBSCAN. So if you are using a custom HDBSCAN model, min_topic_size is not used and replaced by the min_cluster_size parameter in your HDBSCAN model. In other words, they are the same parameter and the one in the HDBSCAN model overwrites the one you set in BERTopic.

I agree with you that min_topic_size is replaced by min_cluster_size in the new version, but in the earlier version it had its own impact on the number of topics. I performed several experiments varying min_cluster_size, min_topic_size, min_samples, n_neighbors, and n_components to reduce the percentage of outliers :D

By the way, thank you so much for clearing that up. You have seriously developed a fantastic library!!!

rubypnchl avatar Sep 18 '22 10:09 rubypnchl