BERTopic
Problems with merging topics
I am having some problems with merging topics. When I call `merge_topics` and then visualize `topics_over_time`, it raises an `IndexError`. I think the cause is that the topics do not get updated after merging. How can I solve this problem?
Could you share your code for getting this error? Also, could you post the error output? That makes it a bit easier for me to understand the issue and how to properly fix it.
```python
sentence_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
hdbscan_model = HDBSCAN(min_cluster_size=200, metric='euclidean',
                        cluster_selection_method='eom',
                        prediction_data=True, min_samples=3)
umap_model = UMAP(n_neighbors=15, n_components=10, metric='cosine',
                  low_memory=False, random_state=17)
topic_model = BERTopic(embedding_model=sentence_model, diversity=0.3,
                       hdbscan_model=hdbscan_model, umap_model=umap_model,
                       top_n_words=30, nr_topics="auto", verbose=True)
topics, probs = topic_model.fit_transform(titles)

vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=stopWords,
                                   tokenizer=LemmaTokenizer(), min_df=15)
topic_model.update_topics(titles, topics, vectorizer_model=vectorizer_model)
```
```python
topics_to_merge = [[69, 83], [218, 117], [96, 213], [30, 34],
                   [28, 105, 177, 198, 205], [73, 125], [214, 22], [229, 46],
                   [8, 65], [86, 67], [154, 134], [229, 140], [154, 141],
                   [200, 167], [229, 172, 173, 185, 212],
                   [7, 87, 199, 210, 220], [8, 217], [93, 222], [76, 120],
                   [61, 110], [104, 159], [57, 207], [174, 6, 85], [9, 100],
                   [107, 68], [10, 231], [193, 99], [64, 80]]
topic_model.merge_topics(titles, topics, topics_to_merge)
```
```python
topics_over_time = topic_model.topics_over_time(titles, topics, timestamps, nr_bins=30)
```
This raises the following exception:
```
IndexError                                Traceback (most recent call last)
3 frames
/usr/local/lib/python3.7/dist-packages/bertopic/_bertopic.py in topics_over_time(self, docs, topics, timestamps, nr_bins, datetime_format, evolution_tuning, global_tuning)
    527         if global_tuning:
    528             selected_topics = [all_topics_indices[topic] for topic in documents_per_topic.Topic.values]
--> 529             c_tf_idf = (global_c_tf_idf[selected_topics] + c_tf_idf) / 2.0
    530
    531         # Extract the words per topic

/usr/local/lib/python3.7/dist-packages/scipy/sparse/_index.py in __getitem__(self, key)
     31         """
     32     def __getitem__(self, key):
---> 33         row, col = self._validate_indices(key)
     34         # Dispatch to specialized methods.
     35         if isinstance(row, INT_TYPES):

/usr/local/lib/python3.7/dist-packages/scipy/sparse/_index.py in _validate_indices(self, key)
    136             row += M
    137         elif not isinstance(row, slice):
--> 138             row = self._asindices(row, M)
    139
    140         if isintlike(col):

/usr/local/lib/python3.7/dist-packages/scipy/sparse/_index.py in _asindices(self, idx, length)
    168         max_indx = x.max()
    169         if max_indx >= length:
--> 170             raise IndexError('index (%d) out of range' % max_indx)
    171
    172         min_indx = x.min()

IndexError: index (233) out of range
```
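The out-of-range index is consistent with a stale topic list: after merging, the model's c-TF-IDF matrix has fewer rows, but the `topics` variable returned by `fit_transform` still contains the old ids (up to 233 here). The failing sparse lookup can be reproduced in isolation with made-up sizes:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Suppose merging reduced the model to 200 topic rows in the c-TF-IDF matrix
global_c_tf_idf = csr_matrix(np.random.rand(200, 50))

# ...while the stale `topics` list from fit_transform still references id 233
selected_topics = [0, 17, 233]

caught = None
try:
    _ = global_c_tf_idf[selected_topics]  # same lookup as line 529 above
except IndexError as err:
    caught = err
print(caught)
```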
Sorry for the late reply. That seems to be an issue with the current version of BERTopic. In the upcoming release, this should be fixed! I expect it to be released in the coming weeks.
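Until that release, one stopgap is to rebuild the topic list yourself from the merge groups, so the ids passed to `topics_over_time` match the merged model. A minimal sketch; the mapping logic here is an assumption for illustration, not BERTopic's internal implementation:

```python
def remap_topics(topics, topics_to_merge):
    """Map each old topic id to the first id of its merge group,
    then re-index the surviving ids to be contiguous (keeping -1 for outliers)."""
    merge_map = {}
    for group in topics_to_merge:
        keep = group[0]
        for old in group[1:]:
            merge_map[old] = keep
    merged = [merge_map.get(t, t) for t in topics]
    survivors = sorted({t for t in merged if t != -1})
    reindex = {old: new for new, old in enumerate(survivors)}
    reindex[-1] = -1
    return [reindex[t] for t in merged]

# Toy example: merge topic 2 into topic 1
print(remap_topics([0, 1, 2, 3, -1, 2], [[1, 2]]))  # [0, 1, 1, 2, -1, 1]
```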
Hi MaartenGr,
I want to apologize for asking numerous questions. I have another question: I want to merge topics from two different datasets using the code below, from the Tips and Tricks page https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html#finding-similar-topics-between-models
```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from sklearn.metrics.pairwise import cosine_similarity

sentence_model = SentenceTransformer("paraphrase-MiniLM-L12-v2")
umap_model = UMAP()

# Week 1
Chem1 = "Chemistry_dataset_for_week1"
chem1_model = BERTopic(embedding_model=sentence_model, umap_model=umap_model)
chem1_model.fit(Chem1)

# Week 2
Chem2 = "Chemistry_dataset_for_week2"
chem2_model = BERTopic(embedding_model=sentence_model, umap_model=umap_model)
chem2_model.fit(Chem2)

sim_matrix = cosine_similarity(chem1_model.topic_embeddings_,
                               chem2_model.topic_embeddings_)

for i in range(len(sim_matrix)):
    for j in range(len(sim_matrix[0])):
        if sim_matrix[i][j] > 0.9:
            # print(i, j, sim_matrix[i][j])
            topics_to_merge = [i, j]
            new_topic = topic_model.merge_topics(docs, topics_to_merge)
```

But this function works when we have the same dataset; I am working on different datasets, and this merging process will continue for more weeks.
Expected output: Chem1_model has topics [T1, T2, T3, ...] and Chem2_model has topics [T4, T5, T6, ...]. Suppose T1 and T5 are similar; merge these two into a new topic [T1,5]. I need this similarity based on the embeddings, not on pattern matching.
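Leaving the merge step aside, the pairing itself can be sketched independently of BERTopic, operating directly on the two topic-embedding matrices. The toy arrays below stand in for `chem1_model.topic_embeddings_` and `chem2_model.topic_embeddings_`:

```python
import numpy as np

def similar_topic_pairs(emb_a, emb_b, threshold=0.9):
    """Return (i, j, similarity) for every topic i of model A and topic j
    of model B whose embeddings exceed the cosine-similarity threshold."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T  # full cross-model similarity, shape (n_topics_a, n_topics_b)
    return [(i, j, float(sim[i, j]))
            for i in range(sim.shape[0])
            for j in range(sim.shape[1])
            if sim[i, j] > threshold]

# Toy stand-ins for the two models' topic embeddings
emb_week1 = np.array([[1.0, 0.0], [0.0, 1.0]])
emb_week2 = np.array([[0.99, 0.1], [0.0, 1.0]])
pairs = similar_topic_pairs(emb_week1, emb_week2)
print(pairs)  # pairs (0, 0) and (1, 1) cross the 0.9 threshold
```

Note that, unlike the loop above, this iterates over all rows and columns (the `range(0, len(sim_matrix)-1)` bounds skip the last topic of each model and index the columns by the wrong dimension).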
I hope I could make my question understandable.
Could you please suggest a method for doing this? I am sorry if I am asking too silly a question, and for my bad coding.
Thank you!
@rubypnchl Merging topics from two different models is currently not possible. If you follow along with the description of BERTopic's algorithm, this quickly becomes clear: we would have to combine two HDBSCAN models, two UMAP models, two CountVectorizer representations, and so on.
Instead, since you want to continuously add data to the model, you can look into incremental/online BERTopic instead. It is a method that allows you to add new data to the model whenever you want.
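The incremental pattern looks roughly like the sketch below: components that support `.partial_fit` are updated batch by batch, so every week's documents land in one shared topic space. The stand-in models here (random data, `IncrementalPCA` in place of UMAP, `MiniBatchKMeans` in place of HDBSCAN) are illustrative choices, not the library's defaults:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(17)
ipca = IncrementalPCA(n_components=5)                  # stands in for UMAP
clusterer = MiniBatchKMeans(n_clusters=3, n_init=3,    # stands in for HDBSCAN
                            random_state=17)

for week in range(4):
    # placeholder for this week's sentence embeddings
    embeddings = rng.normal(size=(200, 50))
    reduced = ipca.partial_fit(embeddings).transform(embeddings)
    clusterer.partial_fit(reduced)

# assignments for new documents, in the same topic space every week
labels = clusterer.predict(ipca.transform(rng.normal(size=(10, 50))))
```

With this setup there is nothing to merge afterwards: similar documents from week 1 and week 2 simply fall into the same cluster.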
Thank you so much for your kind reply. Actually, my problem is somewhat different, but I now have a fair idea of the topic-merging concept. I will try to do it another way.