BERTopic
Tracking the source index of new topics when merging models
Hi Maarten,
I'm playing with the merge_models feature, which is very useful, but I'm wondering if there is a way for a merged model to keep track of the index of new topics added to it from their original models.
One use case of this is if I have some other metadata relating to my topics before the merge, and I want to link that metadata to the topics after the merge.
At the moment I'm doing this:
from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset
import numpy as np
import pandas as pd
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
abstracts_1 = dataset["abstract"][:5_000]
abstracts_2 = dataset["abstract"][5_000:10_000]
# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_2)
# Attach a per-topic statistic (random values stand in for real metadata)
topic_info_1 = topic_model_1.get_topic_info()
topic_info_1['Topic stat'] = np.random.randint(1, 10, topic_info_1.shape[0])
topic_info_2 = topic_model_2.get_topic_info()
topic_info_2['Topic stat'] = np.random.randint(1, 10, topic_info_2.shape[0])
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2], min_similarity=0.9)
merged_info = merged_model.get_topic_info()
# Map the new topics in the merged model back to model 2 by parsing the topic number out of each topic's Name
new_topic_nums = merged_info['Name'][len(topic_info_1):]
new_topic_nums = new_topic_nums.str.split("_", n=1).str[0].astype('int')
all_old_stats = topic_info_1['Topic stat']
selected_new_stats = topic_info_2['Topic stat'].loc[topic_info_2['Topic'].isin(new_topic_nums)]
merged_stats = pd.concat([all_old_stats, selected_new_stats]).tolist()
merged_info['Topic stat'] = merged_stats
But this seems really hacky and gets really complicated when merging more than 2 models.
It would be great to have a dictionary or something similar that mapped the topic numbers for each sequential merge, e.g.
{
    "1": {  # merge 1
        "5": 53,  # topic num in original model: topic num in merged model
        "19": 54,
        ... },
    "2": {  # merge 2
        "7": 61,
        "12": 62,
        ... },
    ...
}
or something like that. I know there's already a topic mapper used for other purposes. Not sure if that could be utilised?
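To make the use case concrete, here is a minimal sketch of how such a dictionary could be consumed if it existed. Everything in it is hypothetical: `merge_map`, `stats_by_merge`, and `carry_over` are illustration names, not part of BERTopic, and the numbers are made up.

```python
# Hypothetical mapping in the shape proposed above:
# merge step -> {topic num in original model: topic num in merged model}
merge_map = {
    "1": {"5": 53, "19": 54},
    "2": {"7": 61, "12": 62},
}

# Hypothetical metadata for the model added at each merge step,
# keyed by that model's original topic numbers.
stats_by_merge = {
    "1": {"5": 0.8, "19": 0.3},
    "2": {"7": 0.5, "12": 0.9},
}


def carry_over(merge_map, stats_by_merge):
    """Re-key metadata from original topic numbers to merged topic numbers."""
    merged_stats = {}
    for step, topic_map in merge_map.items():
        for original, merged in topic_map.items():
            merged_stats[merged] = stats_by_merge[step][original]
    return merged_stats


merged_stats = carry_over(merge_map, stats_by_merge)
print(merged_stats)  # {53: 0.8, 54: 0.3, 61: 0.5, 62: 0.9}
```

With a mapping like this, linking metadata after a merge would be a plain lookup per merge step, however many models are merged.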
Good question. I believe that is currently not implemented. What you could do is use the resulting .topics_ variable to keep track of the topics that were assigned before and after merging, as I would think that would be more robust.
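A rough sketch of that idea, assuming you can get two per-document topic lists for the same documents in the same order: one from the original model and one from the merged model. Whether the merged model's .topics_ lines up with the original documents this way is an assumption you would need to verify for your version. The helper `map_topics` and the toy lists are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter, defaultdict


def map_topics(before, after):
    """Map each pre-merge topic to the post-merge topic most of its documents received."""
    if len(before) != len(after):
        raise ValueError("before and after must be aligned per document")
    votes = defaultdict(Counter)
    for old, new in zip(before, after):
        votes[old][new] += 1
    # Majority vote per original topic
    return {old: counts.most_common(1)[0][0] for old, counts in votes.items()}


# Toy data standing in for the original model's .topics_ and the
# matching per-document assignments after merging (assumed aligned).
before = [0, 0, 1, 1, 2, -1]
after = [0, 0, 5, 5, 6, -1]

mapping = map_topics(before, after)
print(mapping)  # {0: 0, 1: 5, 2: 6, -1: -1}
```

Because the mapping is derived from document assignments rather than parsed from topic names, the same function can be applied once per original model when more than two models are merged.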
Thanks Maarten, I'll play around with that