
Tracking the source index of new topics when merging models

Open zilch42 opened this issue 10 months ago • 2 comments

Hi Maarten,

I'm playing with the merge_models feature, which is very useful, but I'm wondering if there is a way for a merged model to keep track of the index of new topics added to it from their original models.

One use case of this is if I have some other metadata relating to my topics before the merge, and I want to link that metadata to the topics after the merge.

At the moment I'm doing this:

from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset
import numpy as np
import pandas as pd

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

abstracts_1 = dataset["abstract"][:5_000]
abstracts_2 = dataset["abstract"][5_000:10_000]

# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_2)

# calculate topic stat
topic_info_1 = topic_model_1.get_topic_info()
topic_info_1['Topic stat'] = np.random.randint(1, 10, topic_info_1.shape[0])
topic_info_2 = topic_model_2.get_topic_info()
topic_info_2['Topic stat'] = np.random.randint(1, 10, topic_info_2.shape[0])

merged_model = BERTopic.merge_models([topic_model_1, topic_model_2], min_similarity=0.9)
merged_info = merged_model.get_topic_info()

# map new merged topics to original model 2 
new_topic_nums = merged_info['Name'][len(topic_info_1):]
new_topic_nums = new_topic_nums.str.split("_", n=1).str[0].astype('int')
all_old_stats = topic_info_1['Topic stat']
selected_new_stats = topic_info_2['Topic stat'].loc[topic_info_2['Topic'].isin(new_topic_nums)]
merged_stats = pd.concat([all_old_stats, selected_new_stats]).tolist()
merged_info['Topic stat'] = merged_stats

But this seems really hacky and gets really complicated when merging more than 2 models.

It would be great to have a dictionary or something that mapped each sequential merge, e.g.

{
    "1": {         # merge 1
        "5": 53,       # topic num in original model: topic num in merged model
        "19": 54,
        ... },
    "2": {         # merge 2
        "7": 61,
        "12": 62,
        ... },
    ...
}

or something like that. I know there's already a topic mapper used for other purposes. Not sure if that could be utilised?
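To make the request concrete, here is a minimal sketch of how a mapping of that shape could be consumed to carry per-topic metadata through a merge. Nothing here is part of BERTopic: `carry_stats_through_merges`, `base_stats`, `later_stats`, and `merge_map` are hypothetical names, the numbers are made up, and integer keys are used instead of the string keys shown above.

```python
def carry_stats_through_merges(base_stats, later_stats, merge_map):
    """Attach per-topic stats to merged topic numbers.

    base_stats:  {topic: stat} for the first model (assumed to keep its topic numbers).
    later_stats: {merge_num: {original_topic: stat}} for each later model.
    merge_map:   {merge_num: {original_topic: merged_topic}} (the hypothetical mapping).
    """
    merged_stats = dict(base_stats)  # copy so the input is not mutated
    for merge_num, topic_map in merge_map.items():
        source = later_stats[merge_num]
        for original_topic, merged_topic in topic_map.items():
            merged_stats[merged_topic] = source[original_topic]
    return merged_stats


# Toy data only; these values do not come from a real model.
base_stats = {-1: 3, 0: 7, 1: 2}
later_stats = {1: {5: 9, 19: 4}}
merge_map = {1: {5: 53, 19: 54}}

merged_stats = carry_stats_through_merges(base_stats, later_stats, merge_map)
```

With a structure like this, supporting more than two models would only mean adding one entry per merge rather than re-deriving positions from topic names.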

zilch42 avatar Mar 27 '24 04:03 zilch42

Good question. I believe that is currently not implemented. What you could do is use the resulting .topics_ attribute to keep track of the topics that were assigned before and after merging, as I would think that would be more robust than parsing topic names.
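A minimal sketch of that idea, not taken from BERTopic itself: pair each document's topic before the merge with its topic after the merge and read the mapping off those pairs. The helper `map_topics_through_merge` is hypothetical, and it assumes that the merged model's .topics_ lists the documents of each input model back to back, in the order the models were passed to merge_models. That assumption should be verified on real models; the length check below only catches the most obvious mismatch.

```python
from collections import Counter, defaultdict


def map_topics_through_merge(topics_per_model, merged_topics):
    """Return {model_index: {original_topic: merged_topic}}.

    topics_per_model: list of per-document topic lists, one per input model.
    merged_topics:    per-document topics of the merged model, assumed to be
                      the input models' documents concatenated in order.
    """
    total = sum(len(topics) for topics in topics_per_model)
    if total != len(merged_topics):
        raise ValueError("merged_topics does not line up with the input models")

    mapping = {}
    offset = 0
    for model_index, topics in enumerate(topics_per_model):
        merged_slice = merged_topics[offset:offset + len(topics)]
        offset += len(topics)

        # Count which merged topic each original topic's documents ended up in.
        votes = defaultdict(Counter)
        for original_topic, merged_topic in zip(topics, merged_slice):
            votes[original_topic][merged_topic] += 1

        # Keep the most frequent target; normally there is exactly one.
        mapping[model_index] = {
            original_topic: counts.most_common(1)[0][0]
            for original_topic, counts in votes.items()
        }
    return mapping


# Toy per-document assignments; these are not real model output.
topics_1 = [0, 0, 1, -1]
topics_2 = [0, 1, 1, -1]
merged = [0, 0, 1, -1, 2, 1, 1, -1]

topic_mapping = map_topics_through_merge([topics_1, topics_2], merged)
```

With real models the call would look like `map_topics_through_merge([topic_model_1.topics_, topic_model_2.topics_], merged_model.topics_)`, under the same ordering assumption.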

MaartenGr avatar Mar 29 '24 07:03 MaartenGr

Thanks Maarten, I'll play around with that

zilch42 avatar Apr 01 '24 23:04 zilch42