BERTopic: results of `transform` differ from the merged topic model's `get_topic_info()` output

Open 1jamesthompson1 opened this issue 8 months ago • 1 comments

Edit: Surprisingly, I had missed both the `topics_` attribute and the `get_document_info()` method. My question has changed a little: I am now wondering why the output of `transform` differs from the original topic assignment made during training.

I have just noticed a problem: the outputs from `transform` don't match the counts from the `get_topic_info()` method.

That is, the number of documents assigned to each topic is not consistent between the two.
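To make the mismatch concrete, a small helper along these lines can put the two sets of assignments side by side. This is only a sketch: `compare_topic_counts` and the six example labels are hypothetical and not taken from the models in this issue, and it assumes both inputs are plain sequences of integer topic ids in the same document order.

```python
import pandas as pd


def compare_topic_counts(train_topics, transform_topics):
    """Return per-topic document counts from two label sequences, side by side."""
    counts = pd.DataFrame({
        "train_count": pd.Series(list(train_topics)).value_counts(),
        "transform_count": pd.Series(list(transform_topics)).value_counts(),
    })
    # A topic missing from one side shows up as NaN after alignment; treat it as 0.
    counts = counts.fillna(0).astype(int)
    counts["difference"] = counts["transform_count"] - counts["train_count"]
    return counts.sort_index()


# Hypothetical labels for six documents (-1 is the outlier topic).
train_topics = [0, 0, 1, 1, -1, 2]
transform_topics = [0, 1, 1, 1, -1, -1]

counts = compare_topic_counts(train_topics, transform_topics)
print(counts)

# Cross-tabulation: rows are training topics, columns are transform topics,
# so off-diagonal cells show which documents changed topic.
moved = pd.crosstab(
    pd.Series(train_topics, name="train"),
    pd.Series(transform_topics, name="transform"),
)
print(moved)
```

For a single fitted model, the training labels would come from `topic_model.topics_` and the second list from `topic_model.transform(docs)[0]`. Whether the merged model exposes per-document training labels in the same way is not something this sketch assumes.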

Here is a minimal reproducible example:

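import pandas as pd
from IPython.display import display  # display() is only implicit inside notebooks
from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset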

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Extract abstracts to train on and corresponding titles
abstracts_1 = dataset["abstract"][:500]
abstracts_2 = dataset["abstract"][500:1000]
abstracts_3 = dataset["abstract"][1000:1500]

# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_2)
topic_model_3 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_3)

# Combine all models into one
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2, topic_model_3])

display(merged_model.get_topic_info())

all_abstracts = pd.DataFrame({'documents': abstracts_1 + abstracts_2 + abstracts_3})
all_abstracts['topic'] = merged_model.transform(all_abstracts['documents'])[0]

display(all_abstracts['topic'].value_counts())

Here is my working example:

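import numpy as np

# all_embeddings['voyageai'] is a DataFrame with 'mode', 'si' and 'si_embedding' columns (not shown here)
embeddings = all_embeddings['voyageai'].copy()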

display(embeddings)

mode_groups = embeddings.groupby('mode')
mode_dfs = [mode_groups.get_group(i).reset_index(drop=True) for i in range(3)]

mode_models = [BERTopic() for _ in mode_dfs]

for model, df in zip(mode_models, mode_dfs):
    model.fit_transform(
        df['si'],
        np.array([np.array(x) for x in df['si_embedding'].to_numpy()])
    )
    display(model.get_topic_info())

merged_model = BERTopic.merge_models(mode_models, min_similarity=0.9)

display(merged_model.get_topic_info())

embeddings['topic'] = merged_model.transform(embeddings['si'], np.array([np.array(x) for x in embeddings['si_embedding'].to_numpy()]))[0]

embeddings['topic'].value_counts()

Output:

What am I missing, and why can the topic assignments in the merged model be so different from the values returned by `transform`? Furthermore, how should I be getting the topics for the original documents?

May 29 '24 08:05 1jamesthompson1
