BERTopic
Results of `transform` are different from the merged topic model's `get_topic_info()` output
Edit: Surprisingly, I had missed both the `topics_` attribute and the `get_document_info()` method. My question has changed slightly: I am now wondering why the output of `transform` differs from the original assignment made during training.
I have just noticed a problem where the outputs from `transform` don't match the counts from the `get_topic_info()` method.
That is, the number of documents assigned to each topic is not consistent between the two.
Here is a minimal reproducible example:
import pandas as pd
from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
# Extract abstracts to train on and corresponding titles
abstracts_1 = dataset["abstract"][:500]
abstracts_2 = dataset["abstract"][500:1000]
abstracts_3 = dataset["abstract"][1000:1500]
# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_2)
topic_model_3 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_3)
# Combine all models into one
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2, topic_model_3])
display(merged_model.get_topic_info())
all_abstracts = pd.DataFrame({'documents': abstracts_1 + abstracts_2 + abstracts_3})
all_abstracts['topic'] = merged_model.transform(all_abstracts['documents'])[0]
display(all_abstracts['topic'].value_counts())
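The mismatch between two sets of topic assignments can also be quantified per topic. Below is a minimal sketch using only the standard library; the helper names `compare_topic_counts` and `agreement_rate` are hypothetical and not part of BERTopic, and the topic ids are toy data rather than real output.

```python
from collections import Counter


def compare_topic_counts(fit_topics, transform_topics):
    """Return {topic: (count_at_fit, count_from_transform)} for every topic seen."""
    fit_counts = Counter(fit_topics)
    transform_counts = Counter(transform_topics)
    all_topics = sorted(set(fit_counts) | set(transform_counts))
    # Counter returns 0 for topics that are missing on one side
    return {t: (fit_counts[t], transform_counts[t]) for t in all_topics}


def agreement_rate(fit_topics, transform_topics):
    """Fraction of documents that received the same topic in both assignments."""
    if len(fit_topics) != len(transform_topics):
        raise ValueError("Both assignments must cover the same documents")
    if len(fit_topics) == 0:
        return 0.0
    matches = sum(a == b for a, b in zip(fit_topics, transform_topics))
    return matches / len(fit_topics)


# Toy data (hypothetical): one topic id per document, in the same document order
fit_topics = [0, 0, 1, -1, 1, 2]
transform_topics = [0, 1, 1, 0, 1, 2]

counts = compare_topic_counts(fit_topics, transform_topics)
rate = agreement_rate(fit_topics, transform_topics)
print(counts)
print(f"agreement: {rate:.2f}")
```

For a single fitted model, the two inputs could be `topic_model.topics_` and `topic_model.transform(docs)[0]`, provided `docs` is in the same order as the training documents.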
Here is my working example:
embeddings = all_embeddings['voyageai'].copy()
display(embeddings)
mode_groups = embeddings.groupby('mode')
mode_dfs = [mode_groups.get_group(i).reset_index(drop=True) for i in range(3)]
mode_models = [BERTopic() for _ in mode_dfs]
for model, df in zip(mode_models, mode_dfs):
    model.fit_transform(
        df['si'],
        np.array([np.array(x) for x in df['si_embedding'].to_numpy()])
    )
    display(model.get_topic_info())
merged_model = BERTopic.merge_models(mode_models, min_similarity=0.9)
display(merged_model.get_topic_info())
embeddings['topic'] = merged_model.transform(embeddings['si'], np.array([np.array(x) for x in embeddings['si_embedding'].to_numpy()]))[0]
embeddings['topic'].value_counts()
Output:
What am I missing, and why can the topic assignments reported by the merged model differ so much from the transformed values? Furthermore, how should I be getting the topics for the original documents?