Why does the same code produce different results now than it did before?
Have you searched existing issues? 🔎
- [x] I have searched and found no existing issues
Describe the bug
The number of topics BERTopic produced when I ran it in November 2024 does not match the number I get now, even though my code has not changed at all.
Reproduction
from bertopic import BERTopic
BERTopic Version
0.16.4
Have you read this?
I have read about setting the random_state parameter; I use random_state=0.
Could you share your full code? What you shared doesn't help me understand where you set that parameter, how you set it, which environment you are using, etc.
Also, did you update anything in your environment? Any change (including dependencies, environments, or the OS) is likely to affect this.
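To make "did anything change?" checkable, one option is to record the environment next to each result and diff the records later. The following is a minimal sketch, not part of BERTopic: the function name `environment_snapshot` and the package list are assumptions chosen for illustration, and the list may need extending for a given setup.

```python
import json
import platform
from importlib import metadata

# Hypothetical list of distributions that plausibly influence the topics;
# adjust it to match your own environment.
PACKAGES = [
    "bertopic",
    "umap-learn",
    "hdbscan",
    "numpy",
    "scikit-learn",
    "sentence-transformers",
    "numba",
    "pynndescent",
]


def environment_snapshot(packages=PACKAGES):
    """Return Python, OS, and package versions as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Save this output alongside each run so two runs can be compared.
    print(json.dumps(environment_snapshot(), indent=2))
```

Comparing the JSON saved in November with the JSON from today would show whether a dependency was upgraded in between.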
I haven't changed any code, nor have I changed the operating system. Here is my complete analysis code:
import os

import datamapplot
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from matplotlib import rcParams
from sentence_transformers import SentenceTransformer
from umap import UMAP

pd.set_option('display.max_rows', 20)

# Load the cleaned sentences and their timestamps.
data = pd.read_csv("data/data_sentences_clean.csv")
timestamps = data["year"].tolist()
sentences = data["Segmented_Sentence"].tolist()

# Load the local embedding model.
model_name = r'D:\big_models\acge_text_embedding'
if not model_name:
    raise ValueError("Model name must not be empty.")
try:
    transformer_model = SentenceTransformer(model_name)
    print("Model loaded successfully!")
except Exception as e:
    print(f"Model loading failed: {e}")
    exit()

# Reuse precomputed embeddings instead of re-encoding the sentences.
embeddings = np.load("embeddings/embeddings.npy")

umap_model = UMAP(
    n_neighbors=40,
    n_components=15,  # too low loses information, too high makes clustering hard
    min_dist=0.0,
    metric='cosine',
    #prediction_data=True,
    random_state=0
)
hdbscan_model = HDBSCAN(
    min_cluster_size=250,
    min_samples=16,
)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
representation_model = MaximalMarginalRelevance(diversity=0.5)

try:
    topic_model = BERTopic(
        embedding_model=transformer_model,
        min_topic_size=3,
        top_n_words=25,
        verbose=True,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        ctfidf_model=ctfidf_model,
        representation_model=representation_model,
    )
    topics, probs = topic_model.fit_transform(sentences, embeddings=embeddings)
    #df['Topic'] = topics
except Exception as e:
    print(f"BERTopic processing failed: {e}")
    exit()

topic_model.get_topic_info()
It's a bit hard to read your code since there is no indentation. Also, could you format it with ``` blocks?
That said, it seems like your code should be reproducible. If you run it multiple times, do you get the same results? If so, then something must be different between then and now.
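The "run it multiple times" check can be made concrete by comparing the topic assignments of two runs and by confirming that the saved embeddings file is byte-for-byte the same one used earlier. This is only a sketch under assumptions: the helper names `file_sha256`, `summarize_topics`, and `runs_match` are hypothetical and not part of BERTopic; the only BERTopic convention relied on is that topic `-1` marks outliers.

```python
import hashlib
from collections import Counter


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file (e.g. embeddings/embeddings.npy) to detect silent changes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def summarize_topics(topics):
    """Count topics, outliers, and topic sizes for one run's assignments."""
    counts = Counter(topics)
    sizes = sorted(
        (count for topic, count in counts.items() if topic != -1),
        reverse=True,
    )
    return {
        "n_topics": len(sizes),
        "n_outliers": counts.get(-1, 0),  # -1 is the outlier topic
        "sizes": sizes,
    }


def runs_match(topics_a, topics_b):
    """True if two runs assigned every document to the same topic id."""
    return list(topics_a) == list(topics_b)
```

With the `topics` list returned by `fit_transform` from two separate runs, `runs_match(topics_run1, topics_run2)` indicates whether the current environment is deterministic, and `summarize_topics` shows how the topic counts differ if it is not. A changed `file_sha256` value for the embeddings would point at the input data rather than at the model.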