Why does the same code give different results now than it did before?

Open superseanyoung opened this issue 9 months ago • 6 comments

Have you searched existing issues? 🔎

  • [x] I have searched and found no existing issues

Describe the bug

The number of topics BERTopic produced when I ran it in November 2024 does not match the number it produces now, even though my code has not changed at all.

Reproduction

from bertopic import BERTopic

BERTopic Version

0.16.4

superseanyoung avatar Mar 19 '25 04:03 superseanyoung

Have you read this?

MaartenGr avatar Mar 19 '25 07:03 MaartenGr

I have read about setting the random_state parameter, and I use random_state=0.

superseanyoung avatar Mar 19 '25 07:03 superseanyoung

Could you share your full code? What you shared doesn't help me understand where you set that parameter, how you set it, which environment you are using, etc.

Also, did you update anything in your environment? Any change (to dependencies, the environment, or the OS) is likely to affect this.

MaartenGr avatar Mar 19 '25 08:03 MaartenGr
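One way to answer the environment question is to record package versions next to each run and compare them later. The sketch below uses only the standard library; the function names `environment_snapshot` and `diff_snapshots` and the package list are illustrative assumptions, not part of BERTopic.

```python
import json
import platform
import sys
from importlib import metadata

# Hypothetical list of distributions relevant to this pipeline; adjust as needed.
PACKAGES = ["bertopic", "umap-learn", "hdbscan", "sentence-transformers",
            "numpy", "scikit-learn", "numba", "pynndescent"]


def environment_snapshot(packages=PACKAGES):
    """Return Python, OS, and installed package versions as a dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


def diff_snapshots(old, new):
    """List (package, old_version, new_version) for every package that changed."""
    names = sorted(set(old["packages"]) | set(new["packages"]))
    return [
        (n, old["packages"].get(n), new["packages"].get(n))
        for n in names
        if old["packages"].get(n) != new["packages"].get(n)
    ]


if __name__ == "__main__":
    # Save this output alongside the results of each run.
    print(json.dumps(environment_snapshot(), indent=2))
```

Saving the printed JSON with each set of results makes it possible to run `diff_snapshots` on two saved snapshots and see which dependencies moved between runs.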

I haven't tampered with any code, nor have I changed the operating system. I will send you my complete analysis code

superseanyoung avatar Mar 19 '25 08:03 superseanyoung

import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt
from matplotlib import rcParams
import numpy as np
import os
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import MaximalMarginalRelevance
from umap import UMAP
from hdbscan import HDBSCAN
import datamapplot

pd.set_option('display.max_rows', 20)

# Load the sentences and their timestamps
data = pd.read_csv("data/data_sentences_clean.csv")
timestamps = data["year"].tolist()
sentences = data["Segmented_Sentence"].tolist()

# Load the local embedding model
model_name = r'D:\big_models\acge_text_embedding'
if not model_name:
    raise ValueError("The model name must not be empty.")

try:
    transformer_model = SentenceTransformer(model_name)
    print("Model loaded successfully!")
except Exception as e:
    print(f"Failed to load model: {e}")
    exit()

# Precomputed sentence embeddings
embeddings = np.load("embeddings/embeddings.npy")

umap_model = UMAP(
    n_neighbors=40,
    n_components=15,  # too low loses information; too high makes clustering difficult
    min_dist=0.0,
    metric='cosine',
    # prediction_data=True,
    random_state=0
)
hdbscan_model = HDBSCAN(
    min_cluster_size=250,
    min_samples=16,
)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
representation_model = MaximalMarginalRelevance(diversity=0.5)

try:
    topic_model = BERTopic(
        embedding_model=transformer_model,
        min_topic_size=3,
        top_n_words=25,
        verbose=True,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        ctfidf_model=ctfidf_model,
        representation_model=representation_model,
    )
    topics, probs = topic_model.fit_transform(sentences, embeddings=embeddings)
    # df['Topic'] = topics
except Exception as e:
    print(f"BERTopic processing failed: {e}")
    exit()

topic_model.get_topic_info()

superseanyoung avatar Mar 19 '25 08:03 superseanyoung

It's a bit hard to read your code since there is no indentation. Also, could you format it with ``` blocks?

That said, it seems like your code should be reproducible. If you run it multiple times, do you get the same results? If so, then something must be different between then and now.

MaartenGr avatar Mar 19 '25 11:03 MaartenGr
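To check whether repeated runs really give the same result, it can help to reduce each run's topic assignments to a short summary and compare those. This is a minimal sketch using only the standard library; `topic_fingerprint` and `summarize_run` are hypothetical helper names, and the toy lists stand in for the `topics` list returned by `fit_transform` in the code above. It relies on BERTopic's convention of labelling outliers as topic -1.

```python
import hashlib
import json
from collections import Counter


def topic_fingerprint(topics):
    """Hash a list of topic ids so two runs can be compared with one string."""
    payload = json.dumps([int(t) for t in topics]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def summarize_run(topics):
    """Return topic count (excluding outlier topic -1), outlier count, and fingerprint."""
    counts = Counter(int(t) for t in topics)
    return {
        "n_topics": len([t for t in counts if t != -1]),
        "n_outliers": counts.get(-1, 0),
        "fingerprint": topic_fingerprint(topics),
    }


# Toy data standing in for the topic assignments of two separate runs.
run_a = [0, 1, 1, -1, 2]
run_b = [0, 1, 1, -1, 2]
print(summarize_run(run_a) == summarize_run(run_b))
```

If the summaries match across repeated runs today but differ from the November 2024 results, that points toward a change in the environment or input data rather than randomness in the pipeline.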