BERTopic
Set random seed in `hierarchical_topics`?
I've set the random seed when I fit my topic model, and I'm getting reproducible results. I'm using the following:
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

def fit_reduce_model(rep_model, docs):
    """
    Defines all component models internally besides the representation model, which is the only one that changes.
    Pre-calculates embeddings, fits model, and performs outlier reduction.

    parameters:
        rep_model, class instance from bertopic.representation: representation model
        docs, list of str: documents to model

    returns:
        topic_model, BERTopic model: fitted model with outliers reduced
    """
    # Define all component models
    print('Defining component models...')
    sentence_model = SentenceTransformer('allenai/scibert_scivocab_cased')
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
    ## Using default HDBSCAN model, no definition needed
    vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=10)
    representation_model = rep_model

    # Pre-calculate embeddings
    print('Calculating embeddings...')
    embeddings = sentence_model.encode(docs, show_progress_bar=True)

    # Reduce the embeddings with UMAP (5 components here) to allow quick iteration later on;
    # note that reduced_embeddings is not used in the rest of this function
    reduced_embeddings = umap_model.fit_transform(embeddings)

    # Fit the model
    print('Fitting model...')
    topic_model = BERTopic(
        embedding_model=sentence_model,
        umap_model=umap_model,
        representation_model=representation_model,
        vectorizer_model=vectorizer_model,
    )
    topics, probs = topic_model.fit_transform(docs, embeddings)

    # Reduce outliers
    print('Reducing outliers...')
    # This method ends up reducing all outliers even with this threshold
    new_topics = topic_model.reduce_outliers(docs, topics, strategy='embeddings', threshold=0.1)
    topic_model.update_topics(docs, topics=new_topics, vectorizer_model=vectorizer_model, representation_model=representation_model)

    return topic_model
However, when I run the following, I get varied results:
# Fit the model
mmr_rep_model = MaximalMarginalRelevance(diversity=0.3)
mmr_model = fit_reduce_model(mmr_rep_model, docs)
# Generate hierarchical topics
hierarchical_topics = mmr_model.hierarchical_topics(docs)
fig = mmr_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.show()
I don't see a way in the docs to set a random seed for `hierarchical_topics`; let me know if I've overlooked something!
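To narrow down whether the variation comes from `hierarchical_topics` itself or from an earlier step, one option is to compare fingerprints of intermediate outputs between runs. Below is a minimal sketch of that idea. The `fingerprint` helper is hypothetical (it is not part of BERTopic), and the lists are stand-in data in place of something like `mmr_model.topics_` collected from separate runs.

```python
import hashlib
import json


def fingerprint(values):
    """Return a short, stable hash of a sequence of topic ids or labels.

    Hypothetical helper for comparing runs; not part of BERTopic.
    """
    # Stringify each value so NumPy integers and plain ints serialize the same way.
    payload = json.dumps([str(v) for v in values])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Stand-in data; in practice these would be per-document topic assignments
# (for example mmr_model.topics_) captured from separate runs.
run_a = [0, 1, 1, -1, 2]
run_b = [0, 1, 1, -1, 2]
run_c = [0, 1, 2, -1, 2]

# Equal fingerprints mean the assignments were identical between runs.
print(fingerprint(run_a) == fingerprint(run_b))
print(fingerprint(run_a) == fingerprint(run_c))
```

If the fingerprints of the topic assignments already differ between runs, the variation is introduced before the hierarchy is built; if they match, the hierarchy step is the more likely place to look.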