
Set random seed in `hierarchical_topics`?

Open serenalotreck opened this issue 3 months ago • 9 comments

I've set the random seed when I fit my topic model, and I'm getting reproducible results. I'm using the following:

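from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

def fit_reduce_model(rep_model, docs):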
    """
    Defines all component models internally besides the representation model, which is the only one that changes.
    Pre-calculates embeddings, fits model, and performs outlier reduction.

    parameters:
        rep_model, class instance from bertopic.representation: representation model
        docs, list of str: documents to model

    returns:
        topic_model, BERTopic model: fitted model with outliers reduced
    """
    # Define all component models
    print('Defining component models...')
    sentence_model = SentenceTransformer('allenai/scibert_scivocab_cased')
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
    ## Using default HDBSCAN model, no definition needed
    vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=10)
    representation_model = rep_model
    
    # Pre-calculate embeddings
    print('Calculating embeddings...')
    embeddings = sentence_model.encode(docs, show_progress_bar=True)
    # Reduce the embeddings with UMAP (5 dimensions, per n_components above); reduced_embeddings is not used again in this function
    reduced_embeddings = umap_model.fit_transform(embeddings)
    
    # Fit the model
    print('Fitting model...')
    topic_model = BERTopic(embedding_model=sentence_model, umap_model=umap_model, representation_model=representation_model, vectorizer_model=vectorizer_model)
    topics, probs = topic_model.fit_transform(docs, embeddings)
    
    # Reduce outliers
    print('Reducing outliers...')
    new_topics = topic_model.reduce_outliers(docs, topics, strategy='embeddings', threshold=0.1) # This method ends up reducing all outliers even with this threshold
    topic_model.update_topics(docs, topics=new_topics, vectorizer_model=vectorizer_model, representation_model=representation_model)
    
    return topic_model

However, when I run the following, I get varied results:

# Fit the model
mmr_rep_model = MaximalMarginalRelevance(diversity=0.3)
mmr_model = fit_reduce_model(mmr_rep_model, docs)

# Generate hierarchical topics
hierarchical_topics = mmr_model.hierarchical_topics(docs)
fig = mmr_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.show()

I don't see a way in the docs to set a random seed for hierarchical_topics; let me know if I've overlooked something!
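
One way to narrow down where the variation comes from is to repeat a single step several times and compare fingerprints of its output. The sketch below is only an illustration: the helper names `fingerprint` and `is_reproducible` are hypothetical and not part of BERTopic, and the demo uses stand-in callables rather than a fitted model.

```python
import hashlib
import random
from typing import Any, Callable


def fingerprint(obj: Any, to_text: Callable[[Any], str] = repr) -> str:
    """Return a short SHA-256 digest of the object's text form."""
    return hashlib.sha256(to_text(obj).encode("utf-8")).hexdigest()[:12]


def is_reproducible(step: Callable[[], Any], runs: int = 3,
                    to_text: Callable[[Any], str] = repr) -> bool:
    """Call `step` `runs` times and report whether every output matched."""
    digests = {fingerprint(step(), to_text) for _ in range(runs)}
    return len(digests) == 1


if __name__ == "__main__":
    # Stand-in steps (hypothetical): one deterministic, one unseeded.
    print(is_reproducible(lambda: sorted([3, 1, 2])))
    print(is_reproducible(lambda: random.random()))
```

With the model from above, the same check could be applied to one fitted instance, for example `is_reproducible(lambda: mmr_model.hierarchical_topics(docs), to_text=lambda df: df.to_csv(index=False))`. If that reports `True` while a fresh `fit_reduce_model` call each time reports `False`, the variation would seem to arise during fitting rather than in `hierarchical_topics` itself.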

serenalotreck, Apr 01 '24 21:04