BERTopic

v0.12

Open MaartenGr opened this issue 1 year ago • 0 comments

Highlights:

  • Online/incremental topic modeling with .partial_fit
  • Expose the c-TF-IDF model for customization through bertopic.vectorizers.CTfidfTransformer
    • Added the bm25_weighting and reduce_frequent_words parameters, which may improve topic representations
  • Expose attributes for easier access to internal data

Online/Incremental topic modeling:

from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic.vectorizers import OnlineCountVectorizer
from bertopic import BERTopic

# Prepare documents
docs = fetch_20newsgroups(subset="all", remove=('headers', 'footers', 'quotes'))["data"]

# Prepare sub-models that support online learning
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)

topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model)

# Incrementally fit the topic model by training on 1000 documents at a time
for index in range(0, len(docs), 1000):
    topic_model.partial_fit(docs[index: index+1000])
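The decay value passed to OnlineCountVectorizer above controls how quickly older batches fade from the bag-of-words counts. A minimal pure-Python sketch of that idea follows, assuming the accumulated counts are scaled by (1 - decay) before each new batch is added. DecayingCounts is a hypothetical helper written for illustration and is not part of BERTopic.

```python
from collections import Counter


class DecayingCounts:
    """Toy running word counts that fade by `decay` at every update.

    Hypothetical illustration only; not BERTopic's OnlineCountVectorizer.
    """

    def __init__(self, decay=0.01):
        if not 0.0 <= decay <= 1.0:
            raise ValueError("decay must be between 0 and 1")
        self.decay = decay
        self.counts = {}

    def partial_fit(self, docs):
        # Shrink everything seen so far, so older batches weigh less.
        for word in self.counts:
            self.counts[word] *= 1.0 - self.decay
        # Add the raw counts of the new batch (naive whitespace tokenizer).
        batch = Counter(word for doc in docs for word in doc.lower().split())
        for word, n in batch.items():
            self.counts[word] = self.counts.get(word, 0.0) + n
        return self


counter = DecayingCounts(decay=0.5)
counter.partial_fit(["space shuttle launch"])
counter.partial_fit(["space station"])
print(counter.counts)
```

With decay=0.5, a word seen only in the first batch ends at 0.5, while "space", seen in both batches, ends at 1.5. A small value such as .01 lets old vocabulary fade slowly, so topics can drift with the data without forgetting earlier batches right away.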

c-TF-IDF model:

from bertopic import BERTopic
from bertopic.vectorizers import CTfidfTransformer

# Enable BM25-style weighting in the c-TF-IDF step
ctfidf_model = CTfidfTransformer(bm25_weighting=True)
topic_model = BERTopic(ctfidf_model=ctfidf_model)
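To make the effect of the two new parameters concrete, here is a rough pure-Python sketch of class-based TF-IDF weighting. It assumes the default inverse-frequency term is log(1 + A / f), the BM25 variant is log(1 + (A - f + 0.5) / (f + 0.5)), and reduce_frequent_words takes the square root of the normalized term frequency, where f is a term's total frequency over all topics and A is the average number of words per topic. These formulas and the ctfidf_weights helper are assumptions for illustration, not the library's code.

```python
import math


def ctfidf_weights(class_counts, bm25_weighting=False, reduce_frequent_words=False):
    """Weight a topic-by-term count table in the style of c-TF-IDF.

    class_counts: one row of raw term counts per topic. Every topic must
    contain at least one word and every term must occur at least once.
    Illustrative sketch; the formulas are assumptions, not library code.
    """
    n_terms = len(class_counts[0])
    # f: total frequency of each term across all topics.
    freq = [sum(row[j] for row in class_counts) for j in range(n_terms)]
    # A: average number of words per topic.
    avg_words = sum(sum(row) for row in class_counts) / len(class_counts)

    idf = []
    for f in freq:
        if bm25_weighting:
            idf.append(math.log(1 + (avg_words - f + 0.5) / (f + 0.5)))
        else:
            idf.append(math.log(1 + avg_words / f))

    weighted = []
    for row in class_counts:
        total = sum(row)
        tf = [count / total for count in row]  # L1-normalized term frequency
        if reduce_frequent_words:
            tf = [math.sqrt(t) for t in tf]  # dampen very frequent terms
        weighted.append([t * w for t, w in zip(tf, idf)])
    return weighted


counts = [[4, 1, 0], [1, 0, 5]]  # 2 topics x 3 terms, made-up numbers
plain = ctfidf_weights(counts)
bm25 = ctfidf_weights(counts, bm25_weighting=True)
damped = ctfidf_weights(counts, reduce_frequent_words=True)
print(plain[0], bm25[0], damped[0])
```

Under these assumptions, the BM25 variant gives much less weight to terms that are frequent across all topics, which is useful when stop words have not been removed. The square root reduces the gap between very frequent and moderately frequent terms within a topic.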

Attributes:

Attribute                 Description
topics_                   The topics generated for each document after training or updating the topic model.
topic_sizes_              The size of each topic.
topic_mapper_             A class for tracking topics and their mappings anytime they are merged/reduced.
topic_representations_    The top n terms per topic and their respective c-TF-IDF values.
c_tf_idf_                 The topic-term matrix as calculated through c-TF-IDF.
topic_labels_             The default labels for each topic.
custom_labels_            Custom labels for each topic.
topic_embeddings_         The embeddings for each topic.
representative_docs_      The representative documents for each topic.
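As a small illustration of working with these attributes, the sketch below derives a size table from topics_ alone. It assumes topics_ is a list with one topic id per document, as the description above suggests. topic_size_table is a hypothetical helper, and a stand-in object replaces a fitted model so the snippet runs without training anything.

```python
from collections import Counter
from types import SimpleNamespace


def topic_size_table(topic_model):
    """Return (topic, size) pairs, largest first, derived from `topics_`."""
    sizes = Counter(topic_model.topics_)
    # Sort by descending size, breaking ties by topic id.
    return sorted(sizes.items(), key=lambda item: (-item[1], item[0]))


# Stand-in for demonstration; a fitted BERTopic model would be passed instead.
fake_model = SimpleNamespace(topics_=[0, 1, 0, -1, 0, 1])
print(topic_size_table(fake_model))
```

On a fitted model, the same counts should be available directly through topic_sizes_, so a helper like this is mainly useful for sanity checks after merging or reducing topics.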

Fixes:

  • Fix #632 and #648

MaartenGr • Aug 10 '22 08:08