BERTopic
v0.12
Highlights:
- Online/incremental topic modeling with `.partial_fit`
- Expose c-TF-IDF model for customization with `bertopic.vectorizers.ClassTfidfTransformer`
- Several parameters were added that may improve topic representations:
  - `bm25_weighting`
  - `reduce_frequent_words`
- Expose attributes for easier access to internal data
Online/Incremental topic modeling:
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic.vectorizers import OnlineCountVectorizer
from bertopic import BERTopic
# Prepare documents ("train" is one possible subset; "test" and "all" are also valid)
docs = fetch_20newsgroups(subset="train", remove=('headers', 'footers', 'quotes'))["data"]
# Prepare sub-models that support online learning
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)
topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model)
# Incrementally fit the topic model by training on 1000 documents at a time
for index in range(0, len(docs), 1000):
    topic_model.partial_fit(docs[index: index+1000])
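
The `decay` argument passed to `OnlineCountVectorizer` above controls how quickly older batches fade from the bag-of-words counts. The sketch below illustrates that idea in plain Python, assuming the counts are scaled by `1 - decay` before each new batch is added. `update_counts` is a hypothetical helper written for this illustration, not part of BERTopic, and `decay=0.5` is chosen only to make the numbers easy to follow.

```python
from collections import Counter

def update_counts(running, batch_tokens, decay=0.01):
    """Scale existing counts by (1 - decay), then add the new batch's counts.

    Hypothetical helper for illustration; not BERTopic's implementation.
    """
    # Older evidence loses a fixed fraction of its weight on every update
    decayed = {word: count * (1 - decay) for word, count in running.items()}
    # Counts from the newest batch are added at full weight
    for word, count in Counter(batch_tokens).items():
        decayed[word] = decayed.get(word, 0.0) + count
    return decayed

counts = {}
counts = update_counts(counts, ["topic", "model", "topic"], decay=0.5)
counts = update_counts(counts, ["model", "online"], decay=0.5)
print(counts)
```

A word that stops appearing in new batches keeps shrinking, so it gradually drops out of the topic representations, while a small `decay` such as `.01` makes that forgetting slow.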
c-TF-IDF model:
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
# Enable BM25-inspired weighting in the c-TF-IDF model
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)
topic_model = BERTopic(ctfidf_model=ctfidf_model)
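
To give a feel for what `bm25_weighting` and `reduce_frequent_words` change, here is a minimal pure-Python sketch of class-based TF-IDF. It assumes the formulas given in the BERTopic documentation: term frequency is L1-normalised per class, the default weight is `log(1 + A / f_x)` with `A` the average number of words per class and `f_x` the frequency of term `x` across all classes, the BM25 variant is `log(1 + (A - f_x + 0.5) / (f_x + 0.5))`, and `reduce_frequent_words` takes the square root of the term frequency. The function `ctfidf` and the toy counts are made up for illustration and are not the library's implementation.

```python
import math

def ctfidf(class_counts, bm25_weighting=False, reduce_frequent_words=False):
    """Weight terms per class (topic) from raw term counts.

    class_counts: list of non-empty dicts mapping term -> count, one per class.
    Illustrative sketch of the documented formulas, not BERTopic's code.
    """
    vocab = sorted({term for counts in class_counts for term in counts})
    # f_x: frequency of each term summed over all classes
    term_freq = {t: sum(c.get(t, 0) for c in class_counts) for t in vocab}
    # A: average number of words per class
    avg_words = sum(term_freq.values()) / len(class_counts)

    weights = []
    for counts in class_counts:
        total = sum(counts.values())
        row = {}
        for term in vocab:
            tf = counts.get(term, 0) / total  # L1-normalised term frequency
            if reduce_frequent_words:
                tf = math.sqrt(tf)  # dampen very frequent terms
            f = term_freq[term]
            if bm25_weighting:
                idf = math.log(1 + (avg_words - f + 0.5) / (f + 0.5))
            else:
                idf = math.log(1 + avg_words / f)
            row[term] = tf * idf
        weights.append(row)
    return weights

# Toy term counts for two topics; "the" is frequent in both
topics = [
    {"game": 4, "team": 3, "the": 5},
    {"space": 4, "orbit": 2, "the": 6},
]
plain = ctfidf(topics)
bm25 = ctfidf(topics, bm25_weighting=True)
print(plain[0]["the"], bm25[0]["the"])
```

In this toy data the BM25 variant pushes the weight of the shared word "the" down further than the default does, which is the kind of effect that can help when stop words have not been removed.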
Attributes:
Attribute | Description |
---|---|
topics_ | The topics that are generated for each document after training or updating the topic model. |
topic_sizes_ | The size of each topic. |
topic_mapper_ | A class for tracking topics and their mappings anytime they are merged/reduced. |
topic_representations_ | The top n terms per topic and their respective c-TF-IDF values. |
c_tf_idf_ | The topic-term matrix as calculated through c-TF-IDF. |
topic_labels_ | The default labels for each topic. |
custom_labels_ | Custom labels for each topic. |
topic_embeddings_ | The embeddings for each topic. |
representative_docs_ | The representative documents for each topic. |
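
As a rough illustration of how `topics_` and `topic_sizes_` relate, the sketch below collects per-batch topic assignments and counts them. It assumes that `topic_sizes_` amounts to the number of documents assigned to each topic, and that with `.partial_fit` the value of `topics_` reflects only the most recent batch, so assignments have to be accumulated by the caller. `batch_assignments` is made-up data standing in for `topics_` as read after each `.partial_fit` call.

```python
from collections import Counter

# Made-up topic assignments, standing in for `topic_model.topics_`
# as read after each of three `.partial_fit` calls.
batch_assignments = [[0, 1, 1], [2, 0, 1], [1, 2, 2]]

# Accumulate the assignments so that every document seen so far is covered
all_topics = []
for batch in batch_assignments:
    all_topics.extend(batch)

# Count documents per topic, mirroring what a topic-size lookup would hold
topic_sizes = dict(Counter(all_topics))
print(topic_sizes)
```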
Fixes:
- Fix #632 and #648