BERTopic
Getting representative docs after online fitting (MaartenGr/BERTopic, Issue #1679)
Often partial_fit is invoked multiple times to process a large dataset. It seems that the attribute representative_docs_ is not populated in that case.
Is there an easy way to get representative docs in such a case?
You would have to use the internal functions to extract them. I believe there are a number of open issues that have some code related to this, so I would advise searching through those.
You could also use .merge_models for partial_fit-like functionality, but I believe that method also does not save representative docs.
@MaartenGr thanks! Following your suggestion, here is a snippet using the internal functions _create_topic_vectors and _extract_representative_docs (the call to _save_representative_docs is left commented out).
Suppose that docs are the documents (a pandas Series, since they are indexed with .loc), embeds are their embeddings, topic_model is the model fitted online, and train_idxs are the indexes in the shuffled order (if applicable). We populate the topic embeddings first; then we are in a position to populate the representative documents:
import pandas as pd

# topics and docs combined, as required by the internal functions
doc_topic = pd.DataFrame({
    'Topic': topic_model.topics_,
    'ID': range(len(topic_model.topics_)),
    'Document': docs.loc[train_idxs],
})
# populate topic embeddings
topic_model._create_topic_vectors(doc_topic, embeds[train_idxs])
# topic_model._save_representative_docs(doc_topic)
# extract up to 5 representative docs per topic from at most 1000 sampled docs
repr_docs, _, _, _ = topic_model._extract_representative_docs(
    topic_model.c_tf_idf_,
    doc_topic,
    topic_model.topic_representations_,
    nr_samples=1000,
    nr_repr_docs=5,
)
topic_model.representative_docs_ = repr_docs
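I tested this on >1M documents; here is an example: image.png (https://github.com/MaartenGr/BERTopic/assets/31315784/93dadba8-5bfd-4f71-a3da-2a670a91ba9a)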
That's great, thanks for sharing! Other users will definitely benefit from having this code snippet here.
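For readers wondering what "representative" means here, the following is a minimal, standard-library-only sketch of the general idea: score each document against an aggregate term vector of its topic and keep the top few. It is not BERTopic's implementation. BERTopic works with c-TF-IDF weights and samples documents per topic, whereas this sketch uses raw term counts and no sampling. The names `cosine` and `representative_docs` are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def representative_docs(docs, topics, nr_repr_docs=2):
    """Return {topic: [most representative docs]} (illustrative only).

    Assumption: raw whitespace-tokenized term counts stand in for the
    c-TF-IDF weights that BERTopic uses.
    """
    # One bag-of-words vector per document.
    doc_vecs = [Counter(d.lower().split()) for d in docs]

    # Aggregate the term counts of all documents assigned to each topic.
    topic_vecs = defaultdict(Counter)
    for vec, topic in zip(doc_vecs, topics):
        topic_vecs[topic].update(vec)

    result = {}
    for topic, topic_vec in topic_vecs.items():
        members = [i for i, t in enumerate(topics) if t == topic]
        # Rank the topic's documents by similarity to the topic vector.
        members.sort(key=lambda i: cosine(doc_vecs[i], topic_vec), reverse=True)
        result[topic] = [docs[i] for i in members[:nr_repr_docs]]
    return result
```

Calling `representative_docs(docs, topics, nr_repr_docs=5)` with one topic label per document mirrors the role of `nr_repr_docs=5` in the snippet above, minus the sampling step controlled by `nr_samples`.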
@MaartenGr if you don't mind, I would volunteer to open a PR that expands the online tutorial example a bit, demonstrating the use of these internal functions on News20?
@maciejskorski Thanks for volunteering, definitely appreciate it! However, I am not sure whether something like this should have a place in the official documentation. Generally, it is inadvisable to access private functions/attributes, as they can easily change and break regardless of whether you use semantic versioning. As such, I cannot provide any official support for anything that is being done with private functions/attributes.
Instead, a function that specifically populates the representative documents could be exposed as an additional feature.
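Until such a feature exists, the snippet from this thread could be wrapped in a small user-side helper, roughly as sketched below. This is only a sketch: `populate_representative_docs` is a hypothetical name and not part of BERTopic's API. It calls the private methods `_create_topic_vectors` and `_extract_representative_docs` exactly as they are used earlier in this thread, so it is unsupported and may break with any BERTopic release. It also assumes that `topic_model.topics_` already holds one topic per document across all batches.

```python
import pandas as pd


def populate_representative_docs(topic_model, docs, embeddings,
                                 nr_samples=1000, nr_repr_docs=5):
    """Hypothetical helper, NOT part of BERTopic's public API.

    Relies on private BERTopic methods as used in this thread; their
    names and signatures may change between versions.
    """
    docs = list(docs)
    topics = list(topic_model.topics_)
    # Assumption: topics_ covers every document, in the same order as docs.
    if len(topics) != len(docs):
        raise ValueError("topic_model.topics_ must have one entry per document")

    # Topics and docs combined, in the layout the internal functions expect.
    documents = pd.DataFrame({
        "Document": docs,
        "ID": range(len(docs)),
        "Topic": topics,
    })

    # Populate topic embeddings, then extract representative docs per topic.
    topic_model._create_topic_vectors(documents, embeddings)
    repr_docs, _, _, _ = topic_model._extract_representative_docs(
        topic_model.c_tf_idf_,
        documents,
        topic_model.topic_representations_,
        nr_samples=nr_samples,
        nr_repr_docs=nr_repr_docs,
    )
    topic_model.representative_docs_ = repr_docs
    return repr_docs
```

With the variables from the earlier snippet, a call might look like `populate_representative_docs(topic_model, docs.loc[train_idxs], embeds[train_idxs])`.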