BERTopic getting representative docs after online-fitting

partial_fit is often invoked multiple times to process a large dataset.

It seems that the attribute representative_docs_ is not populated in that case. Is there an easy way to get representative documents after online fitting?

Dec 08 '23 12:12 maciejskorski

You would have to use the internal functions to extract them. I believe there are a number of open issues that have some code related to this, so I would advise searching through those.

You could also use .merge_models for partial_fit-like functionality but I believe that method also does not save representative docs.

Dec 08 '23 13:12 MaartenGr

@MaartenGr thanks! Following your suggestion, here is a snippet using the internal functions _create_topic_vectors and _extract_representative_docs (the latter in place of _save_representative_docs).

Suppose that docs holds the documents, embeds holds their embeddings, topic_model is the model fitted online, and train_idxs are the indexes in the shuffled order (if applicable). We populate the topic embeddings first, and are then in a position to populate the representative documents:

import pandas as pd

# Topics and documents combined, as required by the internal functions.
doc_topic = pd.DataFrame({
    'Topic': topic_model.topics_,
    'ID': range(len(topic_model.topics_)),
    'Document': docs.loc[train_idxs],
})
# Populate the topic embeddings.
topic_model._create_topic_vectors(doc_topic, embeds[train_idxs])
# topic_model._save_representative_docs(doc_topic)  # alternative, left disabled
# Extract up to 5 representative documents per topic from 1000 sampled candidates.
repr_docs, _, _, _ = topic_model._extract_representative_docs(
    topic_model.c_tf_idf_,
    doc_topic,
    topic_model.topic_representations_,
    nr_samples=1000,
    nr_repr_docs=5,
)
topic_model.representative_docs_ = repr_docs
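
For intuition about what the extraction step is doing, here is a much simplified, pure-Python sketch of the general idea: within each topic, rank the documents by cosine similarity to the topic's term weights and keep the top few. This is not BERTopic's implementation (which, as far as I understand, works on c-TF-IDF representations and samples candidates first); the helper names cosine and representative_docs and the toy term weights are made up for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def representative_docs(docs, topics, topic_weights, nr_repr_docs=2):
    """Return, per topic, the documents most similar to that topic's term weights."""
    result = {}
    for topic, weights in topic_weights.items():
        # Only documents assigned to this topic are candidates.
        members = [doc for doc, t in zip(docs, topics) if t == topic]
        ranked = sorted(
            members,
            key=lambda doc: cosine(Counter(doc.lower().split()), weights),
            reverse=True,
        )
        result[topic] = ranked[:nr_repr_docs]
    return result


# Toy data (hypothetical): five documents assigned to two topics.
docs = [
    "cats purr and cats sleep",
    "my cat likes the garden",
    "stocks fell as markets closed",
    "stocks and markets rallied",
    "the weather was fine",
]
topics = [0, 0, 1, 1, 1]
topic_weights = {
    0: {"cats": 2.0, "purr": 1.0},
    1: {"stocks": 1.5, "markets": 1.5},
}
top = representative_docs(docs, topics, topic_weights, nr_repr_docs=1)
print(top)
```

In the real model, the role of topic_weights is played by the topic-term matrix (c_tf_idf_), which is why the topic representations have to be in place before representative documents can be extracted.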

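I tested this on more than 1M documents; here is an example (screenshot): https://github.com/MaartenGr/BERTopic/assets/31315784/93dadba8-5bfd-4f71-a3da-2a670a91ba9a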

Jan 11 '24 01:01 maciejskorski

That's great, thanks for sharing! Other users will definitely benefit from having this code snippet here.

Jan 11 '24 07:01 MaartenGr

@MaartenGr if you don't mind, I would volunteer to open a PR that slightly expands the online tutorial example, demonstrating the use of these internal functions on News20.

Jan 11 '24 12:01 maciejskorski

Lin, Xule reacted to your message.

Jan 11 '24 13:01 linxule

@maciejskorski Thanks for volunteering, I definitely appreciate it! However, I am not sure whether something like this should have a place in the official documentation. Generally, accessing private functions/attributes is inadvisable, as they can easily change and break regardless of whether you use semantic versioning. As such, I cannot provide any official support for anything that is done with private functions/attributes.

Instead, a function that specifically populates the representative documents could be exposed as an additional feature.
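
Until such a function exists, one way to contain the risk described above is to keep the private call inside a single small helper that fails loudly if the internals change. The following is only a sketch: populate_representative_docs is a hypothetical name, not part of BERTopic's API, its defaults are arbitrary, and the call pattern is simply the one from the snippet earlier in this thread. The stub class exists only so the helper can be exercised without BERTopic installed.

```python
def populate_representative_docs(topic_model, doc_topic, nr_samples=500, nr_repr_docs=5):
    """Hypothetical helper: fill `representative_docs_` on an online-fitted model.

    `doc_topic` is the combined topics/documents frame built as in the snippet above.
    """
    extract = getattr(topic_model, "_extract_representative_docs", None)
    if extract is None:
        # The private method may be renamed or removed in another version.
        raise RuntimeError(
            "topic_model has no _extract_representative_docs; "
            "the internals this helper relies on may have changed."
        )
    # Keep only the first returned value, as the snippet in this thread does.
    repr_docs, *_ = extract(
        topic_model.c_tf_idf_,
        doc_topic,
        topic_model.topic_representations_,
        nr_samples=nr_samples,
        nr_repr_docs=nr_repr_docs,
    )
    topic_model.representative_docs_ = repr_docs
    return repr_docs


class _StubModel:
    """Stand-in for a fitted model, used only to exercise the helper here."""

    c_tf_idf_ = None
    topic_representations_ = {0: [("cat", 0.9)]}

    def _extract_representative_docs(self, c_tf_idf, documents, topics,
                                     nr_samples=500, nr_repr_docs=5):
        # Pretend every document belongs to topic 0 and return the first few.
        return {0: documents[:nr_repr_docs]}, None, None, None


model = _StubModel()
docs_out = populate_representative_docs(model, ["doc a", "doc b", "doc c"], nr_repr_docs=2)
print(docs_out)
```

With a real model, the call would be populate_representative_docs(topic_model, doc_topic) after the topic embeddings have been created; any breakage from a library update then surfaces in one place rather than throughout the codebase.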

Jan 11 '24 16:01 MaartenGr
