How to retrieve indexes of representative docs?

Open clstaudt opened this issue 2 years ago • 5 comments

The get_topic_info() method returns a dataframe with the column Representative_Docs, in which we find the content of documents as strings. How can I link them back to the training set? Can I retrieve their index in the list of training documents?

clstaudt avatar Oct 12 '23 11:10 clstaudt

Unfortunately, that is not easily possible without matching the documents themselves against the representative documents. Alternatively, you could take a look at the internal _extract_representative_docs function that creates the representative documents. It returns a number of things, among which are the indices, I believe.

MaartenGr avatar Oct 12 '23 13:10 MaartenGr
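For readers who want to try the string-matching route mentioned above, here is a minimal sketch. It assumes the training documents are a plain Python list and that the representative documents are available as a topic-to-strings mapping (the shape I would expect from get_representative_docs(); here it is a hand-written, hypothetical stand-in). All document texts are made up for illustration.

```python
from collections import defaultdict

# Hypothetical training documents, in the order they were passed to the model.
docs = [
    "cats purr and sleep",
    "stocks fell sharply",
    "cats chase mice",
    "stocks fell sharply",  # exact duplicate of index 1
]

# Hand-written stand-in for a topic -> representative documents mapping.
representative_docs = {
    0: ["cats chase mice", "cats purr and sleep"],
    1: ["stocks fell sharply"],
}

# Map each document string to every position where it occurs.
positions = defaultdict(list)
for i, doc in enumerate(docs):
    positions[doc].append(i)

# Look up the positions of each representative document.
# A text that is missing from `docs` maps to an empty list.
representative_indices = {
    topic: [positions.get(text, []) for text in texts]
    for topic, texts in representative_docs.items()
}
print(representative_indices)
```

Exact string matching cannot tell duplicate documents apart, which is why each representative document maps to a list of candidate positions rather than a single index.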

Are the representative docs special (e.g. cluster centers or similar) or just random samples of documents from that topic?

clstaudt avatar Oct 12 '23 13:10 clstaudt

They are calculated by taking a random subset (500 documents) from each cluster and calculating their c-TF-IDF representations. Then, their cosine similarity is calculated with respect to the topic c-TF-IDF matrices. The most similar documents are selected and a small diversity is applied to prevent duplicates. You can find the full code here:

https://github.com/MaartenGr/BERTopic/blob/62e97ddea6cdcf9e4da25f9eaed478b22a9f9e20/bertopic/_bertopic.py#L3441

MaartenGr avatar Oct 12 '23 14:10 MaartenGr
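The selection procedure described above can be sketched roughly as follows. This is not BERTopic's code: plain word counts stand in for the c-TF-IDF representations, the diversity step is left out, and every name (representative_indices, nr_docs, nr_samples) is hypothetical. The sketch returns positions in the training list rather than strings, which is the piece of information this thread is after.

```python
import math
import random
from collections import Counter


def bag_of_words(text):
    """Term-count vector for one text (a stand-in for c-TF-IDF weights)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def representative_indices(docs, topics, topic, nr_docs=3, nr_samples=500, seed=0):
    """Return indices into `docs` of the sampled documents most similar to `topic`."""
    rng = random.Random(seed)
    members = [i for i, t in enumerate(topics) if t == topic]
    # Random subset of the cluster (all members if the cluster is small).
    sample = rng.sample(members, min(nr_samples, len(members)))
    # Topic vector: word counts over all documents of the topic joined together.
    topic_vec = bag_of_words(" ".join(docs[i] for i in members))
    # Rank the sampled documents by similarity to the topic vector.
    ranked = sorted(
        sample,
        key=lambda i: cosine(bag_of_words(docs[i]), topic_vec),
        reverse=True,
    )
    return ranked[:nr_docs]


# Made-up documents and topic assignments for illustration.
docs = [
    "cats purr and cats sleep",
    "stocks fell sharply today",
    "cats chase mice",
    "markets and stocks rallied",
    "dogs bark loudly",
]
topics = [0, 1, 0, 1, 0]
top = representative_indices(docs, topics, topic=0, nr_docs=2)
print(top, [docs[i] for i in top])
```

Because the ranking works on positions from the start, no string matching is needed afterwards to find the documents in the training list.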

What's wrong with this approach?:

import pandas as pd

# Keep only the unique document IDs, which are assigned upstream, before BERTopic.
# Each document (row) has its own unique Doc_ID key.
document_ids = document_df[['Doc_ID']].copy()

# Reset the index so that it starts at zero and follows the order passed to BERTopic.
document_ids.reset_index(drop=True, inplace=True)

# Calculate the embeddings and run BERTopic on the documents to obtain `topics`.

# Append the resulting topics (cluster IDs) back to the documents.
topics_df = pd.DataFrame(topics, columns=['Topics'])
document_ids['Topics'] = topics_df['Topics']

# Alternative to the assignment above:
new_df = pd.merge(document_ids, topics_df, left_index=True, right_index=True, how='inner')

# Then LEFT JOIN new_df back to the original document_df on key = Doc_ID, etc.

I'm proposing this approach because, in my case, my documents go through a data-prepping and filtering process in which some document rows (sentences) don't survive to be processed downstream in BERTopic. This explains the reset_index() step: the original sequential index becomes disjoint along the way, whereas the BERTopic cluster index does not.

That being said, I am curious whether docs.index() can also be used to append BERTopic results back to the original documents dataframe, for each separate document (row).

taylorshobe avatar Apr 02 '24 03:04 taylorshobe
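The positional alignment this approach relies on can be illustrated without pandas. The sketch below assumes, as the post does, that the model returns one topic per document in the order the documents were passed in. The records, Doc_ID values, and topic list are all made up, and the topic list is a hard-coded stand-in for real model output.

```python
# Hypothetical original records; Doc_ID is assigned before any filtering.
original = [
    {"Doc_ID": "a1", "text": "cats purr and sleep"},
    {"Doc_ID": "a2", "text": ""},  # dropped by the filter below
    {"Doc_ID": "a3", "text": "stocks fell sharply"},
    {"Doc_ID": "a4", "text": "cats chase mice"},
]

# Filtering step: only non-empty texts reach the topic model.
kept = [row for row in original if row["text"]]
docs = [row["text"] for row in kept]

# Stand-in for the per-document topic list a fitted model would return;
# topics[i] belongs to docs[i], so positions line up with `kept`.
topics = [0, 1, 0]
assert len(topics) == len(docs)

# Positional alignment gives a Doc_ID -> topic lookup.
topic_by_id = {row["Doc_ID"]: topic for row, topic in zip(kept, topics)}

# Left join back onto the original records; filtered-out rows get None.
joined = [{**row, "Topic": topic_by_id.get(row["Doc_ID"])} for row in original]
for row in joined:
    print(row["Doc_ID"], row["Topic"])
```

Joining on Doc_ID rather than on text also sidesteps a limitation of docs.index(text): it returns only the first position at which a text occurs, so duplicated documents would all resolve to the same row.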

What's wrong with this approach?:

I'm not seeing any errors, so this might work. Do note that I purposefully pointed to the _extract_representative_docs method, since it does not need strings to be matched. Preprocessing should not be relevant here, since we are merely interested in the indices that are returned, which you can then match to the original documents you had before processing.

MaartenGr avatar Apr 03 '24 06:04 MaartenGr