
Topic Modelling on longer documents

Open · fojackson8 opened this issue Sep 12 '23 · 8 comments

I'm hoping to use BERTopic for extracting topics from longer documents: >10 documents, each containing ~3-20 pages of text. Are there any special methods, tips or documentation on how to use BERTopic for such use cases?

Thanks for the excellent package!

fojackson8 avatar Sep 12 '23 11:09 fojackson8

I would advise converting those documents to sentences or paragraphs first before sending them to BERTopic. Since such large documents are likely to contain multiple topics, splitting them up would definitely help.
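
A minimal sketch of that splitting step, assuming NLTK for sentence tokenization; `long_documents` is a hypothetical list of full document strings:

```python
# A minimal sketch: split long documents into sentences before fitting BERTopic.
# `long_documents` is a hypothetical list of full document strings.
import nltk
from nltk.tokenize import sent_tokenize
from bertopic import BERTopic

nltk.download("punkt")  # sentence tokenizer model, if not already present

sentences, doc_ids = [], []
for doc_id, doc in enumerate(long_documents):
    for sentence in sent_tokenize(doc):
        sentences.append(sentence)
        doc_ids.append(doc_id)  # remember which document each sentence came from

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(sentences)
```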

MaartenGr avatar Sep 12 '23 12:09 MaartenGr

Say we want to extract topics for the document as a whole. So we:

  1. Split into individual paragraphs
  2. Extract topics for each paragraph independently
  3. Somehow combine topics across paragraphs to get document-level topics.

I'm interested in how best to perform step 3. If we take this approach, I guess we'd end up with paragraph-level topic weightings, and it's not clear how best to combine these into an overall document-level result, i.e. what are the most prevalent topics in the document overall? I'm sure I can come up with an approach, but I wanted to check whether there's a recommended approach to this (probably fairly common) problem, or any examples you could point to.

fojackson8 avatar Sep 12 '23 12:09 fojackson8

You can aggregate the paragraph-level results according to the length of each piece of text. The document's topic distribution is then simply the percentage of its text that is assigned to each topic.
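
A minimal sketch of that length-weighted aggregation, assuming each paragraph already has an assigned topic (all variable names here are hypothetical):

```python
from collections import defaultdict

def document_topic_distribution(paragraphs, topics, doc_ids):
    """Length-weighted share of each topic per document, normalized to sum to 1."""
    weights = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for text, topic, doc_id in zip(paragraphs, topics, doc_ids):
        weights[doc_id][topic] += len(text)  # weight each topic by character length
        totals[doc_id] += len(text)
    return {
        doc_id: {topic: w / totals[doc_id] for topic, w in topic_weights.items()}
        for doc_id, topic_weights in weights.items()
    }
```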

MaartenGr avatar Sep 12 '23 15:09 MaartenGr

@MaartenGr Is there a code example for this aggregation step?

clstaudt avatar Sep 20 '23 08:09 clstaudt

@clstaudt There isn't, but it should be relatively straightforward. You could save the results in a dataframe containing the sentences, their assigned topics, and the ID of their source document. Then simply count how often each topic appears in each document based on its sentences. Other than that, you could look at using .approximate_distribution.
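
A minimal sketch of that counting step, reusing the hypothetical `sentences`, `topics`, and `doc_ids` from a sentence-level fit:

```python
import pandas as pd

df = pd.DataFrame({"sentence": sentences, "topic": topics, "doc_id": doc_ids})

# Share of each topic per document, based on sentence counts
doc_topics = (
    df.groupby("doc_id")["topic"]
      .value_counts(normalize=True)
      .rename("share")
      .reset_index()
)

# Alternatively, approximate_distribution works on the full documents directly
# and returns a document-by-topic matrix:
# topic_distr, _ = topic_model.approximate_distribution(long_documents)
```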

MaartenGr avatar Sep 20 '23 12:09 MaartenGr

Take a look at both of these; they helped me a ton:

https://medium.com/@armandj.olivares/using-bert-for-classifying-documents-with-long-texts-5c3e7b04573d

https://arxiv.org/abs/1910.10781

maticar92 avatar Oct 21 '23 03:10 maticar92

@MaartenGr I am interested in this. But one question: it seems to me that another good approach would be to

  1. split a long document into sentences
  2. compute embeddings for all of these sentences
  3. compute the mean of these embeddings as the "representative" embedding for the document

How can this be performed with bertopic? Does that make sense? Thanks!

randomgambit avatar Jan 11 '24 04:01 randomgambit

@randomgambit Splitting long documents is generally preferred before passing them to BERTopic. Be careful with averaging the embeddings, though, as the mean embedding might be muddled if the sentences contain distinctly different topics.
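
If you do want to try it, here is a minimal sketch (assuming sentence-transformers; `long_documents` is again a hypothetical list of full documents) that passes the precomputed mean embeddings to fit_transform:

```python
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# One embedding per document: the mean of its sentence embeddings
doc_embeddings = np.vstack([
    encoder.encode(sent_tokenize(doc)).mean(axis=0)
    for doc in long_documents
])

topic_model = BERTopic(embedding_model=encoder)
topics, probs = topic_model.fit_transform(long_documents, embeddings=doc_embeddings)
```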

MaartenGr avatar Jan 11 '24 06:01 MaartenGr