Topic Modelling on longer documents
I'm hoping to use BERTopic for extracting topics from longer documents: >10 documents, each containing ~3-20 pages of text. Are there any special methods, tips or documentation on how to use BERTopic for such use cases?
Thanks for the excellent package!
I would advise converting those documents to sentences or paragraphs first before sending them to BERTopic. Since such large documents are likely to contain multiple topics, splitting them up would definitely help.
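As a rough sketch of the splitting step (variable names here are just illustrative; splitting on blank lines gives paragraphs, and a sentence tokenizer such as `nltk.sent_tokenize` would slot in the same way):

```python
docs = ["First long document ...", "Second long document ..."]

paragraphs = []
doc_ids = []  # remember which document each paragraph came from
for doc_id, doc in enumerate(docs):
    for paragraph in doc.split("\n\n"):
        paragraph = paragraph.strip()
        if paragraph:
            paragraphs.append(paragraph)
            doc_ids.append(doc_id)

# `paragraphs` can now be passed to BERTopic in place of the full documents.
```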
Say we want to extract topics for the document as a whole. So we:
- Split into individual paragraphs
- Extract topics for each paragraph independently
- Somehow combine topics across paragraphs, to get document level topics.
I'm interested in how best to perform step 3. I guess we'd end up with paragraph-level topic weightings if we take this approach, and it's not clear how best to combine these paragraph-level results into an overall document-level view, i.e. what are the most prevalent topics in the document overall? I'm sure I can come up with an approach, but wanted to check whether there's a recommended approach for this (probably fairly common) problem, or any examples you could point to?
You can aggregate the distribution according to the length of the text. The topic distribution is then simply the percentage of text that is classified as each topic.
@MaartenGr Is there a code example for this aggregation step?
@clstaudt There isn't, but it should be relatively straightforward. You could save the results in a dataframe which would have the sentences with their assigned topics and the ID of their document. Then, simply count how often a topic appears in each document based on the collection of sentences. Other than that, you could look at using `.approximate_distribution`.
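As a rough illustration of the dataframe approach (a sketch, reusing the hypothetical `paragraphs` and `doc_ids` lists from the splitting example above):

```python
import pandas as pd
from bertopic import BERTopic

# Fit BERTopic on the paragraph-level texts
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(paragraphs)

df = pd.DataFrame({
    "Document": doc_ids,
    "Topic": topics,
    "Length": [len(p) for p in paragraphs],
})

# Count how often each topic appears per document ...
counts = df.groupby(["Document", "Topic"]).size().unstack(fill_value=0)

# ... or weight by text length, so each row becomes the percentage of a
# document's text assigned to each topic
weighted = df.groupby(["Document", "Topic"])["Length"].sum().unstack(fill_value=0)
distribution = weighted.div(weighted.sum(axis=1), axis=0)
```

In recent BERTopic versions, `.approximate_distribution()` can also be called on the full documents directly to get a per-document topic distribution without doing the aggregation yourself.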
Take a look at both of these. They helped me a ton: https://medium.com/@armandj.olivares/using-bert-for-classifying-documents-with-long-texts-5c3e7b04573d
https://arxiv.org/abs/1910.10781
@MaartenGr I am interested in this. But one question: it seems to me that another good approach would be to
- split a long document into sentences
- compute embeddings for all of these sentences
- compute the mean embedding for the document as the "representative" embedding for the document
How can this be performed with BERTopic? Does that make sense?
Thanks!
@randomgambit Splitting long documents is generally preferred before passing them to BERTopic. Be careful with merging the embeddings, though, as the mean embedding might get muddled if the sentences cover distinctly different topics.
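For reference, a sketch of what that mean-embedding approach could look like (the `sentences_per_doc` structure is made up for the example, but `fit_transform` does accept precomputed embeddings):

```python
import numpy as np
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Hypothetical input: each document already split into a list of sentences
sentences_per_doc = [
    ["First sentence of doc 1.", "Second sentence of doc 1."],
    ["First sentence of doc 2.", "Second sentence of doc 2."],
]
docs = [" ".join(sentences) for sentences in sentences_per_doc]

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Mean of the sentence embeddings as each document's representative embedding
doc_embeddings = np.array([
    embedder.encode(sentences).mean(axis=0)
    for sentences in sentences_per_doc
])

# Pass the precomputed embeddings alongside the documents
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings=doc_embeddings)
```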