Topic Modelling on longer documents
I'm hoping to use BERTopic for extracting topics from longer documents: >10 documents, each containing ~3-20 pages of text. Are there any special methods, tips or documentation on how to use BERTopic for such use cases?
Thanks for the excellent package!
I would advise converting those documents to sentences or paragraphs first before sending them to BERTopic. Since such large documents are likely to contain multiple topics, splitting them up would definitely help.
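As a rough sketch of the splitting step (variable names here are just illustrative; splitting on blank lines gives paragraphs, and a sentence tokenizer such as `nltk.sent_tokenize` would slot in the same way):

```python
docs = ["First long document ...", "Second long document ..."]

paragraphs = []
doc_ids = []  # remember which document each paragraph came from
for doc_id, doc in enumerate(docs):
    for paragraph in doc.split("\n\n"):
        paragraph = paragraph.strip()
        if paragraph:
            paragraphs.append(paragraph)
            doc_ids.append(doc_id)

# `paragraphs` can now be passed to BERTopic in place of the full documents.
```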
Say we want to extract topics for the document as a whole. So we:
- Split into individual paragraphs
- Extract topics for each paragraph independently
- Somehow combine topics across paragraphs, to get document level topics.
I'm interested in how best to perform step 3. I guess we'd end up with paragraph-level topic weightings if we take this approach, and it's not clear how best to combine these paragraph-level results into an overall document-level view, i.e. what are the most prevalent topics in the document overall? I'm sure I can come up with an approach, but wanted to check whether there's a recommended approach for this (probably fairly common) problem, or any examples you could point to?
You can aggregate the distribution according to the length of the text. The topic distribution is then simply the percentage of text that is classified as each topic.
@MaartenGr Is there a code example for this aggregation step?
@clstaudt There isn't, but it should be relatively straightforward. You could save the results in a dataframe which would have the sentences with their assigned topics and the ID of their document. Then, simply count how often a topic appears in each document based on the collection of sentences. Other than that, you could look at using `.approximate_distribution`.
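As a rough illustration of the dataframe approach (a sketch, reusing the hypothetical `paragraphs` and `doc_ids` lists from the splitting example above):

```python
import pandas as pd
from bertopic import BERTopic

# Fit BERTopic on the paragraph-level texts
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(paragraphs)

df = pd.DataFrame({
    "Document": doc_ids,
    "Topic": topics,
    "Length": [len(p) for p in paragraphs],
})

# Count how often each topic appears per document ...
counts = df.groupby(["Document", "Topic"]).size().unstack(fill_value=0)

# ... or weight by text length, so each row becomes the percentage of a
# document's text assigned to each topic
weighted = df.groupby(["Document", "Topic"])["Length"].sum().unstack(fill_value=0)
distribution = weighted.div(weighted.sum(axis=1), axis=0)
```

In recent BERTopic versions, `.approximate_distribution()` can also be called on the full documents directly to get a per-document topic distribution without doing the aggregation yourself.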
Take a look at both of these. They helped me a ton: https://medium.com/@armandj.olivares/using-bert-for-classifying-documents-with-long-texts-5c3e7b04573d
https://arxiv.org/abs/1910.10781
@MaartenGr I am interested in this. But one question: it seems to me that another good approach would be to
- split a long document into sentences
- compute embeddings for all of these sentences
- compute the mean embedding for the document as the "representative" embedding for the document
How can this be performed with BERTopic? Does that make sense?
Thanks!
@randomgambit Splitting long documents is generally preferred before passing them to BERTopic. Be careful with merging the embeddings, though, as the mean embedding might get muddled if the sentences cover distinctly different topics.
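For reference, a sketch of what that mean-embedding approach could look like (the `sentences_per_doc` structure is made up for the example, but `fit_transform` does accept precomputed embeddings):

```python
import numpy as np
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Hypothetical input: each document already split into a list of sentences
sentences_per_doc = [
    ["First sentence of doc 1.", "Second sentence of doc 1."],
    ["First sentence of doc 2.", "Second sentence of doc 2."],
]
docs = [" ".join(sentences) for sentences in sentences_per_doc]

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Mean of the sentence embeddings as each document's representative embedding
doc_embeddings = np.array([
    embedder.encode(sentences).mean(axis=0)
    for sentences in sentences_per_doc
])

# Pass the precomputed embeddings alongside the documents
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings=doc_embeddings)
```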