BERTopic icon indicating copy to clipboard operation
BERTopic copied to clipboard

Fixed issue #2144

Open PipaFlores opened this issue 1 year ago • 1 comments

What does this PR do?

Fixes #2144 (issue) by optimizing the topic extraction process when using fit_transform() with nr_topics="auto" or int for reducing topics. The main improvements are:

  1. Avoiding double calculation of topic representations, especially when using LLM-based representations.
  2. Introducing a calculate_representation parameter in the _extract_topics method to control when representations are calculated.
  3. Modifying the fit_transform method to calculate representations only after topic reduction (if needed).

These changes significantly improve performance, particularly when using computationally expensive representation models like LLMs.

Additionally, this PR addresses an edge case where self.nr_topics >= initial_nr_topics. In this scenario, the current implementation extracts topics but doesn't sort the mappings by frequency. This results in a list of topics with unsorted indexes. While topic information is still displayed sorted by frequencies, the topic indexes may not match their frequency rank.

This is easily solvable by adding self._sort_mappings_by_frequency(documents) in def _reduce_topics() in line 4369. But runs into problems when pytest tests\test_representation\test_representations.py tests for the edge case self.nr_topics >= initial_nr_topics Thus, some fix in how this edge cases are tested is required.

Before submitting

  • [ ] This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
    • Extra log messages were added to improve debugging
  • [x] Did you read the contributor guideline?
  • [x] Was this discussed/approved via a Github issue? Please add a link to it if that's the case.
  • [x] Did you make sure to update the documentation with your changes (if applicable)?
  • [x] Did you write any new necessary tests?
    • Test is successful as it is now. However, they do not account for the issue described above

PipaFlores avatar Oct 14 '24 14:10 PipaFlores

Ok, so I solved the issue with the test_representations.py::test_topic_reduction_edge_cases

It just needed one line of code to old_documents = topic_model._sort_mappings_by_frequency(old_documents) Therefore, the test will actually check for ordered, and unchanged, old/new topics

I hope is enough and self-explanatory. I forced pushed into the same commit to keep git history clean

PipaFlores avatar Oct 15 '24 09:10 PipaFlores

Awesome PR! Very clean and focused on the most important components here. I left a couple of minor questions that have mostly to do with variable naming.

One last question is whether that edge case still appears that we talked about previously. Was it fully prevented by adding what you showed here;

            else:
                logger.info(
                    f"Topic reduction - Number of topics ({self.nr_topics}) is equal or higher than the clustered topics({len(self.get_topics())})."
                )
                documents = self._sort_mappings_by_frequency(documents)
                self._extract_topics(documents, verbose=self.verbose)
                return documents

Yes, that solves the edge case. The issue I had before was that test_representation.py would raise a false negative error, but i got that fixed by editing the test.

Regarding the other comments:

I think there is a higher level confusion between the documented modularity and the current logging verbosity.

In the code, and logger calls, the pipeline is defined as: Embeddings -> Dim. Red. -> Clustering -> Representation

However, in the documentation and paper, the pipeline is more detailed for the later steps as : Embeddings -> Dim. Red. -> Clustering -> Bag of Words (vectorizer + c-TF-IDF) -> Fine Tuning Representations

I could suggest a modification for the logging, if you would allow it, and put it here or in other PR. However, this requires some conceptual clarifications:

  1. When are topics "defined": In my head, topics are defined when clustering, as documents are assigned a topic ID. However, it also possible to make the case that the topics are defined at the end of the c-TF-IDF procedure, when we have a definition (based on top-n-keywords) for each topic. However, I can also imagine that the c-TF-IDF is not defining the topic itself, but just defining its "representation", hence the current logging of "Representation - Extracting topics from clusters" might make sense.

Also the word "extract" is a bit ambiguous for me.

  1. What counts as a representation? Do you consider c-TF-IDF a representation model (in this case, the base representation, as it is always calculated)? Since its a core part of the pipeline and is it always calculated, I would assign this procedure to its own category and make a distance from "representations" (as it is described also in the docs). Currently, there is no logging or mention of cTFIDF or bag-of-words in the code when verbose=true, as the log just skips from: BERTopic - Cluster - Completed ✓ BERTopic - Representation - Extracting topics from clusters using representation models.
    BERTopic - Representation - Completed ✓

I would propose aligning the logging to the documentation pipeline. As a rough example of a fit_transform with verbosity (ignoring the Completed ✓ for brevity): Embeddings - Transforming documents to embeddings Dim. Red. -> Fitting the dimensionality reduction algorithm Clustering -> Clustering the reduced embeddings Bag of Words -> Calculating c-TF-IDF base representation of topics Fine Tuning Representation -> Fine-tuning using representation model. Fine Tuning Representation -> Calculating different aspects.

As a non-CS researcher, I find it very helpful to understand and adapt the different steps of the implementation through the logging, that is also why I added the extra verbosity in self._extract_topics(documents, mappings=mappings, verbose=self.verbose)

Long story short, If what I reflected above make sense I can present some proposal on how to align the logger to the documented pipeline. This will probably help solve the other comments you made.

Otherwise, I can make small modifications for the specific comments you made, and come back with my logging suggestions in a new issue or PR

PipaFlores avatar Oct 21 '24 08:10 PipaFlores

Ok I closed this PR because i found a much cleaner solution. I will make a new one, you will notice it has a minimal change that might not even require major review. I will open the logging discussion as a new issue, to distinguish the two issues and make my point better.

PipaFlores avatar Oct 21 '24 11:10 PipaFlores