Pablo Flores comments

Results: 16 comments of Pablo Flores

Align Logger Output with Documented Pipeline

So I made a draft now. The logging looks like:
> 2024-10-28 10:18:43,031 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
> 2024-10-28 10:18:54,653 - BERTopic - Dimensionality -...
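For readers who want to reproduce that log layout, here is a minimal sketch using only Python's standard `logging` module. The format string is inferred from the quoted lines (timestamp, logger name, then a message that starts with the pipeline stage); the helper name `make_logger` is hypothetical, and this is not claimed to be how BERTopic configures its own logger.

```python
import io
import logging

# Assumed layout, inferred from the quoted log lines:
# "<timestamp> - <logger name> - <stage> - <message>"
LOG_FORMAT = "%(asctime)s - %(name)s - %(message)s"


def make_logger(stream, name="BERTopic"):
    """Hypothetical helper: build a logger that writes to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
logger = make_logger(buffer)

# The stage name is part of the message here (an assumption).
logger.info("Dimensionality - Fitting the dimensionality reduction algorithm")

log_line = buffer.getvalue().strip()
print(log_line)
```

The default `asctime` format of `logging.Formatter` already yields the `YYYY-MM-DD HH:MM:SS,mmm` timestamps seen in the excerpt, so no `datefmt` is needed.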

Align Logger Output with Documented Pipeline

> I would prefer to combine the first two Bag of Words loggers. Simply stating that the vectorizer is updated seems enough to me as there is only _preprocess_text between...

Systematic test units for fit_transform()

It might be that it is less dramatic than I portray it to be. In the coverage HTML file for `_bertopic.py` (HTML coverage zip file below), most of the pipeline is...

Systematic test units for fit_transform()

For the 'standard' procedure of `fit_transform()`, couldn't we just use the same sub-sample of the 20 Newsgroups dataset and set a seed for the UMAP? From my understanding, the UMAP is the...

metric= "cosine" error reported

Getting the same error; here is my full log. I'm guessing yours is the same ```python # Define hdbscan clustering model, also different for data types # Pre-defined parameters (min...

Representations from representation_models are generated twice when nr_topics="auto".

Edit 2: I successfully ran the test now. I will check how to do the PR, maybe Monday next week. Edit: I managed to run the tests, and it's failing...

Representations from representation_models are generated twice when nr_topics="auto".

Wouldn't that affect other instances that call `_extract_topics` or `_extract_words_per_topic`? That's why I ended up adding another argument similar to `calculate_aspects=True`. I agree that it looks messier now....

Representations from representation_models are generated twice when nr_topics="auto".

But in that case, it would still calculate the "main" representation (which could be an LLM prompt). As far as I understand, the `calculate_aspects=True` arg is controlling whether additional...

Representations from representation_models are generated twice when nr_topics="auto".

I agree, I will try. But there is an issue when dealing with the case of `nr_topics >= initial_nr_topics`. In that case the representations need to be re-calculated without any...

Representations from representation_models are generated twice when nr_topics="auto".

If we don't have that adaptation, the following happens:
1) User inputs a desired `nr_topics`
2) `_extract_topics(calculate_representations=False)` calculates `initial_nr_topics` (topic size). If the model **can** be reduced, then...
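The control flow discussed in this thread can be illustrated with a toy sketch. It is not BERTopic's implementation. `ToyTopicModel`, its attributes, and the call counter are hypothetical stand-ins, and only integer `nr_topics` values are modelled (not `"auto"`). The first pass skips the expensive representation step. Representations are then computed exactly once, either after the reduction or, when `nr_topics >= initial_nr_topics`, on the unreduced topics.

```python
class ToyTopicModel:
    """Hypothetical stand-in for illustration; not BERTopic's actual code."""

    def __init__(self, initial_nr_topics, nr_topics=None):
        self.initial_nr_topics = initial_nr_topics  # topics found by clustering
        self.nr_topics = nr_topics                  # user-requested topic count
        self.current_nr_topics = None
        self.representation_calls = 0               # counts expensive work

    def _extract_topics(self, nr_topics, calculate_representations=True):
        self.current_nr_topics = nr_topics
        if calculate_representations:
            # Stands in for costly work such as prompting an LLM per topic.
            self.representation_calls += 1

    def fit(self):
        if self.nr_topics is None:
            # No reduction requested: a single full pass.
            self._extract_topics(self.initial_nr_topics)
            return self

        # First pass only establishes topic sizes; representations are skipped.
        self._extract_topics(self.initial_nr_topics,
                             calculate_representations=False)

        if self.nr_topics < self.initial_nr_topics:
            # The model can be reduced: compute representations after reducing.
            self._extract_topics(self.nr_topics)
        else:
            # Nothing to reduce: representations still have to be computed once.
            self._extract_topics(self.initial_nr_topics)
        return self


reduced = ToyTopicModel(initial_nr_topics=10, nr_topics=5).fit()
unreduced = ToyTopicModel(initial_nr_topics=10, nr_topics=20).fit()
print(reduced.representation_calls, unreduced.representation_calls)
```

In both branches the counter ends at one, which is the behaviour the thread is aiming for: no duplicated representation work, and no missing representations when the reduction is a no-op.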