Maarten Grootendorst
> I have been looking into this issue because, on the face of it, it looks like having a large number of -1s is sub-optimal. On the one hand, you...
@drob-xx Indeed, although using `calculate_probabilities=True` is the easiest approach and likely the best fitting, the computation time to run it is definitely a bottleneck!

> The first is about...
> Yes. I mentioned t-SNE because using it to plot 2D points for a visualization gives a good starting place for understanding the distribution/organization of documents as represented in...
@Sghosh023 Yes, you can! The -1 topic is generated through the use of HDBSCAN, which identifies outliers. You could use k-Means instead to generate those 10 topics without the -1...
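As a minimal sketch, swapping k-Means in as the cluster model could look like the snippet below. BERTopic accepts an alternative clusterer through the `hdbscan_model` parameter; the `docs` placeholder and the `random_state` value are assumptions for illustration:

```python
from sklearn.cluster import KMeans
from bertopic import BERTopic

docs = [...]  # your corpus: a list of document strings

# k-Means assigns every document to exactly one of 10 clusters,
# so no outlier (-1) topic is produced.
cluster_model = KMeans(n_clusters=10, random_state=42)
topic_model = BERTopic(hdbscan_model=cluster_model)
topics, _ = topic_model.fit_transform(docs)
```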
There are several ways of approaching it. You can find most of them, using HDBSCAN, in the documentation [here](https://maartengr.github.io/BERTopic/faq.html#how-do-i-reduce-topic-outliers). Essentially, you would be using `calculate_probabilities=True` to create a document-topic probability...
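A rough sketch of that probability-based route, following the FAQ, is below; the threshold value is an assumption you would tune to your data:

```python
import numpy as np
from bertopic import BERTopic

docs = [...]  # your corpus: a list of document strings

# With calculate_probabilities=True, probs has shape (n_docs, n_topics).
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

# Reassign each outlier (-1) document to its most likely topic,
# but only when that probability clears the threshold.
threshold = 0.01  # assumption: illustrative value, tune to your data
new_topics = [
    np.argmax(prob) if max(prob) >= threshold else -1
    for prob in probs
]
```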
Now that I think about it, you could even use k-Means with a large `k` and set `nr_topics="auto"` in BERTopic to reduce them automatically.
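Combined, that might look like the following sketch; the choice of `k=50` is purely illustrative:

```python
from sklearn.cluster import KMeans
from bertopic import BERTopic

# Over-cluster with a deliberately large k, then let BERTopic merge
# similar topics automatically via nr_topics="auto".
topic_model = BERTopic(
    hdbscan_model=KMeans(n_clusters=50, random_state=42),  # assumption: 50 is illustrative
    nr_topics="auto",
)
```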
@Sghosh023 Apologies! No, this model will work best with either sentences or paragraphs but not longer documents. You can find a bunch of other models that work quite well [here](https://www.sbert.net/docs/pretrained_models.html),...
@Sghosh023

> How can I pass that to the BERTopic model & make sure that the embedding creation process doesn't take place using any sentence transformer model?

You can do...
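In short, `fit_transform` accepts precomputed embeddings directly, which skips the sentence-transformer encoding step entirely. A minimal sketch, where the `.npy` file name is a hypothetical stand-in for however you stored your embeddings:

```python
import numpy as np
from bertopic import BERTopic

docs = [...]  # your corpus: a list of document strings
embeddings = np.load("my_embeddings.npy")  # assumption: precomputed, shape (n_docs, dim)

# Passing embeddings means BERTopic will not encode the documents itself.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)
```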
@Sghosh023 This is related to UMAP, which is a stochastic model that generates a different result each time. To control for that, I would advise reading through [this FAQ](https://maartengr.github.io/BERTopic/faq.html#why-are-the-results-not-consistent-between-runs) for...
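The gist of that FAQ entry is fixing UMAP's `random_state` and passing the model in yourself; a sketch is below, where the other UMAP parameter values are illustrative defaults rather than prescriptions:

```python
from umap import UMAP
from bertopic import BERTopic

# A fixed random_state makes UMAP, and therefore BERTopic's results,
# reproducible across runs (at some cost to UMAP's parallelism).
umap_model = UMAP(n_neighbors=15, n_components=5,
                  min_dist=0.0, metric="cosine", random_state=42)
topic_model = BERTopic(umap_model=umap_model)
```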
No problem, glad I could be of help! There are several ways to handle computation on large datasets. First, you can set `low_memory` to True when instantiating BERTopic. This may...
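For instance, a minimal instantiation along those lines, where skipping the full probability matrix is an additional assumption that trades information for speed and memory:

```python
from bertopic import BERTopic

# low_memory=True is forwarded to UMAP and reduces the memory footprint;
# calculate_probabilities=False avoids computing the full document-topic
# probability matrix, which is expensive on large datasets.
topic_model = BERTopic(low_memory=True, calculate_probabilities=False)
```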