DanR

Results: 48 comments by DanR

Not exactly clear what you are asking, but when you call `BERTopic.fit_transform()` it returns two arrays.

```
from bertopic import BERTopic

# docs: list of document strings; embeddings: precomputed document embeddings
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)
```

topics in this case...

@kaifeijidezxb As I indicated, the value returned from `fit_transform` gives you a per-document topic assignment, which sounds like what you want.
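To make the per-document assignment concrete, here is a minimal sketch of grouping documents by their assigned topic. It assumes `topics` is a list of integers aligned one-to-one with `docs`, with `-1` marking outliers. The helper `group_docs_by_topic` and the sample values are hypothetical stand-ins, not part of BERTopic and not real model output.

```python
from collections import defaultdict


def group_docs_by_topic(docs, topics):
    """Group documents by topic id (hypothetical helper, not a BERTopic API)."""
    if len(docs) != len(topics):
        raise ValueError("docs and topics must have the same length")
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)
    return dict(grouped)


# Stand-in values; in practice `topics` would be the first value
# returned by topic_model.fit_transform(docs).
docs = ["cats purr", "stocks fell", "dogs bark", "asdf qwerty"]
topics = [0, 1, 0, -1]

by_topic = group_docs_by_topic(docs, topics)
for topic_id in sorted(by_topic):
    print(topic_id, by_topic[topic_id])
```

Because the assignment is positional, `topics[i]` always refers to `docs[i]`, so no extra lookup is needed to recover which document went where.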

Just FYI, there is an implementation of DBCV built into HDBSCAN: [relative_validity_](https://hdbscan.readthedocs.io/en/latest/api.html). This version does come with a caveat, however:

> This score might not be an objective...

@kjaksic Glad that was helpful. I guess I'm not surprised that lowering the dimensionality reduces outliers - but isn't that just a function of having less accurate data? Is this...

@kjaksic Yup. I'd be interested to see what you come up with. I've had pretty much 0 luck using DBCV or any other metric for that matter. I've started a...

@dimitry12 Oops. I'm new to gists and wound up deleting that link. I'll update it in a moment with something that works. However, since then I've pushed a preliminary [code...

@MaartenGr will be able to be a lot more definitive, but I'll give it a go. So in terms of pre-processing the data it is a bit confusing when new...

I believe the short answer is no. The issue is that the embedding process creates a model of the entire document set and each document is embedded relative to the...

Ahhh... Ok. Good to know. I think I understand my mistake. I recently had occasion to split a corpus into two segments. When I split `BERTopic.umap_model.embedding_` I didn't get the...

Each run of BERTopic will create slightly different outputs because some of the underlying algorithms are stochastic. There are also differences between which documents get assigned to which topics, but...
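One rough way to quantify how much two runs differ is to compare them pair by pair rather than label by label, since topic numbers need not line up between runs. The sketch below computes the fraction of document pairs that both runs treat the same way (a plain Rand index). The function name `pairwise_agreement` and the sample labels are hypothetical and not taken from BERTopic. Note that it treats `-1` as an ordinary label, so two outliers count as "together".

```python
from itertools import combinations


def pairwise_agreement(run_a, run_b):
    """Fraction of document pairs grouped the same way in both runs (Rand index)."""
    if len(run_a) != len(run_b):
        raise ValueError("both runs must label the same documents")
    pairs = list(combinations(range(len(run_a)), 2))
    if not pairs:
        return 1.0
    # A pair "agrees" when both runs put it together, or both split it apart.
    agree = sum(
        (run_a[i] == run_a[j]) == (run_b[i] == run_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Illustrative labels only, not real model output.
run_1 = [0, 0, 1, 1, -1]
run_2 = [1, 1, 0, 0, 0]
score = pairwise_agreement(run_1, run_2)
print(f"pairwise agreement: {score:.2f}")
```

A score of 1.0 means the two runs produced the same grouping up to renaming of topics; lower values mean more documents moved between groups. The pair enumeration is quadratic in the number of documents, so for a large corpus you would want to sample pairs instead.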