DanR comments

Results: 48 comments of DanR

How do I get the topic of each document in the data

Not exactly clear what you are asking, but when you call `BERTopic.fit_transform()` it returns two arrays.

```
from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)
```

topics in this case...

How do I get the topic of each document in the data

@kaifeijidezxb As I indicated, the value returned from `fit_transform` gives you a per-document topic assignment, which sounds like what you want.
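To make the per-document assignment concrete, here is a minimal sketch. It assumes `topics` is aligned index-for-index with `docs`, which is how `fit_transform` returns it. The helper names `pair_docs_with_topics` and `fit_and_pair` are hypothetical and only for illustration; they are not part of BERTopic.

```python
from typing import List, Sequence, Tuple


def pair_docs_with_topics(
    docs: Sequence[str], topics: Sequence[int]
) -> List[Tuple[str, int]]:
    """Pair each document with the topic id found at the same index."""
    if len(docs) != len(topics):
        raise ValueError("docs and topics must have the same length")
    return list(zip(docs, topics))


def fit_and_pair(docs: Sequence[str]) -> List[Tuple[str, int]]:
    """Fit BERTopic on docs and return (document, topic id) pairs.

    Hypothetical wrapper: bertopic is imported lazily because it is a
    heavy optional dependency.
    """
    from bertopic import BERTopic

    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(list(docs))
    return pair_docs_with_topics(docs, topics)
```

In BERTopic a topic id of -1 marks a document that was treated as an outlier, so those pairs may need separate handling.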

DBCV coefficient

Just FYI, there is an implementation of DBCV built into HDBSCAN: [relative_validity](https://hdbscan.readthedocs.io/en/latest/api.html). This version does come with a caveat, however:

> This score might not be an objective...
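A rough sketch of reading that score, under a few assumptions: the attribute is exposed on a fitted clusterer as `relative_validity_`, and it is only available when the model was built with `gen_min_span_tree=True`. The helper names `relative_validity` and `best_setting` are hypothetical. As I understand the docs, the score is meant for comparing hyperparameter choices on the same data rather than as an absolute quality measure, so the sketch only ranks settings against each other.

```python
from typing import Dict


def relative_validity(data, min_cluster_size: int = 5) -> float:
    """Fit HDBSCAN on array-like data and return its relative validity score.

    hdbscan is imported lazily because it is an optional dependency.
    """
    import hdbscan

    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=min_cluster_size,
        gen_min_span_tree=True,  # needed so the score can be computed
    )
    clusterer.fit(data)
    return float(clusterer.relative_validity_)


def best_setting(scores: Dict[int, float]) -> int:
    """Return the min_cluster_size whose score is highest."""
    return max(scores, key=scores.get)
```

A sweep could then look like `scores = {k: relative_validity(X, k) for k in (5, 10, 15)}` followed by `best_setting(scores)`, where `X` is whatever reduced embedding you are clustering.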

DBCV coefficient

@kjaksic Glad that was helpful. I guess I'm not surprised that lowering the dimensionality reduces outliers - but isn't that just a function of having less accurate data? Is this...

DBCV coefficient

@kjaksic Yup. I'd be interested to see what you come up with. I've had pretty much 0 luck using DBCV or any other metric for that matter. I've started a...

DBCV coefficient

@dimitry12 Oops. I'm new to gists and wound up deleting that link. I'll update it in a moment with something that works. However, since then I've pushed a preliminary [code...

Inferior Performance without Stopwords Removal

@MaartenGr will be able to be a lot more definitive, but I'll give it a go. So in terms of pre-processing the data it is a bit confusing when new...

Running batches and aggregating results

I believe the short answer is no. The issue is that the embedding process creates a model of the entire document set and each document is embedded relative to the...

Running batches and aggregating results

Ahhh... Ok. Good to know. I think I understand my mistake. I recently had occasion to split a corpus into two segments. When I split `BERTopic.umap_model.embedding_` I didn't get the...

multiple topics containing same words only in a different order

Each run of BERTopic will create slightly different outputs because some of the underlying algorithms are stochastic. There are also differences between which documents get assigned to which topics, but...
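If run-to-run variation is a problem, one common approach is to fix the seed of the UMAP step and pass that model into BERTopic. The sketch below assumes the `umap-learn` and `bertopic` packages are installed; the parameter values are illustrative rather than recommended settings, and the function names `umap_settings` and `build_seeded_model` are hypothetical.

```python
from typing import Any, Dict


def umap_settings(seed: int = 42) -> Dict[str, Any]:
    """Illustrative UMAP keyword arguments with a fixed random seed."""
    return {
        "n_neighbors": 15,
        "n_components": 5,
        "min_dist": 0.0,
        "metric": "cosine",
        "random_state": seed,  # fixing this makes the reduction repeatable
    }


def build_seeded_model(seed: int = 42):
    """Return a BERTopic model whose UMAP step uses a fixed seed.

    Both libraries are imported lazily because they are heavy dependencies.
    """
    from bertopic import BERTopic
    from umap import UMAP

    return BERTopic(umap_model=UMAP(**umap_settings(seed)))
```

Fixing `random_state` typically makes UMAP run more slowly, since it gives up some parallelism in exchange for repeatable output.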