BERTopic
DBCV coefficient
Hi,
Instead of calculating the silhouette score as a measure of cluster coherence, I was thinking about calculating DBCV (density-based cluster validation), since it is considered more suitable for evaluating non-spherical clusters. I am planning to use the DBCV package for that: https://github.com/christopherjenness/DBCV.
In the discussion about the silhouette score, you proposed dropping outliers from the analysis, since they do not represent an actual cluster: https://github.com/MaartenGr/BERTopic/issues/428. However, if I understood the concept correctly, they should be included in the calculation of the DBCV score, since it assumes there is noise in the data. So my code would look like this:
import numpy as np
from scipy.spatial.distance import euclidean
from DBCV import DBCV  # from https://github.com/christopherjenness/DBCV; import path may vary by install

umap_embeddings = topic_model.umap_model.transform(embeddings)
X = umap_embeddings
labels = np.array(topics)  # keep all documents, outliers (-1) included, since DBCV accounts for noise

dbcv_score = DBCV(X, labels, dist_function=euclidean)
print(dbcv_score)
Thanks.
Just FYI, there is an implementation of DBCV built into HDBSCAN: the relative_validity_ attribute. This version does come with a caveat, however:
This score might not be an objective measure of the goodness of clustering. It may only be used to compare results across different choices of hyper-parameters, and is therefore only a relative score.
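For reference, pulling that score out of a fitted clusterer looks roughly like this (just a sketch; the variable names are mine, and gen_min_span_tree=True is required or the attribute is not computed):

import hdbscan

# reduced_embeddings: whatever you feed the clusterer, e.g. the UMAP output above
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, gen_min_span_tree=True)
clusterer.fit(reduced_embeddings)
print(clusterer.relative_validity_)  # fast, relative DBCV approximation

If you pass that same HDBSCAN object to BERTopic via hdbscan_model=..., the fitted instance should be reachable afterwards as topic_model.hdbscan_model.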
However, in my experience so far, using metrics to judge the goodness of HDBSCAN results has been unconvincing. The best generalized approach that seems to be working is to look at the overall results of cluster formation (the number of clusters and the number of outliers). What I find is that there are 'natural' cluster sizes that appear. These can then be run and compared. See #582 for some parallel discussion; I've also pushed a gist that is (I think) a reasonably straightforward methodology for analyzing HDBSCAN outputs for these purposes.
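For what it's worth, the summary I compare across runs can be pulled straight from the topic assignments (a rough sketch, assuming topics is the list returned by fit_transform):

import numpy as np

topics_arr = np.array(topics)               # labels from topic_model.fit_transform(docs)
n_clusters = len(set(topics_arr)) - (1 if -1 in topics_arr else 0)
n_outliers = int((topics_arr == -1).sum())  # -1 is the outlier "topic"
print(f"{n_clusters} clusters, {n_outliers} outliers out of {len(topics_arr)} docs")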
@drob-xx Thank you! The HDBSCAN implementation runs really fast. I do indeed need DBCV to decide on the optimal hyper-parameters (as one of the criteria).
Thank you for those additional resources. Dealing with the outliers seems to be the trickiest part of BERTopic. I am currently working with social media data, where the documents are relatively short, and from my experience UMAP dimensionality plays a big role in those cases, not just HDBSCAN. Lowering the dimensionality (from 5 to 2 or 3) seems to reduce the number of outliers, but I have to carefully check the topic quality and decide which combination of parameters works best.
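In case it helps anyone reading along, this is roughly how I swap in a lower-dimensional UMAP model (the parameter values here are just illustrative, not recommendations):

from umap import UMAP
from bertopic import BERTopic

umap_model = UMAP(n_components=3,      # try 2, 3, 5, ...
                  n_neighbors=15,
                  min_dist=0.0,
                  metric="cosine",
                  random_state=42)
topic_model = BERTopic(umap_model=umap_model)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())  # the Topic -1 row shows the outlier count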
@kjaksic Glad that was helpful. I guess I'm not surprised that lowering the dimensionality reduces outliers - but isn't that just a function of having less accurate data? Is this a good thing? How can you be sure that the outliers are being categorized 'correctly', whatever that means in your context?
@drob-xx I agree that a decrease in the number of outliers does not necessarily mean the model is better. That is why I am looking at other criteria as well, including DBCV, to compare the models and see how the hyper-parameters influence the cluster structure.
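Roughly what I have in mind, sketched with HDBSCAN's built-in relative score (the parameter grid and the reduced_embeddings name are placeholders):

import hdbscan

results = []
for mcs in (10, 25, 50, 100):
    clusterer = hdbscan.HDBSCAN(min_cluster_size=mcs,
                                gen_min_span_tree=True).fit(reduced_embeddings)
    results.append({
        "min_cluster_size": mcs,
        "dbcv": clusterer.relative_validity_,
        "clusters": int(clusterer.labels_.max()) + 1,
        "outliers": int((clusterer.labels_ == -1).sum()),
    })
for row in results:
    print(row)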
@kjaksic Yup. I'd be interested to see what you come up with. I've had pretty much 0 luck using DBCV or any other metric for that matter. I've started a broader discussion at #600 if you care to join in.
See #582 for some parallel discussion; I've also pushed a gist that is (I think) a reasonably straightforward methodology for analyzing HDBSCAN outputs for these purposes.
@drob-xx can you please repost the gist link? The one above does not work. Thanks!
@dimitry12 Oops. I'm new to gists and wound up deleting that link. I'll update it in a moment with something that works. However, since then I've pushed a preliminary code solution which is (I think) easier to use, so I suggest you start with that. I've also started a discussion thread around it at #635; if you have feedback, I'd love to hear your experience/thoughts.
Due to inactivity, I'll be closing this for now. Let me know if you have any other questions related to this and I'll make sure to re-open the issue!