hdbscan
Any advice on creating clusters based on new data and disambiguating them from existing clusters?
Hey folks,
Background:
- In my company we seek to employ HDBSCAN as part of a tool to discover clusters of anomalous entities. To that end, we first run an ensemble of anomaly detection algorithms, then take all anomalous entities and use HDBSCAN to see if there are any meaningful clusters of anomalous entities behaving similarly.
- We run this pipeline every day and want to see if any new clusters have appeared, while retaining the old clusters. We also reassign entities to new clusters in case their behavior has changed.
- Thus, it is not enough for us to use approximate_predict on an already trained HDBSCAN. We not only want to predict existing clusters for new data points, but also want to see whether new clusters have appeared and disambiguate them from existing clusters.
- But if we just train a new HDBSCAN model on the new data, and let's say yesterday we had 25 clusters and today we have 28, we do not know which of yesterday's 25 clusters correspond to which of today's 28. For the sake of tracking anomalous clusters over time, we need to know this. We cannot lose the information of "which cluster is which" over time.
What we did so far:
- So far we first ran the clustering.
- Then we computed the centroid (mean) of all exemplars for every cluster.
- We then calculated the simple Euclidean distance of every entity to every centroid and assigned each entity to the centroid with the lowest distance.
- We also calculated the Euclidean distance of every new cluster centroid to the existing cluster centroids. If this distance was below a threshold, we discarded the new cluster, with the argument that it is "too similar" to an existing cluster and does not constitute a new cluster.
- We are aware that computing the mean of a non-spherical cluster is not a good idea, and that medoids are in principle better, if it has to be done at all. But so far centroids have worked better than medoids for some reason, based on our validation data. We might change this in the future.
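For concreteness, the steps above can be sketched in plain numpy (function names and the threshold value are illustrative, not our actual pipeline code):

```python
import numpy as np

def assign_to_nearest(entities, centroids):
    """Assign each entity to the centroid with the smallest Euclidean distance."""
    # Pairwise distance matrix of shape (n_entities, n_centroids).
    d = np.linalg.norm(entities[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def filter_new_centroids(new_centroids, old_centroids, threshold):
    """Keep only new centroids farther than `threshold` from every old centroid."""
    d = np.linalg.norm(new_centroids[:, None, :] - old_centroids[None, :, :], axis=2)
    keep = d.min(axis=1) > threshold
    return new_centroids[keep]

# Toy example: two existing centroids, two candidate new ones.
old = np.array([[0.0, 0.0], [5.0, 5.0]])
new = np.array([[0.1, 0.1], [10.0, 10.0]])  # first is "too similar" to old[0]
kept = filter_new_centroids(new, old, threshold=1.0)

entities = np.array([[0.2, 0.0], [4.8, 5.1]])
labels = assign_to_nearest(entities, old)
```

In this toy case only the distant candidate survives the threshold, and the two entities are assigned to the two old centroids, which mirrors the behavior (and the limitations) we described.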
What we would like to do:
- Ideally we would like to completely stop this business of computing mean centers of cluster exemplars and assigning by Euclidean distance.
- Ideally we would just use HDBSCAN itself to assign entities to clusters, with a good way of re-running the clustering every day and disambiguating new from existing clusters in a smart way: adding new clusters to our cluster list and reassigning entities to other clusters if their behavior has changed.
- I guess what we want is fairly similar to the idea of "stream clustering", but run daily. We are aware that there are papers on this topic, but were hoping for some simple pointers or ideas to get us started.
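To make "disambiguate new from existing clusters" operational, one simple idea (purely illustrative, not something the hdbscan library provides) is to match today's clusters to yesterday's by the overlap of the entity IDs they contain, e.g. Jaccard similarity, and treat unmatched clusters as genuinely new:

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of entity IDs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def match_clusters(old, new, min_overlap=0.5):
    """Map each new cluster ID to the old cluster it overlaps most,
    or to None if no old cluster overlaps enough (i.e. it is new)."""
    mapping = {}
    for new_id, new_members in new.items():
        best_id, best_sim = None, 0.0
        for old_id, old_members in old.items():
            sim = jaccard(new_members, old_members)
            if sim > best_sim:
                best_id, best_sim = old_id, sim
        mapping[new_id] = best_id if best_sim >= min_overlap else None
    return mapping

# Toy example: cluster 0 overlaps yesterday's cluster "A"; cluster 1 is new.
old = {"A": ["e1", "e2", "e3"], "B": ["e4", "e5"]}
new = {0: ["e1", "e2", "e6"], 1: ["e7", "e8"]}
mapping = match_clusters(old, new)
```

The `min_overlap` threshold plays the same role as our centroid-distance threshold, but in membership space rather than feature space.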
I hope this was somewhat clear. If there is anyone who has a good idea how this could be achieved then this would be amazing. Thanks a lot!
Well, do you know why you chose HDBSCAN? Is it because your data forms non-convex clusters?
Have you checked the principle of the Outlier Factor? It might help you; I don't have answers, though.