
Predicting clusters for new points: fast and portable method?

Open dimitry12 opened this issue 2 years ago • 0 comments

I need to use HDBSCAN clusters in a mobile app, and I am exploring ways to predict cluster membership on-device (say, in Swift, for a start).

One approach I have in mind is fitting a neural net to predict the cluster. Specifically:

  • HDBSCAN's approximate_predict returns a membership strength alongside each label, which would allow using fuzzy (soft) labels to guide the NN;
  • I would train the clusterer as usual (non-parametrically), then use a separate dataset to generate training data with approximate_predict providing ground-truth answers.
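To make the fuzzy-label idea concrete, here is a minimal stdlib-only sketch of how I imagine turning the output into NN targets. It assumes `labels` and `strengths` are the two arrays returned by `hdbscan.approximate_predict(clusterer, points)`, with label `-1` meaning noise. The helper name `soft_targets` and the choice to put the leftover mass `1 - strength` into an extra "noise" class are my own assumptions, not anything the library prescribes.

```python
def soft_targets(labels, strengths, n_clusters):
    """Build soft target vectors of length n_clusters + 1.

    Index k < n_clusters is cluster k; the last index is a hypothetical
    extra "noise" class (an assumption of this sketch).
    """
    noise = n_clusters
    targets = []
    for label, strength in zip(labels, strengths):
        row = [0.0] * (n_clusters + 1)
        if label < 0:
            # Points labelled -1 (noise) put all mass on the noise class.
            row[noise] = 1.0
        else:
            # Clamp to [0, 1], then split mass between the cluster and noise.
            s = min(max(float(strength), 0.0), 1.0)
            row[label] = s
            row[noise] = 1.0 - s
        targets.append(row)
    return targets


# Toy values standing in for approximate_predict output.
example = soft_targets([0, -1, 1], [0.75, 0.0, 1.0], n_clusters=2)
```

Each row sums to 1, so it could be used directly as a target for a softmax output with cross-entropy loss.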

In short, this is similar to UMAP's parametric mode, except that I would fit the clusterer non-parametrically and then approximate it with an NN.

A few questions:

  • Has anyone tried that? Would that work?
  • To generate training data for the NN using the pre-trained clusterer and its approximate_predict, which inputs would make the most sense:
    • the same dataset that was used to non-parametrically "train" the clusterer in the first place?
    • a different dataset with a distribution similar to the one the clusterer was "trained" on?
    • a different dataset with a different distribution: noisier and not as well clustered?
    • randomly generated vectors (say, from a Sobol sequence)?
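For concreteness, the last two kinds of input could be generated roughly as follows. This is only a stdlib sketch under my own assumptions: the helper names are hypothetical, Gaussian jitter stands in for a "noisier" dataset, and uniform pseudo-random sampling inside the bounding box stands in for a Sobol sequence. Each pool would then be passed to approximate_predict to obtain labels and strengths.

```python
import random


def bounding_box(points):
    """Per-dimension (min, max) over a list of equal-length vectors."""
    return [(min(col), max(col)) for col in zip(*points)]


def jittered(points, sigma, rng):
    """Copies of the points with Gaussian noise added to every coordinate."""
    return [[x + rng.gauss(0.0, sigma) for x in p] for p in points]


def uniform_in_box(box, n, rng):
    """n vectors drawn uniformly inside the box (stand-in for Sobol points)."""
    return [[rng.uniform(lo, hi) for lo, hi in box] for _ in range(n)]


rng = random.Random(0)  # fixed seed so the pools are reproducible
train = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]  # toy data
box = bounding_box(train)
noisy_pool = jittered(train, sigma=0.5, rng=rng)
fill_pool = uniform_in_box(box, n=8, rng=rng)
```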

Thanks!

dimitry12 • May 26 '22 17:05