Predicting clusters for new points: fast and portable method?
I need to use HDBSCAN clusters in a mobile app and I am exploring ways of doing this on-device, say in Swift for a start.
One approach I have in mind is fitting a neural net to predict the cluster. Specifically:
- HDBSCAN's `approximate_predict` returns a membership strength along with each label, which would allow using fuzzy labels to guide the NN;
- I would train the clusterer as usual (non-parametrically), then use a separate dataset to generate training data, with `approximate_predict` providing the ground-truth answers.
In short, similar to UMAP's parametric mode, except I will "train" the clusterer non-parametrically and then approximate it with an NN.
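To make the fuzzy-label part concrete, here is a rough sketch of how I imagine turning the labels and strengths into soft targets for the NN. The helper name `fuzzy_targets` and the rule "put the strength on the predicted cluster and the remainder on a noise class" are my own assumptions, not something HDBSCAN prescribes, and the example values are made up rather than real clusterer output.

```python
def fuzzy_targets(labels, strengths, n_clusters):
    """Turn (label, strength) pairs into soft target rows for a classifier.

    Column 0 is the noise class; columns 1..n_clusters are the clusters.
    Assumption (mine): a point with label k and strength s puts s on
    cluster k and the remaining 1 - s on noise; label -1 is pure noise.
    """
    targets = []
    for label, strength in zip(labels, strengths):
        row = [0.0] * (n_clusters + 1)
        if label < 0:
            row[0] = 1.0
        else:
            s = min(max(float(strength), 0.0), 1.0)  # clamp to [0, 1]
            row[label + 1] = s
            row[0] = 1.0 - s
        targets.append(row)
    return targets


# Made-up values shaped like the two arrays approximate_predict returns.
labels = [0, 2, -1, 1]
strengths = [0.9, 0.5, 0.0, 1.0]
targets = fuzzy_targets(labels, strengths, n_clusters=3)
```

In the real pipeline, `labels` and `strengths` would come from `hdbscan.approximate_predict(clusterer, X_new)`, with the clusterer fitted using `prediction_data=True`; the resulting rows would then serve as targets for a cross-entropy loss.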
A few questions:
- Has anyone tried that? Would that work?
- To generate training data for the NN using the pre-trained clusterer and its `approximate_predict`, what would make the most sense as training inputs:
  - the same dataset as used to non-parametrically "train" the clusterer in the first place?
  - a different dataset distributed similarly to the dataset the clusterer was non-parametrically "trained" on?
  - a different dataset distributed differently: more noise and not as well clustered?
  - randomly generated (say, Sobol sequence) vectors?
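For reference, this is roughly how I would build the four candidate input sets before labelling each with `approximate_predict`. It is only a sketch: `candidate_inputs`, its parameters and the toy points are hypothetical, the "noisier" set is simply the training data with Gaussian jitter added, and plain uniform sampling inside the training bounding box stands in for a Sobol sequence.

```python
import random


def candidate_inputs(train, holdout, noise_scale=0.5, n_random=100, seed=0):
    """Build the four candidate NN input sets from lists of points."""
    rng = random.Random(seed)
    dims = len(train[0])
    # Per-dimension bounding box of the training data.
    lo = [min(p[d] for p in train) for d in range(dims)]
    hi = [max(p[d] for p in train) for d in range(dims)]
    return {
        # 1. the data the clusterer was fitted on
        "same": [list(p) for p in train],
        # 2. separate data from (assumed) the same distribution
        "similar": [list(p) for p in holdout],
        # 3. noisier data: training points with Gaussian jitter
        "noisier": [[x + rng.gauss(0.0, noise_scale) for x in p] for p in train],
        # 4. random vectors: uniform stand-in for a Sobol sequence
        "random": [
            [rng.uniform(lo[d], hi[d]) for d in range(dims)]
            for _ in range(n_random)
        ],
    }


# Toy 2-D points, for illustration only.
train = [[0.0, 0.0], [1.0, 2.0], [4.0, 1.0]]
holdout = [[0.5, 0.5], [3.5, 1.5]]
sets = candidate_inputs(train, holdout, n_random=10)
```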
Thanks!