
Can I force approximate_predict to assign every embedding to an existing cluster?

Open mirix opened this issue 2 years ago • 3 comments

Hello,

Let me see if I am understanding things correctly.

I am reducing dimensionality with UMAP:

		clusterable_embedding_large = umap.UMAP(
		    n_neighbors=n_neighbors,
		    min_dist=0.0,
		    n_components=comp,
		    random_state=31416,
		    metric='cosine'
		).fit_transform(df_dist)

Then I split the UMAP embeddings according to predefined indexes (between long and short sentences):

		cel_long = clusterable_embedding_large[long_seg]
		cel_shor = clusterable_embedding_large[shor_seg]

Then I cluster the long sentences only:

		clusterer = hdbscan.HDBSCAN(
		    min_samples=1,
		    min_cluster_size=cluster_size,
		    #cluster_selection_method='eom',
		    cluster_selection_method='leaf',
		    cluster_selection_epsilon=5,
		    gen_min_span_tree=True,
		    prediction_data=True
		).fit(cel_long)

Next I would like to assign each of the short sentences to one of the pre-existing clusters:

		labels = list(clusterer.labels_)
		labels_short, strengths = hdbscan.approximate_predict(clusterer, cel_shor)
		labels_short = list(labels_short)
		
		print(labels)
		[0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1]
		print(labels_short)
		[1, -1, 0, 1, -1, -1, 0, -1, 0, -1, -1, 1, -1, 0, 1, 0, -1, 0, -1, -1, 0, -1, 2, 0, 0, 0, -1, 0, 0, -1, 0, 0, 0, 0, -1, -1, 0, 0, -1, -1, -1, -1]

However, I face two issues:

  1. Some points are not assigned (label -1).

  2. Some points are assigned to a new cluster which did not exist in the original clustering (label 2).

I believe I understand the first issue, but I would like to avoid it if possible. Is it possible to force approximate_predict to assign a data point to the nearest cluster no matter what?

On the other hand, I believed the second behaviour was not possible. From the docs:

With that done you can run [approximate_predict()](https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict) with the model and any new data points you wish to predict. Note that this differs from re-running HDBSCAN with the new points added since no new clusters will be considered – instead the new points will be labelled according to the clusters already labelled by the model.

Can this also be avoided?

Best,

Ed

mirix avatar Jul 05 '23 12:07 mirix

I think you want to try the soft clustering options to achieve that.

lmcinnes avatar Jul 05 '23 13:07 lmcinnes

Thanks, it seems promising. I will look into that.

In the meantime, I have found a workaround:

I cluster all the points together as usual. Then, for each short sentence, I compute its average distance to each cluster (considering long sentences only) and reassign it if required.

This seems to solve the problem on the current dataset.
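The workaround above can be sketched in plain NumPy. The function name and toy data are mine, not from the thread; the idea is simply the mean Euclidean distance from each new point to each cluster's members:

```python
import numpy as np

def assign_to_nearest_cluster(points, ref_points, ref_labels):
    """Assign each point the label of the cluster whose members have the
    smallest mean distance to it. Noise points (label -1) in the
    reference set are excluded from the candidate clusters."""
    ref_points = np.asarray(ref_points, dtype=float)
    ref_labels = np.asarray(ref_labels)
    clusters = sorted(c for c in set(ref_labels.tolist()) if c != -1)
    out = []
    for p in np.asarray(points, dtype=float):
        mean_dists = [
            np.linalg.norm(ref_points[ref_labels == c] - p, axis=1).mean()
            for c in clusters
        ]
        out.append(clusters[int(np.argmin(mean_dists))])
    return out

# Toy example: two tight clusters; every query point gets a real label.
ref = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
lab = [0, 0, 1, 1]
print(assign_to_nearest_cluster([[0.0, 0.5], [10.0, 10.5]], ref, lab))
# → [0, 1]
```

Note that mean distance to cluster members favours compact clusters; for elongated or variable-density clusters the soft clustering membership strengths are likely a better guide.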

mirix avatar Jul 05 '23 14:07 mirix

In case you are interested, HDBSCAN works wonderfully for clustering speakers in a diarisation project:

https://github.com/mirix/approaches-to-diarisation

I am really impressed. The challenge now would be to come up with some heuristics or ML to guess the optimal parameters automatically.

mirix avatar Jul 06 '23 07:07 mirix