hdbscan Counter-intuitive noise points

Counter-intuitive noise points

Open erooke opened this issue 2 years ago • 1 comments

I have been playing with hdbscan to try and build an intuition for what it is doing. Currently I am running into counter-intuitive behavior when running it on synthetic data. In particular I have been running hdbscan on data sampled evenly from a circle. My understanding of the algorithm suggests it should return a single cluster similar to what dbscan would do with the proper epsilon setting. However, hdbscan is instead identifying a single cluster and a collection of noise points. If I reduce the minimum number of points needed for a cluster below 4 the noise points vanish. Due to the symmetry of the data I'm not seeing why this parameter should make much of a difference on how the clustering works. I'm curious if my intuition is way off or if there is an issue with how I am invoking hdbscan.

Code:

from hdbscan import HDBSCAN
import matplotlib.pyplot as plt
import numpy as np

min_cluster_size = 4 # Minimum size to cause an issue
samples = 100

theta = np.linspace(-np.pi, np.pi, samples, endpoint=False)
data = np.zeros((samples, 2))
data[:,0] = np.cos(theta)
data[:,1] = np.sin(theta)

clusterer = HDBSCAN(min_cluster_size=min_cluster_size, allow_single_cluster=True)
clusterer.fit(data)

labels = set(clusterer.labels_)
for label in labels:
    cluster = data[clusterer.labels_ == label]
    plt.scatter(cluster[:,0], cluster[:,1])

plt.axis("equal")
plt.show()

Expected output:

Actual output: (note the orange noise points in the lower right)

System Information: python version: 3.10.8 hdbscan version: 0.8.29

Dec 07 '22 18:12 erooke

That's interesting, I'd also like to have an explanation. Setting min_samples to 3 leads to expected results. I personally always set min_samples explicitly and that's important especially when using large min_cluster_size, because the default value, ie a value equal to min_cluster_size, would lead to ultra-conservative clustering. However, that's not the case here.

Perhaps it has something to do with float accuracy causing the intended equidistant steps not be truly equidistant? With theta = np.linspace(-np.pi, np.pi, samples, endpoint=False):

With theta = np.linspace(-3.14, 3.14, samples, endpoint=False):

May 23 '23 12:05 lucetka

hdbscan hdbscan copied to clipboard

Counter-intuitive noise points

hdbscan
hdbscan copied to clipboard