hdbscan
hdbscan copied to clipboard
Counter-intuitive noise points
I have been playing with hdbscan to try and build an intuition for what it is doing. Currently I am running into counter-intuitive behavior when running it on synthetic data. In particular I have been running hdbscan on data sampled evenly from a circle. My understanding of the algorithm suggests it should return a single cluster similar to what dbscan would do with the proper epsilon setting. However, hdbscan is instead identifying a single cluster and a collection of noise points. If I reduce the minimum number of points needed for a cluster below 4 the noise points vanish. Due to the symmetry of the data I'm not seeing why this parameter should make much of a difference on how the clustering works. I'm curious if my intuition is way off or if there is an issue with how I am invoking hdbscan.
Code:
from hdbscan import HDBSCAN
import matplotlib.pyplot as plt
import numpy as np
min_cluster_size = 4 # Minimum size to cause an issue
samples = 100
theta = np.linspace(-np.pi, np.pi, samples, endpoint=False)
data = np.zeros((samples, 2))
data[:,0] = np.cos(theta)
data[:,1] = np.sin(theta)
clusterer = HDBSCAN(min_cluster_size=min_cluster_size, allow_single_cluster=True)
clusterer.fit(data)
labels = set(clusterer.labels_)
for label in labels:
cluster = data[clusterer.labels_ == label]
plt.scatter(cluster[:,0], cluster[:,1])
plt.axis("equal")
plt.show()
Expected output:
Actual output:
(note the orange noise points in the lower right)
System Information: python version: 3.10.8 hdbscan version: 0.8.29
That's interesting, I'd also like to have an explanation. Setting min_samples to 3 leads to expected results.
I personally always set min_samples explicitly and that's important especially when using large min_cluster_size, because the default value, ie a value equal to min_cluster_size, would lead to ultra-conservative clustering. However, that's not the case here.
Perhaps it has something to do with float accuracy causing the intended equidistant steps not be truly equidistant?
With theta = np.linspace(-np.pi, np.pi, samples, endpoint=False):
With theta = np.linspace(-3.14, 3.14, samples, endpoint=False):