hdbscan icon indicating copy to clipboard operation
hdbscan copied to clipboard

Crash when points are equal

Open mdruiter opened this issue 1 year ago • 1 comments

I experienced a hard Python crash on some real-life data. It turns out it's a call to .minimum_spanning_tree_.plot() when many points are exactly equal. A minimal example to reproduce:

import numpy as np
import hdbscan
model = hdbscan.HDBSCAN(gen_min_span_tree=True)
data = np.zeros((91, 3))
clustering = model.fit(data)
clustering.minimum_spanning_tree_.plot()

Note that it also happens when only a relative small proportion of points are equal (but only sometimes?), this is just the easiest way to show it.

I looked into this and it appears to be a problem in sklearn.manifold._t_sne._barnes_hut_tsne.gradient(),not (always?) being able to handle nans. For example:

import numpy as np
from sklearn.manifold._t_sne import _barnes_hut_tsne
neighbors = np.array([1, 2, 0, 2, 0, 1], dtype='int64')
val_P = np.full_like(neighbors, 2 / 45, dtype='float32')
pos_output = np.full((3, 2), np.nan, dtype='float32')
forces = np.zeros_like(pos_output)
indptr = np.arange(7, step=2, dtype='int64')
_barnes_hut_tsne.gradient(val_P, pos_output, neighbors, indptr, forces, 0.5, 2, 11)

One layer deeper, the crash occurs inside sklearn.neighbors._quad_tree._QuadTree.build_tree():

import numpy as np
from sklearn.neighbors._quad_tree import _QuadTree
qt = _QuadTree(2, 11)
X = np.full((3, 2), np.nan, dtype='float32')
qt.build_tree(X)

This output of this (due to verbose=11) up to the crash is:

[QuadTree] bounding box axis 0 : [nan, nan] [QuadTree] bounding box axis 1 : [nan, nan] [QuadTree] Inserting depth 0 [QuadTree] inserted point 0 in cell 0 [QuadTree] Inserting depth 0 [QuadTree] inserted point 0 in new child 1 [QuadTree] Inserting depth 0 [QuadTree] Inserting depth 1 ... [QuadTree] Inserting depth 6271 [QuadTree] inserted point 0 in new child 6272 [QuadTree] Inserting depth 6271 [QuadTree] Inserting depth 6272

I didn't dig into the QuadTree code. I'm unsure whether this is a bug in hdbscan, scikit-learn, Cython or CPython...?

mdruiter avatar Feb 01 '24 09:02 mdruiter

I think you might need to push this upstream to scikit-learn and the t-SNE implementation there.

lmcinnes avatar Feb 01 '24 10:02 lmcinnes