hdbscan
hdbscan copied to clipboard
Crash when points are equal
I experienced a hard Python crash on some real-life data. It turns out it's a call to .minimum_spanning_tree_.plot()
when many points are exactly equal. A minimal example to reproduce:
import numpy as np
import hdbscan
model = hdbscan.HDBSCAN(gen_min_span_tree=True)
data = np.zeros((91, 3))
clustering = model.fit(data)
clustering.minimum_spanning_tree_.plot()
Note that it also happens when only a relative small proportion of points are equal (but only sometimes?), this is just the easiest way to show it.
I looked into this and it appears to be a problem in sklearn.manifold._t_sne._barnes_hut_tsne.gradient()
,not (always?) being able to handle nan
s. For example:
import numpy as np
from sklearn.manifold._t_sne import _barnes_hut_tsne
neighbors = np.array([1, 2, 0, 2, 0, 1], dtype='int64')
val_P = np.full_like(neighbors, 2 / 45, dtype='float32')
pos_output = np.full((3, 2), np.nan, dtype='float32')
forces = np.zeros_like(pos_output)
indptr = np.arange(7, step=2, dtype='int64')
_barnes_hut_tsne.gradient(val_P, pos_output, neighbors, indptr, forces, 0.5, 2, 11)
One layer deeper, the crash occurs inside sklearn.neighbors._quad_tree._QuadTree.build_tree()
:
import numpy as np
from sklearn.neighbors._quad_tree import _QuadTree
qt = _QuadTree(2, 11)
X = np.full((3, 2), np.nan, dtype='float32')
qt.build_tree(X)
This output of this (due to verbose=11
) up to the crash is:
[QuadTree] bounding box axis 0 : [nan, nan] [QuadTree] bounding box axis 1 : [nan, nan] [QuadTree] Inserting depth 0 [QuadTree] inserted point 0 in cell 0 [QuadTree] Inserting depth 0 [QuadTree] inserted point 0 in new child 1 [QuadTree] Inserting depth 0 [QuadTree] Inserting depth 1 ... [QuadTree] Inserting depth 6271 [QuadTree] inserted point 0 in new child 6272 [QuadTree] Inserting depth 6271 [QuadTree] Inserting depth 6272
I didn't dig into the QuadTree code.
I'm unsure whether this is a bug in hdbscan
, scikit-learn
, Cython or CPython...?
I think you might need to push this upstream to scikit-learn and the t-SNE implementation there.