
joblib error for (relatively) big min_cluster_size parameter values

Open Garrus990 opened this issue 5 years ago • 2 comments

Hi guys,

thank you for the implementation of the algorithm - it mostly works incredibly well. But recently I encountered an error that, according to a quick Google search, hasn't been addressed anywhere. I am trying to cluster a dataset of size (560823, 2). I have successfully clustered 10x larger datasets, but this time I've run into a particular problem. In my dataset there is a central, dense mass of points that I would like to 'extract' from the rest of the rather loose observations. HDBSCAN and other density-based methods seem perfect for that. To extract this mass and disregard everything else, I set the parameter min_cluster_size to 10000 (I've also tried 5000 and 2500). In all cases the procedure halts and throws an error with the following stack trace:

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 344, in _sendback_result
    exception=exception))
  File "/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/backend/queues.py", line 240, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

error                                     Traceback (most recent call last)
<ipython-input-31-5141c7286f77> in <module>
      3 hdbscan_obj = HDBSCAN(min_cluster_size=2500, )
      4 # hdbscan_obj.fit(SWISSPROT_PCA_TABLE)
----> 5 hdbscan_obj.fit(SWISSPROT_UMAP_TABLE[['umap_0', 'umap_1']].sample(300000))
      6 labels = hdbscan_obj.labels_
      7 pkl.dump(labels, open('HDBSCAN_labels.pkl', 'wb'))

/opt/conda/lib/python3.7/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
    917          self._condensed_tree,
    918          self._single_linkage_tree,
--> 919          self._min_spanning_tree) = hdbscan(X, **kwargs)
    920 
    921         if self.prediction_data:

/opt/conda/lib/python3.7/site-packages/hdbscan/hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    613                                              approx_min_span_tree,
    614                                              gen_min_span_tree,
--> 615                                              core_dist_n_jobs, **kwargs)
    616         else:  # Metric is a valid BallTree metric
    617             # TO DO: Need heuristic to decide when to go to boruvka;

/opt/conda/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    353 
    354     def __call__(self, *args, **kwargs):
--> 355         return self.func(*args, **kwargs)
    356 
    357     def call_and_shelve(self, *args, **kwargs):

/opt/conda/lib/python3.7/site-packages/hdbscan/hdbscan_.py in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
    276                                  leaf_size=leaf_size // 3,
    277                                  approx_min_span_tree=approx_min_span_tree,
--> 278                                  n_jobs=core_dist_n_jobs, **kwargs)
    279     min_spanning_tree = alg.spanning_tree()
    280     # Sort edges of the min_spanning_tree by weight

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
   1015 
   1016             with self._backend.retrieval_context():
-> 1017                 self.retrieve()
   1018             # Make sure that we get a last message telling us we are done
   1019             elapsed_time = time.time() - self._start_time

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in retrieve(self)
    907             try:
    908                 if getattr(self._backend, 'supports_timeout', False):
--> 909                     self._output.extend(job.get(timeout=self.timeout))
    910                 else:
    911                     self._output.extend(job.get())

/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    560         AsyncResults.get from multiprocessing."""
    561         try:
--> 562             return future.result(timeout=timeout)
    563         except LokyTimeoutError:
    564             raise TimeoutError()

/opt/conda/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    433                 raise CancelledError()
    434             elif self._state == FINISHED:
--> 435                 return self.__get_result()
    436             else:
    437                 raise TimeoutError()

/opt/conda/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

error: 'i' format requires -2147483648 <= number <= 2147483647

The error is, as you can see, quite cryptic :) The code I am using:

hdbscan_obj = HDBSCAN(min_cluster_size=2500)
hdbscan_obj.fit(MY_TABLE[['var_0', 'var_1']])

I installed hdbscan version 0.8.26 via conda and I am running it on Python 3.7.7.

Further info: since I suspect that some internal object of the algorithm is too big, I played around with the parameters a little. It turns out that, for my dataset, the error starts popping up somewhere between 200,000 and 300,000 observations for a min_cluster_size of 2500, and somewhere between 50,000 and 100,000 observations for a min_cluster_size of 10000, so these two parameters are interconnected.
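
For what it's worth, the bottom-most struct.error seems to come from loky writing a 32-bit length header in front of each pickled result (the header = struct.pack("!i", n) frame in the trace above), so any single serialized payload over 2 GiB would fail exactly like this. A minimal sketch of just that low-level failure, independent of hdbscan:

import struct

# loky packs the size of each pickled result into a signed 32-bit int
# before sending it through the pipe; anything over 2**31 - 1 bytes
# (~2 GiB) overflows the 'i' format, reproducing the error above
struct.pack("!i", 2**31)
# struct.error: 'i' format requires -2147483648 <= number <= 2147483647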

Cheers!

Garrus990 avatar May 05 '20 14:05 Garrus990

I am having an identical issue. Any thoughts on this?

eparsonnet93 avatar Jul 01 '20 06:07 eparsonnet93

The parameter min_samples is used for computing the linkage tree and defaults to the value of min_cluster_size. When setting min_cluster_size to a large value, min_samples should be set to something smaller to reduce memory usage.
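
For example (a rough sketch; min_samples=100 is just an illustrative value, and the random data stands in for a large two-column array like the one in the report):

import numpy as np
from hdbscan import HDBSCAN

# stand-in for a large two-column dataset
data = np.random.normal(size=(300000, 2))

# setting min_samples explicitly keeps the core-distance computation
# (and the per-worker results joblib has to serialize) much smaller
# than the default min_samples = min_cluster_size would
hdbscan_obj = HDBSCAN(min_cluster_size=10000, min_samples=100)
hdbscan_obj.fit(data)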

(Old issue, but still open so adding this in case anyone else has similar trouble)

sa2329 avatar Aug 21 '24 17:08 sa2329