joblib error for (relatively) big min_cluster_size parameter values
Hi guys,
thank you for the implementation of the algorithm - for the most part it works incredibly well. Recently, however, I ran into an error that a quick Google search doesn't seem to address. I am trying to cluster a dataset of size (560823, 2). I have successfully clustered datasets 10x larger, but this time I have a particular problem. My dataset contains a central, dense mass of points that I would like to 'extract' from the rest of the rather loose observations, and HDBSCAN and other density-based methods seem perfect for that. To extract this mass and disregard everything else, I am setting the parameter min_cluster_size to 10000 (I have also tried 5000 and 2500). In every case the procedure halts and throws an error with the following stack trace:
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 344, in _sendback_result
exception=exception))
File "/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/backend/queues.py", line 240, in put
self._writer.send_bytes(obj)
File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""
The above exception was the direct cause of the following exception:
error Traceback (most recent call last)
<ipython-input-31-5141c7286f77> in <module>
3 hdbscan_obj = HDBSCAN(min_cluster_size=2500, )
4 # hdbscan_obj.fit(SWISSPROT_PCA_TABLE)
----> 5 hdbscan_obj.fit(SWISSPROT_UMAP_TABLE[['umap_0', 'umap_1']].sample(300000))
6 labels = hdbscan_obj.labels_
7 pkl.dump(labels, open('HDBSCAN_labels.pkl', 'wb'))
/opt/conda/lib/python3.7/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
917 self._condensed_tree,
918 self._single_linkage_tree,
--> 919 self._min_spanning_tree) = hdbscan(X, **kwargs)
920
921 if self.prediction_data:
/opt/conda/lib/python3.7/site-packages/hdbscan/hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
613 approx_min_span_tree,
614 gen_min_span_tree,
--> 615 core_dist_n_jobs, **kwargs)
616 else: # Metric is a valid BallTree metric
617 # TO DO: Need heuristic to decide when to go to boruvka;
/opt/conda/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
353
354 def __call__(self, *args, **kwargs):
--> 355 return self.func(*args, **kwargs)
356
357 def call_and_shelve(self, *args, **kwargs):
/opt/conda/lib/python3.7/site-packages/hdbscan/hdbscan_.py in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
276 leaf_size=leaf_size // 3,
277 approx_min_span_tree=approx_min_span_tree,
--> 278 n_jobs=core_dist_n_jobs, **kwargs)
279 min_spanning_tree = alg.spanning_tree()
280 # Sort edges of the min_spanning_tree by weight
hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()
hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()
/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
1015
1016 with self._backend.retrieval_context():
-> 1017 self.retrieve()
1018 # Make sure that we get a last message telling us we are done
1019 elapsed_time = time.time() - self._start_time
/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in retrieve(self)
907 try:
908 if getattr(self._backend, 'supports_timeout', False):
--> 909 self._output.extend(job.get(timeout=self.timeout))
910 else:
911 self._output.extend(job.get())
/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
560 AsyncResults.get from multiprocessing."""
561 try:
--> 562 return future.result(timeout=timeout)
563 except LokyTimeoutError:
564 raise TimeoutError()
/opt/conda/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
433 raise CancelledError()
434 elif self._state == FINISHED:
--> 435 return self.__get_result()
436 else:
437 raise TimeoutError()
/opt/conda/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
error: 'i' format requires -2147483648 <= number <= 2147483647
The error is, as you can see, quite cryptic :) The code I am using:
hdbscan_obj = HDBSCAN(min_cluster_size=2500)
hdbscan_obj.fit(MY_TABLE[['var_0', 'var_1']])
I installed hdbscan ver. 0.8.26 via conda and I am running it on Python 3.7.7.
Further info:
As I suspect that some internal object of the algorithm grows too big, I played a little with the parameters. It turns out that for my dataset the error starts popping up somewhere between 200,000 and 300,000 observations for min_cluster_size=2500, and somewhere between 50,000 and 100,000 observations for min_cluster_size=10000, so the two parameters are interconnected. A sketch of the kind of sweep I ran is below.
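This is a minimal sketch of that sweep, not the exact code I ran (the sample sizes are illustrative; SWISSPROT_UMAP_TABLE is my own pandas DataFrame, as seen in the traceback above):

from hdbscan import HDBSCAN

# Illustrative sweep: for a fixed min_cluster_size, grow the subsample
# until the struct.error appears.
for n_obs in (200000, 250000, 300000):
    subsample = SWISSPROT_UMAP_TABLE[['umap_0', 'umap_1']].sample(n_obs)
    try:
        HDBSCAN(min_cluster_size=2500).fit(subsample)
        print(n_obs, 'ok')
    except Exception as exc:  # the struct.error propagates up through joblib
        print(n_obs, 'failed:', exc)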
Cheers!
I am having an identical issue. Any thoughts on this?
The parameter min_samples is used for computing the linkage tree, and defaults to the value of min_cluster_size. When setting min_cluster_size to a large value, min_samples should be set to something smaller to reduce memory usage.
(Old issue, but still open so adding this in case anyone else has similar trouble)
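For anyone else landing here, a minimal sketch of the suggestion above (the parameter values are illustrative, and MY_TABLE is the original poster's placeholder). The struct.error in the traceback most likely means a worker tried to send back a pickled result larger than 2 GiB through a multiprocessing pipe ('i' is a signed 32-bit length field), so keeping min_samples small keeps the per-worker core-distance payload under that limit:

from hdbscan import HDBSCAN

# Keep min_cluster_size large to extract only the big dense mass, but set
# min_samples explicitly so it no longer defaults to min_cluster_size.
hdbscan_obj = HDBSCAN(
    min_cluster_size=10000,  # drives which clusters are kept
    min_samples=10,          # illustrative; much smaller than min_cluster_size
)
labels = hdbscan_obj.fit_predict(MY_TABLE[['var_0', 'var_1']])

If that is not enough, setting core_dist_n_jobs=1 should also sidestep the joblib serialization entirely, at the cost of parallel core-distance computation.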