
OSError: [Errno 22] Invalid argument with large dataset

PezAmaury opened this issue · 1 comment

Hi!

I'm following this approach for a dataset of about 45,000 documents (so fewer documents than in this notebook): https://github.com/MNoichl/structure_economics_2019/blob/master/econ_code.ipynb

When I reach cell 16

import hdbscan
import matplotlib.pyplot as plt

# Cluster the UMAP embedding and colour the scatter plot by cluster label
plt.figure(figsize=(10, 10))
clusterer = hdbscan.HDBSCAN(min_cluster_size=300, min_samples=100).fit(to_cluster.embedding_)
plt.scatter(to_cluster.embedding_[:, 0], to_cluster.embedding_[:, 1], s=0.1, c=clusterer.labels_, cmap='Spectral')

# Number of distinct labels minus one (to discount the -1 noise label)
col_len = len(set(clusterer.labels_)) - 1
print(col_len)

I get the following error:

Traceback (most recent call last):
  File "UMAP.py", line 305, in <module>
    clusterer_count = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=2).fit(embedding_count.embedding_)
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/hdbscan/hdbscan_.py", line 919, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/hdbscan/hdbscan_.py", line 615, in hdbscan
    core_dist_n_jobs, **kwargs)
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/joblib/memory.py", line 355, in __call__
    return self.func(*args, **kwargs)
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/hdbscan/hdbscan_.py", line 278, in _hdbscan_boruvka_kdtree
    n_jobs=core_dist_n_jobs, **kwargs)
  File "hdbscan/_hdbscan_boruvka.pyx", line 375, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__
  File "hdbscan/_hdbscan_boruvka.pyx", line 411, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/joblib/parallel.py", line 949, in __call__
    n_jobs = self._initialize_backend()
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/joblib/parallel.py", line 710, in _initialize_backend
    **self._backend_args)
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 509, in configure
    **memmappingexecutor_args)
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/joblib/executor.py", line 37, in get_memmapping_executor
    initargs=initargs, env=env)
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/joblib/externals/loky/reusable_executor.py", line 116, in get_reusable_executor
    executor_id=executor_id, **kwargs)
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/joblib/externals/loky/reusable_executor.py", line 153, in __init__
    initializer=initializer, initargs=initargs, env=env)
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 915, in __init__
    self._processes_management_lock = self._context.Lock()
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/joblib/externals/loky/backend/context.py", line 225, in Lock
    return Lock()
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/joblib/externals/loky/backend/synchronize.py", line 174, in __init__
    super(Lock, self).__init__(SEMAPHORE, 1, 1)
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/joblib/externals/loky/backend/synchronize.py", line 90, in __init__
    resource_tracker.register(self._semlock.name, "semlock")
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/joblib/externals/loky/backend/resource_tracker.py", line 171, in register
    self.ensure_running()
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/joblib/externals/loky/backend/resource_tracker.py", line 143, in ensure_running
    pid = spawnv_passfds(exe, args, fds_to_pass)
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/joblib/externals/loky/backend/resource_tracker.py", line 301, in spawnv_passfds
    return fork_exec(args, _pass)
  File "/mnt/c/Users/thiabaud/Documents/Environments/ASR/lib/python3.6/site-packages/joblib/externals/loky/backend/fork_exec.py", line 43, in fork_exec
    pid = os.fork()
OSError: [Errno 22] Invalid argument

I have run the code both in Jupyter and directly from the command line, but I keep getting this error. I tried it on smaller random subsets, and on consecutive subsets of random lengths, and those all work fine: only when I use the full dataset do I get the error. Attached is the list of packages in my environment.

list.txt
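
Since the traceback ends in joblib's loky backend calling os.fork(), a sanity check I could try (my own sketch, not part of the notebook; to_cluster is the UMAP result from the cell above) is to keep the full dataset in memory and then ask joblib to run a trivial parallel task, to see whether spawning workers fails independently of hdbscan:

from joblib import Parallel, delayed

# to_cluster.embedding_ (the full ~45,000-point UMAP embedding) is still in memory here.
# If this trivial parallel call also raises OSError: [Errno 22], the failure is in
# joblib/loky spawning worker processes from a large process (e.g. under WSL),
# not in hdbscan itself.
print(Parallel(n_jobs=4)(delayed(abs)(-i) for i in range(8)))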

PezAmaury avatar May 14 '20 08:05 PezAmaury

Did you find a solution to this problem? I am running into the same thing. It seems like it only happens with datasets > 2^14 points.
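
If the parallel core-distance step is what only kicks in above that size (which would explain the 2^14 threshold), one workaround to try is keeping that step single-threaded so joblib never has to fork workers. A sketch, not verified here; core_dist_n_jobs is a standard HDBSCAN parameter, and the rest mirrors the notebook cell above:

import hdbscan

# Possible workaround (untested): force single-threaded core-distance computation
# so joblib/loky never spawns worker processes during the Boruvka step.
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=300,
    min_samples=100,
    core_dist_n_jobs=1,
).fit(to_cluster.embedding_)
print(len(set(clusterer.labels_)) - 1)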

evanderveer avatar Aug 30 '22 06:08 evanderveer