Pickling error when using UMAP with pynndescent in a Spark-based environment
When using UMAP without pynndescent in a Spark-based environment, everything works fine. However, when using UMAP with pynndescent, I get the following error:
_pickle.PicklingError: Failed in nopython mode pipeline (step: nopython mode backend)
Can't pickle <class 'collections.FlatTree'>: attribute lookup FlatTree on collections failed
Traceback (most recent call last):
File "/conda-env/lib/python3.6/site-packages/numba/core/pythonapi.py", line 1328, in serialize_object
gv = self.module.__serialized[obj]
KeyError: <class 'collections.FlatTree'>
I am using numba 0.50.1.
Hmm, this is a new one. I'll have to look into this but I can't promise any swift resolution.
I am also running into the same issue in a Spark-based environment.
A number of pickling issues were resolved in pynndescent v0.5 and umap v0.5 now depends on that -- are you seeing the issue with the newer versions? If so it seems like something specific to the spark environment, which is a little more confusing.
Running into the same issue in a spark environment with umap-learn = 0.5.1 and pynndescent = 0.5.2. @koustavmukherjee Did you find any solution?
Also having this problem with these exact versions
Is it specifically complaining about the collections.FlatTree? If so I'll look and see if there is some reason that is still potentially causing issues.
No @simon-slowik, unfortunately I haven't found a solution yet for working with more recent versions of umap-learn from within a Spark-based environment. For now, I have resorted to an older version of umap-learn (0.4.6) and am using it without pynndescent.
@lmcinnes yes the error message is specifically about collections.FlatTree.
It seems that pinning both umap-learn and pynndescent to 0.5.0 fixes the issue.
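For reference, the pin amounts to (assuming a pip-based environment):
pip install umap-learn==0.5.0 pynndescent==0.5.0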
Thanks @simon-slowik. It is particularly interesting that pinning to 0.5 works. I'll see if I can get this fixed.
I was running into this today, and I think I found the cause. It looks like pyspark is overwriting collections.namedtuple with its own function, but that function isn't correctly setting the __module__ field.
Here's a minimal reproduction:
>>> # don't load `pyspark`
>>> import collections
>>> collections.namedtuple("n", [])
<class '__main__.n'>
>>> import pyspark
>>> import collections
>>> collections.namedtuple("n", [])
<class 'collections.n'>
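To see the incorrectly set field directly, you can inspect __module__ in the same session; judging by the repr above, it comes back as 'collections' instead of '__main__':
>>> collections.namedtuple("n", []).__module__
'collections'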
I think there are a couple of options for "fixing" this:
- Fix the issue in pyspark: either set the __module__ correctly (as the stdlib does) or "un-hijack" collections.namedtuple after they're done with it.
- Pass the module= kwarg to namedtuple as __name__. This avoids needing to make a change to pyspark but does require a change to pynndescent (seems like @lmcinnes is the maintainer of that one too, so it shouldn't be too hard :smile:). For example:
>>> import pyspark
>>> import collections
>>> collections.namedtuple("n", [], module=__name__)
<class '__main__.n'>
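For concreteness, here is a minimal sketch of what the pynndescent side of option (2) could look like. The real FlatTree is defined in pynndescent/rp_trees.py; the field names below are illustrative placeholders, not the actual definition:

from collections import namedtuple

# Sketch only: passing module= explicitly means the generated class pickles
# correctly even when pyspark has replaced collections.namedtuple.
# Field names here are illustrative, not the real FlatTree fields.
FlatTree = namedtuple(
    "FlatTree",
    ["hyperplanes", "offsets", "children", "indices", "leaf_size"],
    module=__name__,
)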
In the meantime, users can monkey-patch pynndescent.rp_trees.FlatTree after importing it to get around the bug:
>>> import pyspark
>>> import pynndescent
>>> pynndescent.rp_trees.FlatTree.__module__ = "pynndescent.rp_trees"
>>> ... # my code goes after this
This is definitely the grossest solution, but it should work until one of the better solutions gets merged & released.
@lmcinnes if you think (2) is a reasonable course of action, I'm happy to open a PR.
Sorry to bring this one back, but I am having exactly the same issue with pyspark on Colab. I am running a hyperparameter optimization over UMAP when this happens.
I use
>>> import pyspark
>>> import pynndescent
>>> pynndescent.rp_trees.FlatTree.__module__ = "pynndescent.rp_trees"
then in fmin from hyperopt I set the following argument:
>>> trials=hyperopt.SparkTrials(n_trials)
and the error follows. Please note that if n_trials is None or a negative number, it seems to work properly for some trials; otherwise it spits back this error.
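For context, SparkTrials takes parallelism (not a trial count) as its first positional argument, so SparkTrials(n_trials) sets the parallelism, which matches the warning in the log below. A minimal sketch of the explicit form, with objective and space standing in for the real definitions:
>>> import hyperopt
>>> trials = hyperopt.SparkTrials(parallelism=4)
>>> best = hyperopt.fmin(fn=objective, space=space,
...                      algo=hyperopt.tpe.suggest,
...                      max_evals=30, trials=trials)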
I'd also like to understand how I can disable pynndescent when using UMAP.
Thank you in advance.
BTW the error is:
Because the requested parallelism was None or a non-positive value, parallelism will be set to (1), which is Spark's default parallelism (1), or 1, whichever is greater. We recommend setting parallelism explicitly to a positive value because the total of Spark task slots is subject to cluster sizing.
WARNING:hyperopt-spark:Because the requested parallelism was None or a non-positive value, parallelism will be set to (1), which is Spark's default parallelism (1), or 1, whichever is greater. We recommend setting parallelism explicitly to a positive value because the total of Spark task slots is subject to cluster sizing.
7%|▋ | 2/30 [01:36<21:36, 46.32s/trial, best loss: 0.29299980027960854]
trial task 2 failed, exception is Failed in nopython mode pipeline (step: nopython mode backend)
Failed to pickle because of
PicklingError: Can't pickle <class 'collections.FlatTree'>: it's not found as collections.FlatTree
tracing...
[0]: <class 'type'>: 94387749868256.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/numba/core/serialize.py", line 305, in save
return super().save(obj)
File "/usr/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python3.7/pickle.py", line 1016, in save_type
return self.save_global(obj)
File "/usr/lib/python3.7/pickle.py", line 960, in save_global
(obj, module_name, name)) from None
_pickle.PicklingError: Can't pickle <class 'collections.FlatTree'>: it's not found as collections.FlatTree
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/hyperopt/spark.py", line 468, in run_task_on_executor
params, ctrl=None, attach_attachments=False
File "/usr/local/lib/python3.7/dist-packages/hyperopt/base.py", line 892, in evaluate
rval = self.fn(pyll_rval)
File "<ipython-input-19-360500d75c7a>", line 11, in objective
File "<ipython-input-17-7920b40f7bd3>", line 13, in generate_clusters
File "/usr/local/lib/python3.7/dist-packages/umap/umap_.py", line 2772, in fit_transform
self.fit(X, y)
File "/usr/local/lib/python3.7/dist-packages/umap/umap_.py", line 2526, in fit
verbose=self.verbose,
File "/usr/local/lib/python3.7/dist-packages/umap/umap_.py", line 340, in nearest_neighbors
compressed=False,
File "/usr/local/lib/python3.7/dist-packages/pynndescent/pynndescent_.py", line 780, in __init__
self._angular_trees,
File "/usr/local/lib/python3.7/dist-packages/pynndescent/rp_trees.py", line 994, in make_forest
for i in range(n_trees)
File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 1056, in __call__
self.retrieve()
File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 935, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/usr/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.7/dist-packages/joblib/_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 263, in __call__
for func, args, kwargs in self.items]
File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 263, in <listcomp>
for func, args, kwargs in self.items]
File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 434, in _compile_for_args
raise e
File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 367, in _compile_for_args
return self.compile(tuple(argtypes))
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 819, in compile
cres = self._compiler.compile(args, return_type)
File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 78, in compile
status, retval = self._compile_cached(args, return_type)
File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 92, in _compile_cached
retval = self._compile_core(args, return_type)
File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 110, in _compile_core
pipeline_class=self.pipeline_class)
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 627, in compile_extra
return pipeline.compile_extra(func)
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 363, in compile_extra
return self._compile_bytecode()
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 425, in _compile_bytecode
return self._compile_core()
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 405, in _compile_core
raise e
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 396, in _compile_core
pm.run(self.state)
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 341, in run
raise patched_exception
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 332, in run
self._runPass(idx, pass_inst, state)
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 291, in _runPass
mutated |= check(pss.run_pass, internal_state)
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 264, in check
mangled = func(compiler_state)
File "/usr/local/lib/python3.7/dist-packages/numba/core/typed_passes.py", line 442, in run_pass
NativeLowering().run_pass(state)
File "/usr/local/lib/python3.7/dist-packages/numba/core/typed_passes.py", line 372, in run_pass
lower.create_cpython_wrapper(flags.release_gil)
File "/usr/local/lib/python3.7/dist-packages/numba/core/lowering.py", line 244, in create_cpython_wrapper
release_gil=release_gil)
File "/usr/local/lib/python3.7/dist-packages/numba/core/cpu.py", line 161, in create_cpython_wrapper
builder.build()
File "/usr/local/lib/python3.7/dist-packages/numba/core/callwrapper.py", line 122, in build
self.build_wrapper(api, builder, closure, args, kws)
File "/usr/local/lib/python3.7/dist-packages/numba/core/callwrapper.py", line 176, in build_wrapper
obj = api.from_native_return(retty, retval, env_manager)
File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1388, in from_native_return
out = self.from_native_value(typ, val, env_manager)
File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1402, in from_native_value
return impl(typ, val, c)
File "/usr/local/lib/python3.7/dist-packages/numba/core/boxing.py", line 502, in box_namedtuple
cls_obj = c.pyapi.unserialize(c.pyapi.serialize_object(typ.instance_class))
File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1363, in serialize_object
struct = self.serialize_uncached(obj)
File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1334, in serialize_uncached
data = serialize.dumps(obj)
File "/usr/local/lib/python3.7/dist-packages/numba/core/serialize.py", line 168, in dumps
p.dump(obj)
File "/usr/lib/python3.7/pickle.py", line 437, in dump
self.save(obj)
File "/usr/local/lib/python3.7/dist-packages/numba/core/serialize.py", line 314, in save
raise _TracedPicklingError(m)
numba.core.serialize._TracedPicklingError: Failed in nopython mode pipeline (step: nopython mode backend)
Failed to pickle because of
PicklingError: Can't pickle <class 'collections.FlatTree'>: it's not found as collections.FlatTree
tracing...
[0]: <class 'type'>: 94387749868256
....
37%|███▋ | 11/30 [11:05<19:08, 60.46s/trial, best loss: 0.28700818853604954]
Total Trials: 30: 11 succeeded, 19 failed, 0 cancelled.
INFO:hyperopt-spark:Total Trials: 30: 11 succeeded, 19 failed, 0 cancelled.
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-32-0354e1710e5c> in <module>()
20 label_lower=label_lower,
21 label_upper=label_upper,
---> 22 max_evals=max_evals)
<ipython-input-30-e8f8e02ddb3a> in bayesian_search(embeddings, space, label_lower, label_upper, max_evals)
12 trials=hyperopt.SparkTrials())
13
---> 14 best_params = space_eval(space, best)
15 print ('best:')
16 print (best_params)
NameError: name 'space_eval' is not defined
edit 2:
if I run UMAP on its own, it gives me no issues. I think this must be a problem between hyperopt, spark, and umap.
Same here, but I'm not sure hyperopt is used in my project.
I wonder whether it has something to do with pyspark's optimization of cPickle serialization in Python < 3.8 environments: https://github.com/apache/spark/blob/f84018a4810867afa84658fec76494aaae6d57fc/python/pyspark/serializers.py#L361
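A quick way to check that in a fresh interpreter, consistent with the reproduction earlier in this thread:
>>> import collections
>>> original = collections.namedtuple
>>> import pyspark
>>> collections.namedtuple is original
False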
I am using pyspark 3.1.2, umap-learn 0.5.2, and Python 3.7.10.