Pickling error when using UMAP with pynndescent in a Spark-based environment
When using UMAP (without pynndescent) in a Spark-based environment, it works fine. However, when using UMAP with pynndescent, I get the following error:
_pickle.PicklingError: Failed in nopython mode pipeline (step: nopython mode backend)
Can't pickle <class 'collections.FlatTree'>: attribute lookup FlatTree on collections failed
Traceback (most recent call last):
File "/conda-env/lib/python3.6/site-packages/numba/core/pythonapi.py", line 1328, in serialize_object
gv = self.module.__serialized[obj]
KeyError: <class 'collections.FlatTree'>
I am using numba 0.50.1.
Hmm, this is a new one. I'll have to look into this but I can't promise any swift resolution.
I am also running into the same issue in a Spark-based environment.
A number of pickling issues were resolved in pynndescent v0.5 and umap v0.5 now depends on that -- are you seeing the issue with the newer versions? If so it seems like something specific to the spark environment, which is a little more confusing.
Running into the same issue in a spark environment with umap-learn = 0.5.1 and pynndescent = 0.5.2. @koustavmukherjee Did you find any solution?
Also having this problem with these exact versions
Is it specifically complaining about `collections.FlatTree`? If so I'll look and see if there is some reason that is still potentially causing issues.
No @simon-slowik unfortunately I didn't find a solution yet for working with more recent versions of umap-learn from within a spark based environment. For now, I resorted to an older version of umap-learn (0.4.6) and using it without pynndescent.
@lmcinnes yes, the error message is specifically about `collections.FlatTree`.
It seems pinning both umap-learn and pynndescent to 0.5.0 fixes the issue.
Thanks @simon-slowik. It is particularly interesting that pinning to 0.5.0 works. I'll see if I can get this fixed.
I was running into this today, and I think I found the cause. It looks like pyspark is overwriting `collections.namedtuple` with its own function, but that function isn't correctly setting the `__module__` field.
Here's a minimal reproduction:
>>> # don't load `pyspark`
>>> import collections
>>> collections.namedtuple("n", [])
<class '__main__.n'>
>>> import pyspark
>>> import collections
>>> collections.namedtuple("n", [])
<class 'collections.n'>
I think there are a couple of options for "fixing" this:

1. Fix the issue in `pyspark`: either set `__module__` correctly (as the stdlib does) or "un-hijack" `collections.namedtuple` after they're done with it.
2. Pass the `module=` kwarg to `namedtuple` as `__name__`. This avoids needing to make a change to `pyspark` but does require a change to `pynndescent` (seems like @lmcinnes is the maintainer of that one too, so it shouldn't be too hard :smile: ). For example:

>>> import pyspark
>>> import collections
>>> collections.namedtuple("n", [], module=__name__)
<class '__main__.n'>
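For reference, a minimal sketch of what a change like (2) might look like inside `pynndescent/rp_trees.py` (the field names below are illustrative placeholders, not necessarily the actual `FlatTree` definition):

from collections import namedtuple

# Passing module=__name__ pins the generated class's __module__ to this
# module, so pickling still works even when pyspark has replaced
# collections.namedtuple with a version that mislabels it.
FlatTree = namedtuple(
    "FlatTree",
    ["hyperplanes", "offsets", "children", "indices", "leaf_size"],  # illustrative fields
    module=__name__,
)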
In the meantime, users can monkey-patch `pynndescent.rp_trees.FlatTree` after importing it to get around the bug:
>>> import pyspark
>>> import pynndescent
>>> pynndescent.rp_trees.FlatTree.__module__ = "pynndescent.rp_trees"
>>> ... # my code goes after this
This is definitely the grossest solution, but it should work until one of the better solutions gets merged & released.
@lmcinnes if you think (2) is a reasonable course of action, I'm happy to open a PR.
Sorry to bring this one back, but I am having exactly the same issue with pyspark on Colab. I am running a hyperparameter optimization over UMAP when this happens.
I use
>>> import pyspark
>>> import pynndescent
>>> pynndescent.rp_trees.FlatTree.__module__ = "pynndescent.rp_trees"
then in `fmin` from `hyperopt` I set the following argument:
>>> trials=hyperopt.SparkTrials(n_trials)
and the error follows. Please note that if `n_trials` is `None` or a negative number, it seems to work properly for some trials; otherwise it spits back this error.
I'd also like to understand how I can disable pynndescent when using UMAP.
Thank you in advance.
BTW the error is:
Error Message
Because the requested parallelism was None or a non-positive value, parallelism will be set to (1), which is Spark's default parallelism (1), or 1, whichever is greater. We recommend setting parallelism explicitly to a positive value because the total of Spark task slots is subject to cluster sizing.
WARNING:hyperopt-spark:Because the requested parallelism was None or a non-positive value, parallelism will be set to (1), which is Spark's default parallelism (1), or 1, whichever is greater. We recommend setting parallelism explicitly to a positive value because the total of Spark task slots is subject to cluster sizing.
7%|▋ | 2/30 [01:36<21:36, 46.32s/trial, best loss: 0.29299980027960854]
trial task 2 failed, exception is Failed in nopython mode pipeline (step: nopython mode backend)
Failed to pickle because of
PicklingError: Can't pickle <class 'collections.FlatTree'>: it's not found as collections.FlatTree
tracing...
[0]: <class 'type'>: 94387749868256.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/numba/core/serialize.py", line 305, in save
return super().save(obj)
File "/usr/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python3.7/pickle.py", line 1016, in save_type
return self.save_global(obj)
File "/usr/lib/python3.7/pickle.py", line 960, in save_global
(obj, module_name, name)) from None
_pickle.PicklingError: Can't pickle <class 'collections.FlatTree'>: it's not found as collections.FlatTree
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/hyperopt/spark.py", line 468, in run_task_on_executor
params, ctrl=None, attach_attachments=False
File "/usr/local/lib/python3.7/dist-packages/hyperopt/base.py", line 892, in evaluate
rval = self.fn(pyll_rval)
File "<ipython-input-19-360500d75c7a>", line 11, in objective
File "<ipython-input-17-7920b40f7bd3>", line 13, in generate_clusters
File "/usr/local/lib/python3.7/dist-packages/umap/umap_.py", line 2772, in fit_transform
self.fit(X, y)
File "/usr/local/lib/python3.7/dist-packages/umap/umap_.py", line 2526, in fit
verbose=self.verbose,
File "/usr/local/lib/python3.7/dist-packages/umap/umap_.py", line 340, in nearest_neighbors
compressed=False,
File "/usr/local/lib/python3.7/dist-packages/pynndescent/pynndescent_.py", line 780, in __init__
self._angular_trees,
File "/usr/local/lib/python3.7/dist-packages/pynndescent/rp_trees.py", line 994, in make_forest
for i in range(n_trees)
File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 1056, in __call__
self.retrieve()
File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 935, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/usr/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.7/dist-packages/joblib/_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 263, in __call__
for func, args, kwargs in self.items]
File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 263, in <listcomp>
for func, args, kwargs in self.items]
File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 434, in _compile_for_args
raise e
File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 367, in _compile_for_args
return self.compile(tuple(argtypes))
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 819, in compile
cres = self._compiler.compile(args, return_type)
File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 78, in compile
status, retval = self._compile_cached(args, return_type)
File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 92, in _compile_cached
retval = self._compile_core(args, return_type)
File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 110, in _compile_core
pipeline_class=self.pipeline_class)
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 627, in compile_extra
return pipeline.compile_extra(func)
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 363, in compile_extra
return self._compile_bytecode()
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 425, in _compile_bytecode
return self._compile_core()
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 405, in _compile_core
raise e
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 396, in _compile_core
pm.run(self.state)
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 341, in run
raise patched_exception
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 332, in run
self._runPass(idx, pass_inst, state)
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 291, in _runPass
mutated |= check(pss.run_pass, internal_state)
File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 264, in check
mangled = func(compiler_state)
File "/usr/local/lib/python3.7/dist-packages/numba/core/typed_passes.py", line 442, in run_pass
NativeLowering().run_pass(state)
File "/usr/local/lib/python3.7/dist-packages/numba/core/typed_passes.py", line 372, in run_pass
lower.create_cpython_wrapper(flags.release_gil)
File "/usr/local/lib/python3.7/dist-packages/numba/core/lowering.py", line 244, in create_cpython_wrapper
release_gil=release_gil)
File "/usr/local/lib/python3.7/dist-packages/numba/core/cpu.py", line 161, in create_cpython_wrapper
builder.build()
File "/usr/local/lib/python3.7/dist-packages/numba/core/callwrapper.py", line 122, in build
self.build_wrapper(api, builder, closure, args, kws)
File "/usr/local/lib/python3.7/dist-packages/numba/core/callwrapper.py", line 176, in build_wrapper
obj = api.from_native_return(retty, retval, env_manager)
File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1388, in from_native_return
out = self.from_native_value(typ, val, env_manager)
File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1402, in from_native_value
return impl(typ, val, c)
File "/usr/local/lib/python3.7/dist-packages/numba/core/boxing.py", line 502, in box_namedtuple
cls_obj = c.pyapi.unserialize(c.pyapi.serialize_object(typ.instance_class))
File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1363, in serialize_object
struct = self.serialize_uncached(obj)
File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1334, in serialize_uncached
data = serialize.dumps(obj)
File "/usr/local/lib/python3.7/dist-packages/numba/core/serialize.py", line 168, in dumps
p.dump(obj)
File "/usr/lib/python3.7/pickle.py", line 437, in dump
self.save(obj)
File "/usr/local/lib/python3.7/dist-packages/numba/core/serialize.py", line 314, in save
raise _TracedPicklingError(m)
numba.core.serialize._TracedPicklingError: Failed in nopython mode pipeline (step: nopython mode backend)
Failed to pickle because of
PicklingError: Can't pickle <class 'collections.FlatTree'>: it's not found as collections.FlatTree
tracing...
[0]: <class 'type'>: 94387749868256
....
....
....
....
....
....
37%|███▋ | 11/30 [11:05<19:08, 60.46s/trial, best loss: 0.28700818853604954]
Total Trials: 30: 11 succeeded, 19 failed, 0 cancelled.
INFO:hyperopt-spark:Total Trials: 30: 11 succeeded, 19 failed, 0 cancelled.
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-32-0354e1710e5c> in <module>()
20 label_lower=label_lower,
21 label_upper=label_upper,
---> 22 max_evals=max_evals)
<ipython-input-30-e8f8e02ddb3a> in bayesian_search(embeddings, space, label_lower, label_upper, max_evals)
12 trials=hyperopt.SparkTrials())
13
---> 14 best_params = space_eval(space, best)
15 print ('best:')
16 print (best_params)
NameError: name 'space_eval' is not defined
Edit 2: if I run umap on its own, it gives me no issues. I think this must be a problem between hyperopt, spark, and umap.
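One thing worth trying, given the workaround above: `SparkTrials` runs each trial in a Spark executor process, so a monkey-patch applied only on the driver may never take effect where the pickling happens. Below is a minimal sketch of re-applying the patch inside the objective itself (`objective`, `X`, `some_loss`, and the search space are illustrative placeholders, not code from this thread):

import hyperopt
from hyperopt import fmin, hp, tpe

def objective(params):
    # Re-apply the monkey-patch here so it runs inside each Spark executor
    # process, not only on the driver (assumption: executors import
    # pynndescent fresh, so a driver-side patch does not propagate).
    import pynndescent
    pynndescent.rp_trees.FlatTree.__module__ = "pynndescent.rp_trees"

    import umap
    reducer = umap.UMAP(n_neighbors=int(params["n_neighbors"]))
    embedding = reducer.fit_transform(X)  # X: illustrative input data
    return some_loss(embedding)           # some_loss: illustrative loss function

best = fmin(
    fn=objective,
    space={"n_neighbors": hp.quniform("n_neighbors", 5, 50, 1)},
    algo=tpe.suggest,
    max_evals=30,
    trials=hyperopt.SparkTrials(parallelism=4),
)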
Same here, but I'm not sure hyperopt is used in my project.
I wonder if it has something to do with pyspark's cPickle serialization optimization in Python < 3.8 environments: https://github.com/apache/spark/blob/f84018a4810867afa84658fec76494aaae6d57fc/python/pyspark/serializers.py#L361
I am using pyspark 3.1.2 and umap-learn 0.5.2, with Python 3.7.10.
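For anyone who wants to check whether that code path is the culprit in their environment, here is a minimal diagnostic sketch (assuming pyspark is installed; `Probe` is just a throwaway name):

import sys
import collections

original = collections.namedtuple
import pyspark  # noqa: F401  (importing pyspark may patch collections.namedtuple)

print("Python < 3.8:", sys.version_info < (3, 8))
print("namedtuple patched:", collections.namedtuple is not original)

# The patched function produces classes whose __module__ is "collections"
# rather than the caller's module, which is what breaks pickling of FlatTree.
print(collections.namedtuple("Probe", []).__module__)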