
Pickling error when using UMAP with pynndescent in a Spark-based environment

Open candalfigomoro opened this issue 4 years ago • 12 comments

When using UMAP (without pynndescent) in a Spark-based environment, it works fine.

However, when using UMAP with pynndescent, I get the following error:

_pickle.PicklingError: Failed in nopython mode pipeline (step: nopython mode backend)
Can't pickle <class 'collections.FlatTree'>: attribute lookup FlatTree on collections failed

Traceback (most recent call last):
  File "/conda-env/lib/python3.6/site-packages/numba/core/pythonapi.py", line 1328, in serialize_object
    gv = self.module.__serialized[obj]
KeyError: <class 'collections.FlatTree'>

I'm using numba 0.50.1.
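
For reference, a minimal sketch of the kind of code that triggers it (the data here is random and purely illustrative):

>>> import pyspark                      # running in a Spark-based environment
>>> import numpy as np
>>> import umap
>>> X = np.random.rand(10000, 20)       # large enough that UMAP uses pynndescent
>>> umap.UMAP().fit_transform(X)        # raises the PicklingError above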

candalfigomoro avatar Aug 11 '20 12:08 candalfigomoro

Hmm, this is a new one. I'll have to look into this but I can't promise any swift resolution.

lmcinnes avatar Aug 14 '20 18:08 lmcinnes

I am also running into the same issue in a Spark-based environment.

koustavmukherjee avatar Feb 19 '21 23:02 koustavmukherjee

A number of pickling issues were resolved in pynndescent v0.5 and umap v0.5 now depends on that -- are you seeing the issue with the newer versions? If so it seems like something specific to the spark environment, which is a little more confusing.

lmcinnes avatar Feb 20 '21 02:02 lmcinnes

Running into the same issue in a spark environment with umap-learn = 0.5.1 and pynndescent = 0.5.2. @koustavmukherjee Did you find any solution?

simon-slowik avatar Mar 09 '21 13:03 simon-slowik

> Running into the same issue in a spark environment with umap-learn = 0.5.1 and pynndescent = 0.5.2. @koustavmukherjee Did you find any solution?

Also having this problem with these exact versions.

Osherz5 avatar Mar 09 '21 17:03 Osherz5

Is it specifically complaining about the collections.FlatTree? If so I'll look and see if there is some reason that is still potentially causing issues.

lmcinnes avatar Mar 09 '21 20:03 lmcinnes

> Running into the same issue in a spark environment with umap-learn = 0.5.1 and pynndescent = 0.5.2. @koustavmukherjee Did you find any solution?

No @simon-slowik, unfortunately I haven't found a solution yet for working with more recent versions of umap-learn from within a Spark-based environment. For now, I have resorted to an older version of umap-learn (0.4.6) and use it without pynndescent.

koustavmukherjee avatar Mar 09 '21 23:03 koustavmukherjee

@lmcinnes yes, the error message is specifically about collections.FlatTree.

It seems pinning both umap-learn and pynndescent to 0.5.0 fixes the issue.
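
For anyone else hitting this, the pin is simply (assuming pip):

$ pip install umap-learn==0.5.0 pynndescent==0.5.0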

simon-slowik avatar Mar 10 '21 10:03 simon-slowik

Thanks @simon-slowik. It is particularly interesting that pinning to 0.5.0 works. I'll see if I can get this fixed.

lmcinnes avatar Mar 10 '21 16:03 lmcinnes

I was running into this today, and I think I found the cause. It looks like pyspark is overwriting collections.namedtuple with its own function, but that function isn't correctly setting the __module__ field.

Here's a minimal reproduction:

>>> # don't load `pyspark`
>>> import collections
>>> collections.namedtuple("n", [])
<class '__main__.n'>
>>> import pyspark
>>> import collections
>>> collections.namedtuple("n", [])
<class 'collections.n'>
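
The wrong __module__ matters because pickle serializes classes by reference: at load time the class is looked up as <module>.<qualname>, so a class that claims to live in collections but isn't actually an attribute of that module can't be pickled. Continuing the session above (a sketch; the exact message wording varies across Python versions):

>>> import pickle
>>> pickle.dumps(collections.namedtuple("n", []))
Traceback (most recent call last):
  ...
_pickle.PicklingError: Can't pickle <class 'collections.n'>: it's not found as collections.n

This is exactly what numba's serializer trips over when it has to pickle FlatTree (a namedtuple created in pynndescent.rp_trees) during compilation.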

I think there are a couple options for "fixing" this:

  1. Fix the issue in pyspark: either set the __module__ correctly (as the stdlib does) or "un-hijack" collections.namedtuple after it's done with it.
  2. Pass the caller's __name__ to namedtuple via the module= kwarg. This avoids needing to make a change to pyspark but does require a change to pynndescent (seems like @lmcinnes is the maintainer of that one too, so it shouldn't be too hard :smile: ). For example:
    >>> import pyspark
    >>> import collections
    >>> collections.namedtuple("n", [], module=__name__)
    <class '__main__.n'>
    

In the meantime, users can monkey-patch pynndescent.rp_trees.FlatTree after importing it to get around the bug:

>>> import pyspark
>>> import pynndescent
>>> pynndescent.rp_trees.FlatTree.__module__  = "pynndescent.rp_trees"
>>> ... # my code goes after this

This is definitely the grossest solution, but it should work until one of the better solutions gets merged & released.

@lmcinnes if you think (2) is a reasonable course of action, I'm happy to open a PR.

kmurphy4 avatar Jun 16 '21 03:06 kmurphy4

Sorry to bring this one back, but I am having exactly the same issue with pyspark on Colab. I am running a hyperparameter optimization over UMAP and this happens.

I use

>>> import pyspark
>>> import pynndescent
>>> pynndescent.rp_trees.FlatTree.__module__  = "pynndescent.rp_trees"

then in hyperopt's fmin I set the following argument:

>>> trials=hyperopt.SparkTrials(n_trials)

and the error follows. Note that if n_trials is None or a negative number, it seems to work properly for some trials; otherwise it spits back the error below.
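
For completeness, the surrounding call looks roughly like this (objective, space, max_evals, and n_trials are my own names, shown schematically):

>>> best = hyperopt.fmin(
...     fn=objective,                          # fits UMAP inside each trial
...     space=space,
...     algo=hyperopt.tpe.suggest,
...     max_evals=max_evals,
...     trials=hyperopt.SparkTrials(parallelism=n_trials),
... )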

I'd also like to understand how I can disable pynndescent in UMAP.

Thank you in advance.

BTW the error is:

Error Message

Because the requested parallelism was None or a non-positive value, parallelism will be set to (1), which is Spark's default parallelism (1), or 1, whichever is greater. We recommend setting parallelism explicitly to a positive value because the total of Spark task slots is subject to cluster sizing.

  7%|▋         | 2/30 [01:36<21:36, 46.32s/trial, best loss: 0.29299980027960854]

trial task 2 failed, exception is Failed in nopython mode pipeline (step: nopython mode backend)
Failed to pickle because of
  PicklingError: Can't pickle <class 'collections.FlatTree'>: it's not found as collections.FlatTree
tracing... 
 [0]: <class 'type'>: 94387749868256.
 Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/numba/core/serialize.py", line 305, in save
    return super().save(obj)
  File "/usr/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.7/pickle.py", line 1016, in save_type
    return self.save_global(obj)
  File "/usr/lib/python3.7/pickle.py", line 960, in save_global
    (obj, module_name, name)) from None
_pickle.PicklingError: Can't pickle <class 'collections.FlatTree'>: it's not found as collections.FlatTree

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/hyperopt/spark.py", line 468, in run_task_on_executor
    params, ctrl=None, attach_attachments=False
  File "/usr/local/lib/python3.7/dist-packages/hyperopt/base.py", line 892, in evaluate
    rval = self.fn(pyll_rval)
  File "<ipython-input-19-360500d75c7a>", line 11, in objective
  File "<ipython-input-17-7920b40f7bd3>", line 13, in generate_clusters
  File "/usr/local/lib/python3.7/dist-packages/umap/umap_.py", line 2772, in fit_transform
    self.fit(X, y)
  File "/usr/local/lib/python3.7/dist-packages/umap/umap_.py", line 2526, in fit
    verbose=self.verbose,
  File "/usr/local/lib/python3.7/dist-packages/umap/umap_.py", line 340, in nearest_neighbors
    compressed=False,
  File "/usr/local/lib/python3.7/dist-packages/pynndescent/pynndescent_.py", line 780, in __init__
    self._angular_trees,
  File "/usr/local/lib/python3.7/dist-packages/pynndescent/rp_trees.py", line 994, in make_forest
    for i in range(n_trees)
  File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.7/dist-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 263, in __call__
    for func, args, kwargs in self.items]
  File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 263, in <listcomp>
    for func, args, kwargs in self.items]
  File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 434, in _compile_for_args
    raise e
  File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 367, in _compile_for_args
    return self.compile(tuple(argtypes))
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 819, in compile
    cres = self._compiler.compile(args, return_type)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 78, in compile
    status, retval = self._compile_cached(args, return_type)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 92, in _compile_cached
    retval = self._compile_core(args, return_type)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 110, in _compile_core
    pipeline_class=self.pipeline_class)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 627, in compile_extra
    return pipeline.compile_extra(func)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 363, in compile_extra
    return self._compile_bytecode()
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 425, in _compile_bytecode
    return self._compile_core()
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 405, in _compile_core
    raise e
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 396, in _compile_core
    pm.run(self.state)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 341, in run
    raise patched_exception
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 332, in run
    self._runPass(idx, pass_inst, state)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 291, in _runPass
    mutated |= check(pss.run_pass, internal_state)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 264, in check
    mangled = func(compiler_state)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/typed_passes.py", line 442, in run_pass
    NativeLowering().run_pass(state)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/typed_passes.py", line 372, in run_pass
    lower.create_cpython_wrapper(flags.release_gil)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/lowering.py", line 244, in create_cpython_wrapper
    release_gil=release_gil)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/cpu.py", line 161, in create_cpython_wrapper
    builder.build()
  File "/usr/local/lib/python3.7/dist-packages/numba/core/callwrapper.py", line 122, in build
    self.build_wrapper(api, builder, closure, args, kws)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/callwrapper.py", line 176, in build_wrapper
    obj = api.from_native_return(retty, retval, env_manager)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1388, in from_native_return
    out = self.from_native_value(typ, val, env_manager)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1402, in from_native_value
    return impl(typ, val, c)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/boxing.py", line 502, in box_namedtuple
    cls_obj = c.pyapi.unserialize(c.pyapi.serialize_object(typ.instance_class))
  File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1363, in serialize_object
    struct = self.serialize_uncached(obj)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1334, in serialize_uncached
    data = serialize.dumps(obj)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/serialize.py", line 168, in dumps
    p.dump(obj)
  File "/usr/lib/python3.7/pickle.py", line 437, in dump
    self.save(obj)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/serialize.py", line 314, in save
    raise _TracedPicklingError(m)
numba.core.serialize._TracedPicklingError: Failed in nopython mode pipeline (step: nopython mode backend)
Failed to pickle because of
  PicklingError: Can't pickle <class 'collections.FlatTree'>: it's not found as collections.FlatTree
tracing... 
 [0]: <class 'type'>: 94387749868256
....

 37%|███▋      | 11/30 [11:05<19:08, 60.46s/trial, best loss: 0.28700818853604954]

Total Trials: 30: 11 succeeded, 19 failed, 0 cancelled.

---------------------------------------------------------------------------

NameError                                 Traceback (most recent call last)

<ipython-input-32-0354e1710e5c> in <module>()
     20                 label_lower=label_lower,
     21                 label_upper=label_upper,
---> 22                 max_evals=max_evals)

<ipython-input-30-e8f8e02ddb3a> in bayesian_search(embeddings, space, label_lower, label_upper, max_evals)
     12                 trials=hyperopt.SparkTrials())
     13 
---> 14     best_params = space_eval(space, best)
     15     print ('best:')
     16     print (best_params)

NameError: name 'space_eval' is not defined

Edit 2:

If I run UMAP by itself, it gives me no issues. I think this must be a problem between hyperopt, Spark, and UMAP.

jamslaugh avatar Apr 09 '22 12:04 jamslaugh

> Sorry to bring this one back, but I am having exactly the same issue with pyspark on Colab. [...] I think this must be a problem between hyperopt, Spark, and UMAP.

Same here, though I'm not sure hyperopt is used in my project.

I wonder if it has something to do with pyspark's optimization of cPickle serialization in Python < 3.8 environments: https://github.com/apache/spark/blob/f84018a4810867afa84658fec76494aaae6d57fc/python/pyspark/serializers.py#L361

I am using pyspark 3.1.2, umap-learn 0.5.2, and Python 3.7.10.
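
A quick way to verify that the namedtuple hijack is active in a given session, reusing the probe from earlier in this thread (under the hijack the resulting class reports collections as its module even though it doesn't actually live there):

>>> import pyspark
>>> import collections
>>> collections.namedtuple("probe", []).__module__
'collections'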

Wh1isper avatar Apr 13 '22 08:04 Wh1isper