
Pickling error when using UMAP with pynndescent in a Spark-based environment

Open candalfigomoro opened this issue 4 years ago • 12 comments

When using UMAP (without pynndescent) in a Spark-based environment, it works fine.

However, when using UMAP with pynndescent, I get the following error:

_pickle.PicklingError: Failed in nopython mode pipeline (step: nopython mode backend)
Can't pickle <class 'collections.FlatTree'>: attribute lookup FlatTree on collections failed

Traceback (most recent call last):
  File "/conda-env/lib/python3.6/site-packages/numba/core/pythonapi.py", line 1328, in serialize_object
    gv = self.module.__serialized[obj]
KeyError: <class 'collections.FlatTree'>

I'm using numba 0.50.1.
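
For reference, a minimal sketch of the kind of code that triggers it (the data here is random and purely illustrative):

>>> import pyspark                      # running in a Spark-based environment
>>> import numpy as np
>>> import umap
>>> X = np.random.rand(10000, 20)       # large enough that UMAP uses pynndescent
>>> umap.UMAP().fit_transform(X)        # raises the PicklingError above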

candalfigomoro avatar Aug 11 '20 12:08 candalfigomoro

Hmm, this is a new one. I'll have to look into this but I can't promise any swift resolution.

lmcinnes avatar Aug 14 '20 18:08 lmcinnes

I am also running into the same issue in a Spark-based environment.

koustavmukherjee avatar Feb 19 '21 23:02 koustavmukherjee

A number of pickling issues were resolved in pynndescent v0.5 and umap v0.5 now depends on that -- are you seeing the issue with the newer versions? If so it seems like something specific to the spark environment, which is a little more confusing.

lmcinnes avatar Feb 20 '21 02:02 lmcinnes

Running into the same issue in a spark environment with umap-learn = 0.5.1 and pynndescent = 0.5.2. @koustavmukherjee Did you find any solution?

simon-slowik avatar Mar 09 '21 13:03 simon-slowik

> Running into the same issue in a spark environment with umap-learn = 0.5.1 and pynndescent = 0.5.2. @koustavmukherjee Did you find any solution?

Also having this problem with these exact versions.

Osherz5 avatar Mar 09 '21 17:03 Osherz5

Is it specifically complaining about the collections.FlatTree? If so I'll look and see if there is some reason that is still potentially causing issues.

lmcinnes avatar Mar 09 '21 20:03 lmcinnes

> Running into the same issue in a spark environment with umap-learn = 0.5.1 and pynndescent = 0.5.2. @koustavmukherjee Did you find any solution?

No @simon-slowik, unfortunately I haven't found a solution yet for working with more recent versions of umap-learn from within a Spark-based environment. For now, I have resorted to an older version of umap-learn (0.4.6) and use it without pynndescent.

koustavmukherjee avatar Mar 09 '21 23:03 koustavmukherjee

@lmcinnes yes, the error message is specifically about collections.FlatTree.

It seems pinning both umap-learn and pynndescent to 0.5.0 fixes the issue.
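
For anyone else hitting this, the pin is simply (assuming pip):

$ pip install umap-learn==0.5.0 pynndescent==0.5.0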

simon-slowik avatar Mar 10 '21 10:03 simon-slowik

Thanks @simon-slowik. It is particularly interesting that pinning to 0.5.0 works. I'll see if I can get this fixed.

lmcinnes avatar Mar 10 '21 16:03 lmcinnes

I was running into this today, and I think I found the cause. It looks like pyspark is overwriting collections.namedtuple with its own function, but that function isn't correctly setting the __module__ field.

Here's a minimal reproduction:

>>> # don't load `pyspark`
>>> import collections
>>> collections.namedtuple("n", [])
<class '__main__.n'>
>>> import pyspark
>>> import collections
>>> collections.namedtuple("n", [])
<class 'collections.n'>
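
The wrong __module__ matters because pickle serializes classes by reference: at load time the class is looked up as <module>.<qualname>, so a class that claims to live in collections but isn't actually an attribute of that module can't be pickled. Continuing the session above (a sketch; the exact message wording varies across Python versions):

>>> import pickle
>>> pickle.dumps(collections.namedtuple("n", []))
Traceback (most recent call last):
  ...
_pickle.PicklingError: Can't pickle <class 'collections.n'>: it's not found as collections.n

This is exactly what numba's serializer trips over when it has to pickle FlatTree (a namedtuple created in pynndescent.rp_trees) during compilation.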

I think there are a couple options for "fixing" this:

  1. Fix the issue in pyspark: either set the __module__ correctly (as the stdlib does) or "un-hijack" collections.namedtuple after it's done with it.
  2. Pass the caller's __name__ to namedtuple via the module= kwarg. This avoids needing to make a change to pyspark but does require a change to pynndescent (seems like @lmcinnes is the maintainer of that one too, so it shouldn't be too hard :smile: ). For example:
    >>> import pyspark
    >>> import collections
    >>> collections.namedtuple("n", [], module=__name__)
    <class '__main__.n'>
    

In the meantime, users can monkey-patch pynndescent.rp_trees.FlatTree after importing it to get around the bug:

>>> import pyspark
>>> import pynndescent
>>> pynndescent.rp_trees.FlatTree.__module__  = "pynndescent.rp_trees"
>>> ... # my code goes after this

This is definitely the grossest solution, but it should work until one of the better solutions gets merged & released.

@lmcinnes if you think (2) is a reasonable course of action, I'm happy to open a PR.

kmurphy4 avatar Jun 16 '21 03:06 kmurphy4

Sorry to bring this one back, but I am having exactly the same issue with pyspark on Colab. I am running a hyperparameter optimization over UMAP and this happens.

I use

>>> import pyspark
>>> import pynndescent
>>> pynndescent.rp_trees.FlatTree.__module__  = "pynndescent.rp_trees"

then in hyperopt's fmin I set the following argument:

>>> trials=hyperopt.SparkTrials(n_trials)

and the error follows. Note that if n_trials is None or a negative number, it seems to work properly for some trials; otherwise it spits back the error below.
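
For completeness, the surrounding call looks roughly like this (objective, space, max_evals, and n_trials are my own names, shown schematically):

>>> best = hyperopt.fmin(
...     fn=objective,                          # fits UMAP inside each trial
...     space=space,
...     algo=hyperopt.tpe.suggest,
...     max_evals=max_evals,
...     trials=hyperopt.SparkTrials(parallelism=n_trials),
... )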

I'd also like to understand how I can disable pynndescent in UMAP.

Thank you in advance.

BTW the error is:

Error Message

Because the requested parallelism was None or a non-positive value, parallelism will be set to (1), which is Spark's default parallelism (1), or 1, whichever is greater. We recommend setting parallelism explicitly to a positive value because the total of Spark task slots is subject to cluster sizing.

  7%|▋         | 2/30 [01:36<21:36, 46.32s/trial, best loss: 0.29299980027960854]

trial task 2 failed, exception is Failed in nopython mode pipeline (step: nopython mode backend)
Failed to pickle because of
  PicklingError: Can't pickle <class 'collections.FlatTree'>: it's not found as collections.FlatTree
tracing... 
 [0]: <class 'type'>: 94387749868256.
 Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/numba/core/serialize.py", line 305, in save
    return super().save(obj)
  File "/usr/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.7/pickle.py", line 1016, in save_type
    return self.save_global(obj)
  File "/usr/lib/python3.7/pickle.py", line 960, in save_global
    (obj, module_name, name)) from None
_pickle.PicklingError: Can't pickle <class 'collections.FlatTree'>: it's not found as collections.FlatTree

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/hyperopt/spark.py", line 468, in run_task_on_executor
    params, ctrl=None, attach_attachments=False
  File "/usr/local/lib/python3.7/dist-packages/hyperopt/base.py", line 892, in evaluate
    rval = self.fn(pyll_rval)
  File "<ipython-input-19-360500d75c7a>", line 11, in objective
  File "<ipython-input-17-7920b40f7bd3>", line 13, in generate_clusters
  File "/usr/local/lib/python3.7/dist-packages/umap/umap_.py", line 2772, in fit_transform
    self.fit(X, y)
  File "/usr/local/lib/python3.7/dist-packages/umap/umap_.py", line 2526, in fit
    verbose=self.verbose,
  File "/usr/local/lib/python3.7/dist-packages/umap/umap_.py", line 340, in nearest_neighbors
    compressed=False,
  File "/usr/local/lib/python3.7/dist-packages/pynndescent/pynndescent_.py", line 780, in __init__
    self._angular_trees,
  File "/usr/local/lib/python3.7/dist-packages/pynndescent/rp_trees.py", line 994, in make_forest
    for i in range(n_trees)
  File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.7/dist-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 263, in __call__
    for func, args, kwargs in self.items]
  File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 263, in <listcomp>
    for func, args, kwargs in self.items]
  File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 434, in _compile_for_args
    raise e
  File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 367, in _compile_for_args
    return self.compile(tuple(argtypes))
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 819, in compile
    cres = self._compiler.compile(args, return_type)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 78, in compile
    status, retval = self._compile_cached(args, return_type)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 92, in _compile_cached
    retval = self._compile_core(args, return_type)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py", line 110, in _compile_core
    pipeline_class=self.pipeline_class)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 627, in compile_extra
    return pipeline.compile_extra(func)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 363, in compile_extra
    return self._compile_bytecode()
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 425, in _compile_bytecode
    return self._compile_core()
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 405, in _compile_core
    raise e
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler.py", line 396, in _compile_core
    pm.run(self.state)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 341, in run
    raise patched_exception
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 332, in run
    self._runPass(idx, pass_inst, state)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 291, in _runPass
    mutated |= check(pss.run_pass, internal_state)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/compiler_machinery.py", line 264, in check
    mangled = func(compiler_state)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/typed_passes.py", line 442, in run_pass
    NativeLowering().run_pass(state)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/typed_passes.py", line 372, in run_pass
    lower.create_cpython_wrapper(flags.release_gil)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/lowering.py", line 244, in create_cpython_wrapper
    release_gil=release_gil)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/cpu.py", line 161, in create_cpython_wrapper
    builder.build()
  File "/usr/local/lib/python3.7/dist-packages/numba/core/callwrapper.py", line 122, in build
    self.build_wrapper(api, builder, closure, args, kws)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/callwrapper.py", line 176, in build_wrapper
    obj = api.from_native_return(retty, retval, env_manager)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1388, in from_native_return
    out = self.from_native_value(typ, val, env_manager)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1402, in from_native_value
    return impl(typ, val, c)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/boxing.py", line 502, in box_namedtuple
    cls_obj = c.pyapi.unserialize(c.pyapi.serialize_object(typ.instance_class))
  File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1363, in serialize_object
    struct = self.serialize_uncached(obj)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/pythonapi.py", line 1334, in serialize_uncached
    data = serialize.dumps(obj)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/serialize.py", line 168, in dumps
    p.dump(obj)
  File "/usr/lib/python3.7/pickle.py", line 437, in dump
    self.save(obj)
  File "/usr/local/lib/python3.7/dist-packages/numba/core/serialize.py", line 314, in save
    raise _TracedPicklingError(m)
numba.core.serialize._TracedPicklingError: Failed in nopython mode pipeline (step: nopython mode backend)
Failed to pickle because of
  PicklingError: Can't pickle <class 'collections.FlatTree'>: it's not found as collections.FlatTree
tracing... 
 [0]: <class 'type'>: 94387749868256
....

 37%|███▋      | 11/30 [11:05<19:08, 60.46s/trial, best loss: 0.28700818853604954]

Total Trials: 30: 11 succeeded, 19 failed, 0 cancelled.

---------------------------------------------------------------------------

NameError                                 Traceback (most recent call last)

<ipython-input-32-0354e1710e5c> in <module>()
     20                 label_lower=label_lower,
     21                 label_upper=label_upper,
---> 22                 max_evals=max_evals)

<ipython-input-30-e8f8e02ddb3a> in bayesian_search(embeddings, space, label_lower, label_upper, max_evals)
     12                 trials=hyperopt.SparkTrials())
     13 
---> 14     best_params = space_eval(space, best)
     15     print ('best:')
     16     print (best_params)

NameError: name 'space_eval' is not defined

Edit 2:

If I run UMAP by itself, it gives me no issues. I think this must be a problem between hyperopt, Spark, and UMAP.

jamslaugh avatar Apr 09 '22 12:04 jamslaugh

> Sorry to bring this one back, but I am having exactly the same issue with pyspark on Colab. [...] I think this must be a problem between hyperopt, Spark, and UMAP.

Same here, though I'm not sure hyperopt is used in my project.

I wonder if it has something to do with pyspark's optimization of cPickle serialization in Python < 3.8 environments: https://github.com/apache/spark/blob/f84018a4810867afa84658fec76494aaae6d57fc/python/pyspark/serializers.py#L361

I am using pyspark 3.1.2, umap-learn 0.5.2, and Python 3.7.10.
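
A quick way to verify that the namedtuple hijack is active in a given session, reusing the probe from earlier in this thread (under the hijack the resulting class reports collections as its module even though it doesn't actually live there):

>>> import pyspark
>>> import collections
>>> collections.namedtuple("probe", []).__module__
'collections'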

Wh1isper avatar Apr 13 '22 08:04 Wh1isper