
[BUG] Lazy loading the `pyspark.ml` module breaks on Spark clusters

Open eapolinario opened this issue 1 year ago • 1 comment

Describe the bug

After installing flytekitplugins-spark==1.10.3, we get the following error on Spark tasks:

Traceback (most recent call last):
  File "/opt/venv/bin/entrypoint.py", line 16, in <module>
    from flytekit.configuration import (
  File "/opt/venv/lib/python3.9/site-packages/flytekit/__init__.py", line 305, in <module>
    load_implicit_plugins()
  File "/opt/venv/lib/python3.9/site-packages/flytekit/__init__.py", line 301, in load_implicit_plugins
    p.load()
  File "/opt/venv/lib/python3.9/site-packages/importlib_metadata/__init__.py", line 184, in load
    module = import_module(match.group('module'))
  File "/usr/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/opt/venv/lib/python3.9/site-packages/flytekitplugins/spark/__init__.py", line 21, in <module>
    from .pyspark_transformers import PySparkPipelineModelTransformer
  File "/opt/venv/lib/python3.9/site-packages/flytekitplugins/spark/pyspark_transformers.py", line 7, in <module>
    pyspark_ml = lazy_module("pyspark.ml")
  File "/opt/venv/lib/python3.9/site-packages/flytekit/lazy_import/lazy_module.py", line 41, in lazy_module
    loader = importlib.util.LazyLoader(spec.loader)
  File "/usr/lib/python3.9/importlib/util.py", line 282, in __init__
    self.__check_eager_loader(loader)
  File "/usr/lib/python3.9/importlib/util.py", line 273, in __check_eager_loader
    raise TypeError('loader must define exec_module()')

This doesn't reproduce locally (i.e. when installing only flytekitplugins-spark==1.10.3 and interacting with a Python interpreter or running local tasks).
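For context on where the TypeError comes from: the sketch below approximates what a `lazy_module` helper built on `importlib.util.LazyLoader` does (this mirrors the standard-library lazy-import recipe; the function name and details here are illustrative, not flytekit's exact implementation). The `LazyLoader(...)` constructor is the line that raises `TypeError: loader must define exec_module()` when the module's loader is a legacy one, as seen in the traceback above.

```python
import importlib.util
import sys
import types


def lazy_module(fullname: str) -> types.ModuleType:
    """Return a module whose real import is deferred until first attribute access."""
    if fullname in sys.modules:
        return sys.modules[fullname]
    spec = importlib.util.find_spec(fullname)
    if spec is None:
        raise ModuleNotFoundError(f"No module named {fullname!r}")
    # This constructor raises TypeError if spec.loader does not implement
    # exec_module() -- the exact failure reported in this issue.
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[fullname] = module
    loader.exec_module(module)  # registers the lazy module; no real import yet
    return module


json_mod = lazy_module("json")
print(json_mod.dumps({"a": 1}))  # first attribute access triggers the actual import
```

On a normal filesystem import this works fine, which would be consistent with the bug not reproducing locally; the failure mode requires a loader (e.g. injected by the cluster environment) that lacks `exec_module()`.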

Expected behavior

Lazy-loading modules should work in all cases, including when running Spark tasks.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

eapolinario avatar Feb 10 '24 00:02 eapolinario

🤖

Based on the traceback you provided, the issue is related to the lazy loading of the pyspark.ml module. The error message indicates that the loader must define an exec_module() method, which is a requirement of Python's importlib.util.LazyLoader.

One possible reason for this issue could be a difference in Python versions between your local environment and the Spark cluster. The importlib.util.LazyLoader class was introduced in Python 3.5 and has seen various changes in subsequent Python versions. If the Python version in your Spark cluster differs from your local environment, that could be causing this issue.

Another potential cause could be the specific Spark version used in the cluster environment. Per the context provided, the Spark version is 2.4.5; there may be compatibility issues between that version of Spark and the flytekitplugins-spark==1.10.3 package.
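One hedged workaround sketch, independent of which environment difference turns out to be the cause: check whether the loader actually implements `exec_module()` before wrapping it in `LazyLoader`, and fall back to a plain eager import otherwise. `lazy_module_safe` is a hypothetical helper name for illustration, not flytekit's API.

```python
import importlib
import importlib.util
import sys
import types


def lazy_module_safe(fullname: str) -> types.ModuleType:
    """Lazy-import fullname, but import eagerly if the loader is a legacy
    loader without exec_module() (hypothetical fallback; not flytekit's code)."""
    if fullname in sys.modules:
        return sys.modules[fullname]
    spec = importlib.util.find_spec(fullname)
    if spec is None or spec.loader is None or not hasattr(spec.loader, "exec_module"):
        # LazyLoader would raise TypeError here, so skip laziness entirely.
        return importlib.import_module(fullname)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[fullname] = module
    loader.exec_module(module)
    return module
```

The cost of the fallback is only that the module is loaded up front on clusters whose loaders lack `exec_module()`, rather than failing at plugin load time.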

To further investigate this issue, could you please provide the following information:

  1. The Python version being used in your Spark cluster.
  2. The Python version being used in your local environment.
  3. The exact steps you are following to run Spark tasks.

This information will help in identifying the root cause of the issue and finding a suitable solution.

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dosubot[bot] avatar Feb 10 '24 00:02 dosubot[bot]