[BUG] Lazy loading the `pyspark.ml` module breaks on Spark clusters
Describe the bug
After installing `flytekitplugins-spark==1.10.3`, we get the following error when running Spark tasks:
Traceback (most recent call last):
  File "/opt/venv/bin/entrypoint.py", line 16, in <module>
    from flytekit.configuration import (
  File "/opt/venv/lib/python3.9/site-packages/flytekit/__init__.py", line 305, in <module>
    load_implicit_plugins()
  File "/opt/venv/lib/python3.9/site-packages/flytekit/__init__.py", line 301, in load_implicit_plugins
    p.load()
  File "/opt/venv/lib/python3.9/site-packages/importlib_metadata/__init__.py", line 184, in load
    module = import_module(match.group('module'))
  File "/usr/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/opt/venv/lib/python3.9/site-packages/flytekitplugins/spark/__init__.py", line 21, in <module>
    from .pyspark_transformers import PySparkPipelineModelTransformer
  File "/opt/venv/lib/python3.9/site-packages/flytekitplugins/spark/pyspark_transformers.py", line 7, in <module>
    pyspark_ml = lazy_module("pyspark.ml")
  File "/opt/venv/lib/python3.9/site-packages/flytekit/lazy_import/lazy_module.py", line 41, in lazy_module
    loader = importlib.util.LazyLoader(spec.loader)
  File "/usr/lib/python3.9/importlib/util.py", line 282, in __init__
    self.__check_eager_loader(loader)
  File "/usr/lib/python3.9/importlib/util.py", line 273, in __check_eager_loader
    raise TypeError('loader must define exec_module()')
This doesn't reproduce locally (i.e., when installing only flytekitplugins-spark==1.10.3 and interacting with a Python interpreter or running tasks locally).
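For reference, the failing call can be exercised directly in a local interpreter. A minimal sketch of that check follows; the import path is inferred from the traceback and should be treated as an assumption if flytekit re-exports the helper elsewhere:

```python
# Hypothetical local check: plain Python 3.9 virtualenv with
# flytekitplugins-spark==1.10.3 installed; import path inferred from the traceback.
from flytekit.lazy_import.lazy_module import lazy_module

pyspark_ml = lazy_module("pyspark.ml")  # returns a lazily loaded module proxy
print(type(pyspark_ml))                 # succeeds locally; the TypeError only appears on the cluster
```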
Expected behavior
Lazy-loading modules should work in all cases, including when running Spark tasks.
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes
🤖
Based on the traceback you provided, it seems like the issue is related to the lazy loading of the pyspark.ml module. The error message indicates that the loader must define an exec_module() method, which is a requirement for Python's importlib.util.LazyLoader.
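For context, here is a minimal sketch of the standard-library lazy-import pattern that runs into the same check. It is roughly what a `lazy_module()` helper does, shown only to illustrate where the `TypeError` comes from; it is not flytekit's exact implementation:

```python
import importlib.util
import sys

def lazy_import(name: str):
    # Standard library recipe for deferring an import until first use.
    spec = importlib.util.find_spec(name)
    if spec is None:
        raise ModuleNotFoundError(name)
    # LazyLoader.__init__ verifies that the wrapped loader defines exec_module();
    # a legacy loader that only provides load_module() triggers
    # TypeError("loader must define exec_module()").
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    # The real import is deferred until the first attribute access on `module`.
    loader.exec_module(module)
    return module
```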
One possible reason for this issue could be a difference in Python versions between your local environment and the Spark cluster. The importlib.util.LazyLoader class was introduced in Python 3.5 and has seen various changes in subsequent Python versions. If the Python version in your Spark cluster differs from your local environment, it might be causing this issue.
Another potential cause could be the specific version of Spark being used in the cluster environment. As per the context provided, the Spark version is 2.4.5. There might be compatibility issues between this version of Spark and the flytekitplugins-spark==1.10.3 package.
To further investigate this issue, could you please provide the following information (see the sketch after this list for one way to collect it):
- The Python version being used in your Spark cluster.
- The Python version being used in your local environment.
- The exact steps you are following to run Spark tasks.
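If it helps, here is a hedged sketch for capturing the driver and executor Python versions together with the Spark version, assuming a SparkSession can be created inside the task:

```python
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

print("driver Python  :", sys.version)
print("Spark version  :", spark.version)
# Python version as seen on the executors (may differ from the driver's)
print("executor Python:", sc.parallelize([0]).map(lambda _: sys.version).first())
```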
This information will help in identifying the root cause of the issue and finding a suitable solution.