PySpark is not included in the requirements.txt file of a new Kedro project
Description
After starting a new kedro project with all the packages selected, I went into my project folder to install the requirements and PySpark isn't being installed because it's not included in the list of packages.
Context
The lack of PySpark is preventing the application from running.
Steps to Reproduce
- python -m venv venv
- ./venv/Scripts/activate.ps1
- pip install kedro
- kedro new
- select all packages and answer yes to pipeline example
- cd app
- pip install -r requirements.txt
- kedro run
Expected Result
Open application
Actual Result
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in _run_module_as_main:198 │
│ in _run_code:88 │
│ │
│ in <module>:7 │
│ │
│ F:\Testes\kedro-test\venv\Lib\site-packages\kedro\framework\cli\cli.py:233 in main │
│ │
│ 230 │ cli_collection = KedroCLI( │
│ 231 │ │ project_path=_find_kedro_project(Path.cwd()) or Path.cwd() │
│ 232 │ ) │
│ ❱ 233 │ cli_collection() │
│ 234 │
│ │
│ F:\Testes\kedro-test\venv\Lib\site-packages\click\core.py:1157 in __call__ │
│ │
│ F:\Testes\kedro-test\venv\Lib\site-packages\kedro\framework\cli\cli.py:130 in main │
│ │
│ 127 │ │ ) │
│ 128 │ │ │
│ 129 │ │ try: │
│ ❱ 130 │ │ │ super().main( │
│ 131 │ │ │ │ args=args, │
│ 132 │ │ │ │ prog_name=prog_name, │
│ 133 │ │ │ │ complete_var=complete_var, │
│ │
│ F:\Testes\kedro-test\venv\Lib\site-packages\click\core.py:1078 in main │
│ │
│ F:\Testes\kedro-test\venv\Lib\site-packages\click\core.py:1688 in invoke │
│ │
│ F:\Testes\kedro-test\venv\Lib\site-packages\click\core.py:1434 in invoke │
│ │
│ F:\Testes\kedro-test\venv\Lib\site-packages\click\core.py:783 in invoke │
│ │
│ F:\Testes\kedro-test\venv\Lib\site-packages\kedro\framework\cli\project.py:222 in run │
│ │
│ 219 │ tuple_tags = tuple(tags) │
│ 220 │ tuple_node_names = tuple(node_names) │
│ 221 │ │
│ ❱ 222 │ with KedroSession.create( │
│ 223 │ │ env=env, conf_source=conf_source, extra_params=params │
│ 224 │ ) as session: │
│ 225 │ │ session.run( │
│ │
│ F:\Testes\kedro-test\venv\Lib\site-packages\kedro\framework\session\session.py:151 in create │
│ │
│ 148 │ │ Returns: │
│ 149 │ │ │ A new ``KedroSession`` instance. │
│ 150 │ │ """ │
│ ❱ 151 │ │ validate_settings() │
│ 152 │ │ │
│ 153 │ │ session = cls( │
│ 154 │ │ │ project_path=project_path, │
│ │
│ F:\Testes\kedro-test\venv\Lib\site-packages\kedro\framework\project\__init__.py:293 in │
│ validate_settings │
│ │
│ 290 │ │ ) │
│ 291 │ # Check if file exists, if it does, validate it. │
│ 292 │ if importlib.util.find_spec(f"{PACKAGE_NAME}.settings") is not None: │
│ ❱ 293 │ │ importlib.import_module(f"{PACKAGE_NAME}.settings") │
│ 294 │ else: │
│ 295 │ │ logger = logging.getLogger(__name__) │
│ 296 │ │ logger.warning("No 'settings.py' found, defaults will be used.") │
│ │
│ C:\Users\vitor\AppData\Local\Programs\Python\Python311\Lib\importlib\__init__.py:126 in │
│ import_module │
│ │
│ 123 │ │ │ if character != '.': │
│ 124 │ │ │ │ break │
│ 125 │ │ │ level += 1 │
│ ❱ 126 │ return _bootstrap._gcd_import(name[level:], package, level) │
│ 127 │
│ 128 │
│ 129 _RELOADING = {} │
│ in _gcd_import:1204 │
│ in _find_and_load:1176 │
│ in _find_and_load_unlocked:1147 │
│ in _load_unlocked:690 │
│ in exec_module:940 │
│ in _call_with_frames_removed:241 │
│ │
│ F:\Testes\kedro-test\api\src\api\settings.py:6 in <module> │
│ │
│ 3 https://docs.kedro.org/en/stable/kedro_project_setup/settings.html.""" │
│ 4 │
│ 5 # Instantiated project hooks. │
│ ❱ 6 from api.hooks import SparkHooks # noqa: E402 │
│ 7 │
│ 8 # Hooks are executed in a Last-In-First-Out (LIFO) order. │
│ 9 HOOKS = (SparkHooks(),) │
│ │
│ F:\Testes\kedro-test\api\src\api\hooks.py:2 in <module> │
│ │
│ 1 from kedro.framework.hooks import hook_impl │
│ ❱ 2 from pyspark import SparkConf │
│ 3 from pyspark.sql import SparkSession │
│ 4 │
│ 5 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'pyspark'
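The traceback ends in the project's hooks.py trying to import pyspark. A minimal sketch for checking which of the relevant packages are importable from the active virtual environment is shown here. The helper name describe_module is hypothetical and is not part of Kedro; it only relies on the standard library.

```python
import importlib.util
from importlib import metadata
from typing import Optional


def describe_module(module_name: str, dist_name: Optional[str] = None) -> str:
    """Report whether a top-level module can be imported by this interpreter.

    describe_module is a hypothetical helper for illustration only.
    dist_name is the name used on PyPI when it differs from the import name.
    """
    if importlib.util.find_spec(module_name) is None:
        return f"{module_name}: not importable"
    try:
        version = metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        # The module is importable but has no installed distribution metadata
        # under that name (for example, standard-library modules).
        version = "unknown version"
    return f"{module_name}: importable ({version})"


if __name__ == "__main__":
    # Import name and distribution name for each package seen in the traceback.
    for module_name, dist_name in (
        ("kedro", "kedro"),
        ("kedro_datasets", "kedro-datasets"),
        ("pyspark", "pyspark"),
    ):
        print(describe_module(module_name, dist_name))
```

Running this with the same interpreter that runs kedro run (here, the one inside venv) shows whether pyspark is missing from that environment specifically, rather than from another Python installation on the machine.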
Your Environment
- Kedro version used (pip show kedro or kedro -V): kedro, version 0.19.5
- Python version used (python -V): Python 3.11.7
- Operating system and version: Windows 11 Home Single Language 23H2 22631.3527
Hi @thorugo-code, thanks for opening this issue. I'm sorry you're facing problems getting started with Kedro. It is actually not expected that pyspark is added to the requirements.txt. Instead, we'd expect:
kedro-datasets[spark-sparkdataset]>=3.0; python_version >= "3.9"
kedro-datasets[spark.SparkDataset]>=1.0; python_version < "3.9"
to be added. Our SparkDataset has a dependency on pyspark, so pyspark is installed as a transitive dependency. I've replicated the steps and on my side pyspark is successfully installed without needing to make any alterations. Could you share your resulting requirements.txt file?
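To check a generated requirements.txt for the entries described above, a rough sketch along these lines could be used. The function name spark_requirement_lines is hypothetical, and the parsing is deliberately simple: it only looks for kedro-datasets lines whose extras mention "spark" and does not implement the full requirements file format.

```python
import re
from pathlib import Path
from typing import List

# Matches lines such as: kedro-datasets[spark-sparkdataset]>=3.0; ...
_KEDRO_DATASETS_EXTRAS = re.compile(r"^kedro-datasets\[([^\]]*)\]", re.IGNORECASE)


def spark_requirement_lines(requirements_text: str) -> List[str]:
    """Return the kedro-datasets requirement lines whose extras mention Spark.

    spark_requirement_lines is a hypothetical helper for illustration only.
    """
    matches = []
    for line in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        stripped = line.split("#", 1)[0].strip()
        found = _KEDRO_DATASETS_EXTRAS.match(stripped)
        if found and "spark" in found.group(1).lower():
            matches.append(stripped)
    return matches


if __name__ == "__main__":
    # Assumes the script is run from the project root created by kedro new.
    path = Path("requirements.txt")
    if path.exists():
        lines = spark_requirement_lines(path.read_text(encoding="utf-8"))
        print(lines if lines else "No kedro-datasets Spark extra found")
    else:
        print("requirements.txt not found in the current directory")
```

If the list is empty, the Spark extra is absent from the file, which would explain why pip install -r requirements.txt did not pull in pyspark.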
Closing this due to inactivity. Feel free to re-open this issue if you're facing the same problem.
This software needs to be more user-friendly and reliable. You cannot reach the mass market with such complicated software.
Hi @maximilian22x, thank you for your feedback. Could you please provide more specific details on which aspects of Kedro you find complicated or unreliable? Are there particular features or areas where you believe the user experience could be improved?