Kedro versioning system (seemingly) randomly throws kedro.io.core.DatasetError for some versioned datasets

Open EloyID opened this issue 1 year ago • 1 comments

Description

For some versioned datasets, Kedro throws kedro.io.core.DatasetError: Cannot save versioned dataset, even though there is no non-versioned dataset with the same name at the expected path. Kedro does create the folder in which the versioned dataset should be saved. The image shows the created folder that causes the error next to a similar versioned dataset.

image

Context

This error prevents me from being able to save some versioned datasets.

Steps to Reproduce


# the one causing the error

X_train_batched_energy_pca_as_target_dataset_preprocessed:
  type: pickle.PickleDataset
  filepath: data/05_model_input/X_train_batched_energy_pca_as_target_dataset_preprocessed.pkl
  versioned: true
  metadata:
    kedro-viz:
      layer: train_data

# one that works correctly

X_train_merged_input_data_preprocessed:
  type: pickle.PickleDataset
  filepath: data/05_model_input/X_train_merged_input_data_preprocessed.pkl
  versioned: true
  metadata:
    kedro-viz:
      layer: train_data

Expected Result

The dataset is created and no error is raised.

Actual Result

The containing folder is created, but Kedro raises an error instead of creating the dataset.

                   INFO     Saving data to                                                       data_catalog.py:525
                             X_train_batched_energy_pca_as_target_dataset_preprocessed
                             (PickleDataset)...

Traceback (most recent call last):
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\io\core.py", line 614, in save
    super().save(data)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\io\core.py", line 214, in save
    self._save(data)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro_datasets\pickle\pickle_dataset.py", line 225, in _save
    with self._fs.open(save_path, **self._fs_open_args_save) as fs_file:
  File "C:\Path\to\my\windows\python\lib\site-packages\fsspec\spec.py", line 1295, in open
    f = self._open(
  File "C:\Path\to\my\windows\python\lib\site-packages\fsspec\implementations\local.py", line 180, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "C:\Path\to\my\windows\python\lib\site-packages\fsspec\implementations\local.py", line 302, in __init__
    self._open()
  File "C:\Path\to\my\windows\python\lib\site-packages\fsspec\implementations\local.py", line 307, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Path/to/my/kedroproject/energy-market-forecast/data/05_model_input/X_train_batched_energy_pca_as_target_dataset_preprocessed.pkl/2024-02-19T08.33.20.180Z/X_train_batched_energy_pca_as_target_dataset_preprocessed.pkl'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\sequential_runner.py", line 75, in _run
    run_node(node, catalog, hook_manager, self._is_async, session_id)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\runner.py", line 331, in run_node
    node = _run_node_sequential(node, catalog, hook_manager, session_id)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\runner.py", line 444, in _run_node_sequential
    catalog.save(name, data)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\io\data_catalog.py", line 532, in save
    dataset.save(data)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\io\core.py", line 618, in save
    raise DatasetError(
kedro.io.core.DatasetError: Cannot save versioned dataset 'X_train_batched_energy_pca_as_target_dataset_preprocessed.pkl' to 'C:/Path/to/my/kedroproject/energy-market-forecast/data/05_model_input' because a file with the same name already exists in the directory. This is likely because versioning was enabled on a dataset already saved previously. Either remove 'X_train_batched_energy_pca_as_target_dataset_preprocessed.pkl' from the directory or manually convert it into a versioned dataset by placing it in a versioned directory (e.g. with default versioning format 'C:/Path/to/my/kedroproject/energy-market-forecast/data/05_model_input/X_train_batched_energy_pca_as_target_dataset_preprocessed.pkl/YYYY-MM-DDThh.mm.ss.sssZ/X_train_batched_energy_pca_as_target_dataset_preprocessed.pkl').

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Path\to\my\windows\python\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Path\to\my\windows\python\lib\runpy.py", line 86, in _run_code       
    exec(code, run_globals)
  File "C:\Path\to\my\windows\python\Scripts\kedro.exe\__main__.py", line 7, in <module>
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\framework\cli\cli.py", line 198, in main
    cli_collection()
  File "C:\Path\to\my\windows\python\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\framework\cli\cli.py", line 127, in main
    super().main(
  File "C:\Path\to\my\windows\python\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "C:\Path\to\my\windows\python\lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Path\to\my\windows\python\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Path\to\my\windows\python\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\framework\cli\project.py", line 225, in run
    session.run(
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\framework\session\session.py", line 392, in run
    run_result = runner.run(
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\runner.py", line 117, in run
    self._run(pipeline, catalog, hook_or_null_manager, session_id)  # type: ignore[arg-type]
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\sequential_runner.py", line 78, in _run
    self._suggest_resume_scenario(pipeline, done_nodes, catalog)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\runner.py", line 206, in _suggest_resume_scenario
    start_p_persistent_ancestors = _find_persistent_ancestors(
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\runner.py", line 249, in _find_persistent_ancestors
    if _has_persistent_inputs(current_node, catalog):
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\runner.py", line 290, in _has_persistent_inputs
    if isinstance(catalog._datasets[node_input], MemoryDataset):
KeyError: 'pca_target_regression.trained_pca_target_regression'

Your Environment

  • Kedro version used (pip show kedro or kedro -V): kedro, version 0.19.2
  • Python version used (python -V): Python 3.10.13
  • Operating system and version: Microsoft Windows [Version 10.0.22621.1928]

Thank you for your help and your work, I really like Kedro!

EloyID, Feb 19 '24 09:02

I am not 100% sure, but it seems to be related to the length of the filename: it works when I switch to short names and fails in the same way with random long names. Maybe the thrown error should be more explicit about this.

EloyID, Feb 20 '24 16:02
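
The filename-length hypothesis can be checked before running the pipeline. Below is a minimal sketch, not part of Kedro: the helper names `versioned_save_path` and `exceeds_windows_limit` are hypothetical, and it assumes the classic Windows path limit of 260 characters (including the terminating NUL) applies, i.e. long-path support is not enabled. It rebuilds the `<filepath>/<version>/<filename>` layout shown in the error message and measures its length. Pass the absolute filepath, since the limit applies to the full path.

```python
from pathlib import PurePosixPath

# Classic Win32 MAX_PATH; the count includes the terminating NUL character.
WINDOWS_MAX_PATH = 260
# Example timestamp copied from the traceback, used only to get a realistic length.
VERSION_EXAMPLE = "2024-02-19T08.33.20.180Z"


def versioned_save_path(filepath: str, version: str = VERSION_EXAMPLE) -> str:
    """Return <filepath>/<version>/<filename>, the layout shown in the error message."""
    path = PurePosixPath(filepath)
    return str(path / version / path.name)


def exceeds_windows_limit(filepath: str, version: str = VERSION_EXAMPLE) -> bool:
    """True if the versioned path leaves no room for the terminating NUL."""
    return len(versioned_save_path(filepath, version)) >= WINDOWS_MAX_PATH


if __name__ == "__main__":
    # Hypothetical absolute path; substitute the real project location.
    candidate = "C:/Path/to/my/kedroproject/data/05_model_input/some_dataset.pkl"
    print(len(versioned_save_path(candidate)), exceeds_windows_limit(candidate))
```

Because the dataset filename appears twice in the versioned path, each extra character in the name adds two characters to the total, which would be consistent with long names failing while short ones work.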

https://stackoverflow.com/questions/62606023/filenotfounderror-on-long-pathname-in-python-in-windows

Closing this, as it is a Windows issue and Kedro cannot detect anything from the FileNotFoundError.

noklam, Mar 05 '24 13:03