kedro-plugins

Partitioned dataset does not work with parallel runner because of caching in exists method

Open nilsbore opened this issue 1 year ago • 5 comments

Description

When I run a pipeline containing partitioned datasets that are created during the run, using the command kedro run --runner=ParallelRunner, I get an error when those datasets are loaded by subsequent nodes: DatasetError: No partitions found in '<path>'.

Digging into the problem, the cause seems to be the call to catalog.exists(dataset) in the method _set_manager_datasets of ParallelRunner. This calls exists on PartitionedDataset, which in turn calls _list_partitions. That method has a cachedmethod decorator, so the empty listing from this first call (made before any partitions exist) is presumably cached and reused, which causes subsequent calls to exists during the pipeline run to return False. Removing the cachedmethod decorator solves the issue.
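The suspected mechanism can be illustrated with a small, dependency-free sketch. It is not kedro's code: CachedPartitionLister is a hypothetical stand-in, and a plain attribute replaces the cachedmethod decorator. It only shows how a listing cached before any partitions exist keeps reporting "nothing there" after files are written by code that never touches that cache.

```python
import tempfile
from pathlib import Path


class CachedPartitionLister:
    """Hypothetical toy stand-in for a dataset that caches its partition listing."""

    def __init__(self, path, use_cache=True):
        self._path = Path(path)
        self._use_cache = use_cache
        self._cache = None  # holds the first listing once it has been computed

    def _list_partitions(self):
        # An empty list is a valid cached value, so it is reused as well.
        if self._use_cache and self._cache is not None:
            return self._cache
        partitions = sorted(p.name for p in self._path.glob("*.txt"))
        if self._use_cache:
            self._cache = partitions
        return partitions

    def exists(self):
        return bool(self._list_partitions())


def simulate(use_cache):
    with tempfile.TemporaryDirectory() as tmp:
        dataset = CachedPartitionLister(tmp, use_cache=use_cache)
        before = dataset.exists()  # early check, made before anything is written
        # Stands in for a write made elsewhere (for example by a worker process)
        # that never invalidates this object's cache.
        (Path(tmp) / "part_0.txt").write_text("data")
        after = dataset.exists()
        return before, after


print(simulate(use_cache=True))   # (False, False): the stale empty listing is reused
print(simulate(use_cache=False))  # (False, True): the new partition is seen
```

The uncached variant corresponds to the observation that removing the decorator makes the error go away; whether that is the right fix for the library is a separate question.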

I am not sure whether this is a bug in PartitionedDataset or in ParallelRunner, so please let me know if I should move this to the kedro repo instead.

Context

I cannot run my pipeline containing intermediate partitioned datasets using parallel runner. This blocks me from updating to kedro 0.19.

Steps to Reproduce

  1. Create a pipeline with intermediate datasets (created and consumed by subsequent nodes) of type PartitionedDataset.
  2. Run the pipeline using kedro run --runner=ParallelRunner.
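A minimal sketch of such a pipeline might look like the following. The node functions, the dataset names a and n_partitions, and the node names are hypothetical. The sketch assumes that the catalog declares a as a PartitionedDataset (for example with a text dataset as the underlying type and a path such as data/a), so that the first node's dictionary is saved as one file per key and the second node receives a dictionary of load callables.

```python
def make_partitions():
    """First node: return a mapping of partition id -> data to be saved."""
    return {f"part_{i}": f"content {i}" for i in range(3)}


def count_partitions(partitions):
    """Second node: receives partition id -> load callable and loads each one."""
    loaded = {name: load() for name, load in partitions.items()}
    return len(loaded)


def create_pipeline(**kwargs):
    # Imported here so the node functions can be exercised without kedro installed.
    from kedro.pipeline import node, pipeline

    return pipeline(
        [
            # "a" is assumed to be a PartitionedDataset entry in the catalog.
            node(make_partitions, inputs=None, outputs="a", name="make_partitions"),
            node(count_partitions, inputs="a", outputs="n_partitions", name="count_partitions"),
        ]
    )
```

With a pipeline of this shape, the second node is the one that would hit the error when the run uses --runner=ParallelRunner and the data folder starts out empty.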

Expected Result

The pipeline should run with no errors.

Actual Result

The pipeline fails with

`DatasetError: No partitions found in '<path>'`

when trying to load the intermediate partitioned dataset.

Your Environment

  • Kedro version used: version 0.19.3
  • Kedro datasets used: version 2.1.0
  • Python version used: Python 3.10.12
  • Operating system and version: Ubuntu 22.04

Thank you for your efforts with Kedro!

nilsbore • Mar 21 '24 09:03

@nilsbore Would you be able to provide an example repository that we can use to reproduce the result? In addition, what was the last version in which it worked? AFAIK we haven't introduced changes to ParallelRunner or PartitionedDataset, so I would like to understand whether this is a regression or a new bug.

noklam • Mar 25 '24 13:03

I will see if I can put together an example project today. In the meantime, I'll address the other questions:

  1. The last versions I tested in which it works are kedro 0.18.14 and kedro-datasets 1.7.1.
  2. This commit added the _set_manager_datasets logic in ParallelRunner mentioned above. Looks like it's been there since 0.19.0.

So I think it's pretty safe to say it's a regression with kedro 0.19.

nilsbore • Mar 26 '24 07:03

I created a minimal example here: https://github.com/nilsbore/kedro-parallel-partitioned-bug . If you run kedro run --runner=ParallelRunner with an empty data folder, it will crash with DatasetError: No partitions found in '/path/to/kedro-parallel-partitioned-bug/data/a'. A plain kedro run with a clean data folder works as expected.

nilsbore • Mar 26 '24 12:03

@noklam Again, please let me know if I should move this to the kedro repo instead. Thanks for the help.

nilsbore • Apr 10 '24 14:04

@nilsbore Sorry for the late reply! I miss GitHub notifications all the time. You can always find me in our Slack (kedro.slack.org).

Thank you for the example. This makes sense. I have seen a few issues with the ParallelRunner due to SharedMemoryDataset since 0.19.0; maybe they are all related.

We can keep this in this repository, since we monitor both repos with a GitHub Project. I can transfer the issue if we confirm that the changes should be made in the kedro repo instead.

noklam • Apr 10 '24 14:04