Partitioned dataset does not work with parallel runner because of caching in exists method
Description
When I run a pipeline containing partitioned datasets created during the run using the command kedro run --runner=ParallelRunner, I get an error when those datasets are loaded by subsequent nodes: DatasetError: No partitions found in '<path>'.
Digging into the problem, it seems to be because of the line with the call catalog.exists(dataset) when calling the method _set_manager_datasets in ParallelRunner. This will call the method exists on PartitionedDataset which in turn calls the method _list_partitions. This method has a cachedmethod decorator that causes subsequent calls to exists when running the pipeline to return False. Removing the cachedmethod decorator solves the issue.
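To illustrate the mechanism described above, here is a minimal single-process sketch. It is not the kedro-datasets implementation: `CachedPartitionLister`, `invalidate_cache`, and `demo` are hypothetical names, and a plain dict stands in for the cache behind the cachedmethod decorator. It only shows how an early exists call on an empty folder can leave a stale "no partitions" answer behind until the cache is cleared; how the cached state ends up in the worker processes is not modelled.

```python
import tempfile
from pathlib import Path


class CachedPartitionLister:
    """Hypothetical stand-in for a dataset that caches its partition listing."""

    def __init__(self, path):
        self._path = Path(path)
        self._cache = {}  # plays the role of the cache behind `cachedmethod`

    def _list_partitions(self):
        # Reuse the cached listing if there is one, even if it is now stale.
        if "partitions" not in self._cache:
            self._cache["partitions"] = sorted(
                p.name for p in self._path.glob("*.csv")
            )
        return self._cache["partitions"]

    def exists(self):
        return bool(self._list_partitions())

    def invalidate_cache(self):
        self._cache.clear()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        dataset = CachedPartitionLister(tmp)
        before = dataset.exists()  # folder is empty, so an empty listing is cached
        # A node "saves" a partition after the early existence check.
        (Path(tmp) / "part_0.csv").write_text("a,b\n1,2\n")
        stale = dataset.exists()  # still answers from the cached empty listing
        dataset.invalidate_cache()
        fresh = dataset.exists()  # re-lists the folder and finds the partition
        return before, stale, fresh
```

In this sketch `demo()` reports that the partition is only seen after the cache is cleared, which matches the observation that removing the decorator makes the error go away.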
It is unclear if this is a bug with PartitionedDataset or with ParallelRunner so please let me know if I should move this to the kedro repo instead.
Context
I cannot run my pipeline containing intermediate partitioned datasets using parallel runner. This blocks me from updating to kedro 0.19.
Steps to Reproduce
- Create a pipeline with intermediate datasets (created and consumed by subsequent nodes) of type PartitionedDataset.
- Run the pipeline using kedro run --runner=ParallelRunner.
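For readers unfamiliar with the setup, the nodes on either side of an intermediate PartitionedDataset roughly follow the shape sketched below: the producing node returns a dict mapping partition IDs to data, and the consuming node receives a dict mapping partition IDs to load callables. The function names and partition contents are hypothetical and are not taken from the reproduction project, and the dict comprehension in the middle only imitates what the dataset does between the two nodes.

```python
def create_partitions():
    """Producer node: return a dict of partition ID -> data to be saved."""
    return {f"part_{i}": f"contents of partition {i}" for i in range(3)}


def consume_partitions(partitions):
    """Consumer node: receives partition ID -> zero-argument load callable."""
    return {name: load() for name, load in sorted(partitions.items())}


# Stand-in for the dataset between the two nodes: wrap each saved partition
# in a callable that returns it lazily when invoked.
saved = create_partitions()
lazy = {name: (lambda value=value: value) for name, value in saved.items()}
result = consume_partitions(lazy)
```

With the ParallelRunner failure described here, the consuming node never gets this far, because loading the intermediate dataset raises the error before any partition is read.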
Expected Result
The pipeline should run with no errors.
Actual Result
The pipeline fails with `DatasetError: No partitions found in '<path>'` when trying to load the intermediate partitioned dataset.
Your Environment
- Kedro version used: version 0.19.3
- Kedro datasets used: version 2.1.0
- Python version used: Python 3.10.12
- Operating system and version: Ubuntu 22.04
Thank you for your efforts with Kedro!
@nilsbore Would you be able to provide an example repository that we can use to reproduce the result? In addition, what was the last version where it worked? AFAIK we haven't introduced changes to ParallelRunner or PartitionedDataset, so I would like to understand whether this is a regression or a new bug.
I will see if I can put together an example project today. In the meantime, I'll address the other questions:
- The last versions I tested where it works are kedro 0.18.14 and kedro-datasets 1.7.1.
- This commit added the _set_manager_datasets logic in ParallelRunner mentioned above. Looks like it's been there since 0.19.0.
So I think it's pretty safe to say it's a regression with kedro 0.19.
I created a minimal example here: https://github.com/nilsbore/kedro-parallel-partitioned-bug . If you run kedro run --runner=ParallelRunner with an empty data folder, it will crash with DatasetError: No partitions found in '/path/to/kedro-parallel-partitioned-bug/data/a'. Running using kedro run with a clean data folder works as expected.
@noklam Again, please let me know if I should move this to the kedro repo instead. Thanks for the help.
@nilsbore Sorry for the late reply! I miss GitHub notifications all the time. You can always find me in our Slack (kedro.slack.org).
Thank you for the example. This makes sense. I have seen a few issues with the ParallelRunner due to SharedMemoryDataset since 0.19.0; maybe they are all related.
We can keep this in this repository, as we monitor both repos with GitHub Projects. I can transfer the issue if we confirm the changes should be made in the kedro repo instead.