kedro-plugins

`PartitionedDataset` Caching Support

Open lordsoffallen opened this issue 11 months ago • 7 comments

Description

I have a node that returns dict[str, Callable] so that Kedro saves my partitioned data. I've often had runs fail midway because of an edge case I didn't cover, and execution then starts all over again.
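To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use Kedro itself: `make_partitions` stands in for the node, and `save_all` is a hypothetical stand-in for a save loop that calls each partition's callable in turn. Both names and the failing partition "c" are assumptions for illustration only.

```python
from typing import Callable


def make_partitions(ids: list[str]) -> dict[str, Callable[[], str]]:
    """Hypothetical node: return one zero-argument callable per partition."""

    def build(pid: str) -> Callable[[], str]:
        def load_and_process() -> str:
            # Simulate an edge case that only shows up for one partition.
            if pid == "c":
                raise ValueError(f"uncovered edge case in partition {pid}")
            return f"processed-{pid}"

        return load_and_process

    return {pid: build(pid) for pid in ids}


def save_all(partitions: dict[str, Callable[[], str]], store: dict[str, str]) -> None:
    """Stand-in for the save loop: call each callable and store its result."""
    for pid, make_data in sorted(partitions.items()):
        store[pid] = make_data()


store: dict[str, str] = {}
try:
    save_all(make_partitions(["a", "b", "c", "d"]), store)
except ValueError as err:
    print(f"run failed: {err}")

# Partitions saved before the failure are still in the store,
# but a plain re-run would compute them again from scratch.
print(sorted(store))
```

In this sketch, "a" and "b" are already saved when "c" fails, yet nothing tells the next run to skip them.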

Context

I need this to speed up experimentation in Kedro and to avoid the unnecessary costs of re-running the node.

Possible Implementation

Add a new parameter to PartitionedDataset that skips partitions whose files already exist, something like use_cache: True.
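A rough sketch of the skip-existing behaviour, written as plain Python rather than against Kedro's actual classes. The function `save_partitions`, the `.txt` file layout, and the temporary-file rename are assumptions for illustration; `use_cache` is the parameter name proposed above and is not an existing Kedro option.

```python
import tempfile
from pathlib import Path
from typing import Callable


def save_partitions(
    root: Path,
    partitions: dict[str, Callable[[], str]],
    use_cache: bool = False,
) -> list[str]:
    """Write each partition to root/<name>.txt and return the names computed."""
    root.mkdir(parents=True, exist_ok=True)
    computed = []
    for name, make_data in sorted(partitions.items()):
        target = root / f"{name}.txt"
        if use_cache and target.exists():
            # Saved by an earlier run: the callable is never invoked.
            continue
        data = make_data()
        # Write to a temporary file and rename it, so an interrupted write
        # is never mistaken for a complete partition on the next run.
        tmp = target.with_suffix(".tmp")
        tmp.write_text(data)
        tmp.replace(target)
        computed.append(name)
    return computed


with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir) / "partitions"
    parts = {name: (lambda n=name: f"data-{n}") for name in ["a", "b", "c"]}

    # First run only gets as far as partition "a".
    first = save_partitions(root, {"a": parts["a"]}, use_cache=True)
    # Second run skips "a" and computes the rest.
    second = save_partitions(root, parts, use_cache=True)
    # Without the flag, everything is recomputed.
    third = save_partitions(root, parts, use_cache=False)
```

Checking only for file existence is the simplest possible policy. It cannot tell whether an existing partition is stale, so a real implementation would probably need a way to force a full rewrite.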

Possible Alternatives

I can definitely subclass the dataset and implement this myself, but I thought it would be a useful feature to have in the core code.

lordsoffallen, Jan 03 '25 19:01

There's some discussion of this in #928.

I've written a couple custom datasets for this use case and for parallel processing of partitions, attached here in case they're helpful. https://gist.github.com/fgassert/c6c9a87c47d2eaffd30d3f72b0ff675a

fgassert, Jan 15 '25 15:01

I think they're different. I'm fine with sequential execution, but I want the save to continue where it left off. It's easy enough to hack, but it seemed like a nice feature to have in Kedro.

lordsoffallen, Jan 16 '25 10:01

Try the third one, RobustPartitionedDataset? It's patterned on the built-in incremental dataset and addresses some edge cases. You can set it up like a regular PartitionedDataset, with the additional parameter behavior: complete_missing:

mydataset:
  type: <my-project>.datasets.robust_partitioned_dataset.RobustPartitionedDataset
  path: ...
  dataset:
    type: ...
  behavior: complete_missing

https://gist.github.com/fgassert/c6c9a87c47d2eaffd30d3f72b0ff675a#file-robust_partitioned_dataset-py

fgassert, Jan 16 '25 10:01

Thanks for the pointers 🙌 As I said, I wasn't looking for a custom solution, since this could be done with a few lines of changes in the original code. I opened the issue so that this could (potentially) be brought into core Kedro rather than solved with a custom dataset.

lordsoffallen, Jan 17 '25 11:01

Hey @lordsoffallen Thanks for this issue! Would this be something you'd be interested in working on?

ankatiyar, Apr 17 '25 12:04

Unfortunately, I won't have time soon to do this. :/

lordsoffallen, Apr 17 '25 13:04

@lordsoffallen no worries! Feel free to work on it whenever you get a chance. I believe the team is focused on issues related to the upcoming Kedro 1.0 release at the moment, so this might not take priority before that on our end :)

ankatiyar, Apr 17 '25 14:04