kedro-plugins
`PartitionedDataset` Caching Support
Description
I have a node that returns dict[str, Callable] so that kedro can save my partitioned data. It has often failed midway because of an edge case I didn't cover, and the next run then starts over from the beginning.
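For context, a minimal sketch of the kind of node described above. The names `process_items` and `make_partition` and the uppercase "work" are hypothetical stand-ins; the point is only that each dictionary value is a callable, so the work for a partition runs when that partition is saved, and a failure in one partition can leave earlier ones already written.

```python
from typing import Callable, Dict, List


def process_items(items: List[str]) -> Dict[str, Callable[[], str]]:
    """Hypothetical node: map partition ids to callables that produce the data."""

    def make_partition(item: str) -> Callable[[], str]:
        # Bind `item` per partition so each callable computes its own value.
        def compute() -> str:
            return item.upper()  # stand-in for the expensive per-partition work

        return compute

    return {f"part_{i}": make_partition(item) for i, item in enumerate(items)}
```

Nothing is computed until a callable is invoked, which is why skipping partitions that already exist on disk would also skip their computation.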
Context
This would speed up experimentation in kedro and avoid the unnecessary cost of re-running the whole node.
Possible Implementation
Add a new parameter to PartitionedDataset that skips partitions whose files already exist. Something like use_cache: True
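A rough sketch of the skip-existing logic, written against plain files rather than kedro itself. The function `save_partitions`, its `suffix` argument, and the temp-file rename are assumptions for illustration, not kedro's API; only the `use_cache` name comes from the proposal above.

```python
from pathlib import Path
from typing import Callable, Dict, List


def save_partitions(
    base_dir: Path,
    partitions: Dict[str, Callable[[], str]],
    use_cache: bool = False,
    suffix: str = ".txt",
) -> List[str]:
    """Write each partition under base_dir; return the ids actually written."""
    base_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for partition_id, loader in sorted(partitions.items()):
        target = base_dir / f"{partition_id}{suffix}"
        if use_cache and target.exists():
            # Already saved by an earlier run: skip without calling the loader.
            continue
        data = loader()
        # Write to a temporary name first so a crash mid-write does not
        # leave a partial file that a later run would treat as complete.
        tmp = target.parent / (target.name + ".tmp")
        tmp.write_text(data)
        tmp.replace(target)
        written.append(partition_id)
    return written
```

With use_cache enabled, a re-run after a mid-way failure would only invoke the callables for partitions that are still missing.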
Possible Alternatives
I can definitely subclass the dataset and implement this myself, but I thought it would be a useful feature to have in the core code.
There's some discussion of this in #928.
I've written a couple custom datasets for this use case and for parallel processing of partitions, attached here in case they're helpful. https://gist.github.com/fgassert/c6c9a87c47d2eaffd30d3f72b0ff675a
I think they're different. I am okay with sequential execution, but I want a run to be able to continue where the previous one left off. It's easy enough to hack, but it seemed like a nice feature to have in kedro.
Try the third RobustPartitionedDataset? It's patterned off of the builtin incremental dataset to address some edge cases. You can set it up like a regular PartitionedDataset, with the additional parameter behavior: complete_missing
mydataset:
  type: <my-project>.datasets.robust_partitioned_dataset.RobustPartitionedDataset
  path: ...
  dataset:
    type: ...
  behavior: complete_missing
https://gist.github.com/fgassert/c6c9a87c47d2eaffd30d3f72b0ff675a#file-robust_partitioned_dataset-py
Thanks for the pointers 🙌 As I said, I wasn't looking for a custom solution, since this could be done with a few lines of changes in the original code. I opened the issue so that this could (potentially) be brought into core kedro rather than solved with a custom dataset.
Hey @lordsoffallen Thanks for this issue! Would this be something you'd be interested in working on?
Unfortunately, I won't have time soon to do this. :/
@lordsoffallen no worries! Feel free to work on it whenever you get a chance, I believe the team is focussed on issues related to the upcoming Kedro 1.0 release at the moment so this might not take priority before that on our end :)