kedro
Added 'custom_args' attribute to AbstractDataset class
Description
The Inspiration
- In my experience with Kedro, perhaps one of the most useful CI/CD methods is selecting which artifacts (datasets) to save to experiment-specific folders (and which to avoid saving repeatedly, such as large training datasets).
- The best method I have found to accomplish this is to create a custom subclass of AbstractDataset that accepts 'custom_args'.
- For example, I could create `project_name.custom_datasets.csv.CustomCSVDataset` and write the following `catalog.yml`:

      predictions:
        type: project_name.custom_datasets.csv.CustomCSVDataset
        filepath: data/08_reporting/predictions.csv
        custom_args:
          save_to_mlruns: True

- I could then configure my code to dynamically save this dataset to `mlruns/08_reporting/predictions.csv` (and, similarly, do so for all datasets with `save_to_mlruns: True` in the catalog).
Personally, this is my most frequent (and favorite) application of "dataset-specific" args. Unfortunately, I find myself creating a custom class for each type of artifact (e.g. CSV, plotly, pickle), and doing so again each time I create a new kedro project.
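As a rough illustration of the dynamic-saving idea described above, here is a minimal sketch that works on an already-parsed catalog (a plain dict) rather than on Kedro objects. The `mlruns_targets` helper, the `training_data` entry, and the rule of swapping the leading `data` directory for `mlruns` are assumptions made for this example; they are not part of Kedro or of this PR.

```python
from pathlib import PurePosixPath

# Hypothetical catalog contents, as they might look once catalog.yml is parsed.
CATALOG = {
    "predictions": {
        "type": "project_name.custom_datasets.csv.CustomCSVDataset",
        "filepath": "data/08_reporting/predictions.csv",
        "custom_args": {"save_to_mlruns": True},
    },
    # Illustrative entry without the flag; it should be skipped.
    "training_data": {
        "type": "project_name.custom_datasets.csv.CustomCSVDataset",
        "filepath": "data/05_model_input/training_data.csv",
    },
}


def mlruns_targets(catalog, root="mlruns"):
    """Map each dataset flagged with save_to_mlruns to its path under `root`."""
    targets = {}
    for name, entry in catalog.items():
        if not entry.get("custom_args", {}).get("save_to_mlruns", False):
            continue
        path = PurePosixPath(entry["filepath"])
        # Assumption: the first path component ("data") is replaced by `root`.
        targets[name] = str(PurePosixPath(root, *path.parts[1:]))
    return targets


print(mlruns_targets(CATALOG))
```

Only `predictions` is selected here, because it is the only entry carrying the flag.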
Broader Usage
The above is a very specific use of the proposed 'custom_args' feature, though I believe many developers would find it useful to have access to custom args without having to rewrite numerous custom classes. I know it was a popular feature among my former team members (for the dynamic saving method detailed above)!
Development notes
Given the minor extent of the change, I don't believe this merits an independent test. If I were to test it, however, I would instantiate an AbstractDataset and a child of AbstractDataset (e.g. CSVDataSet), and ensure that I could properly access the configured custom_args on each.
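A test along those lines might look roughly like the sketch below. It uses stand-in classes (`BaseDataset`, `CsvLikeDataset`) instead of the real Kedro classes, so both the class names and the constructor signature are hypothetical and only illustrate the shape of the check.

```python
# Stand-ins for illustration only; these are not Kedro classes.
class BaseDataset:
    def __init__(self, custom_args=None):
        # Copy so later changes to the caller's dict do not leak in.
        self.custom_args = dict(custom_args or {})


class CsvLikeDataset(BaseDataset):
    def __init__(self, filepath, custom_args=None):
        super().__init__(custom_args=custom_args)
        self.filepath = filepath


def test_custom_args_accessible():
    base = BaseDataset(custom_args={"save_to_mlruns": True})
    child = CsvLikeDataset(
        "data/08_reporting/predictions.csv",
        custom_args={"save_to_mlruns": True},
    )
    assert base.custom_args["save_to_mlruns"] is True
    assert child.custom_args["save_to_mlruns"] is True
    # A dataset configured without custom_args exposes an empty mapping.
    assert CsvLikeDataset("x.csv").custom_args == {}


test_custom_args_accessible()
```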
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
- [ ] Read the contributing guidelines
- [ ] Signed off each commit with a Developer Certificate of Origin (DCO)
- [ ] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
- [ ] Updated the documentation to reflect the code changes
- [ ] Added a description of this change in the `RELEASE.md` file
- [ ] Added tests to cover my changes
- [ ] Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team
Hi @noamgoldberg, thanks for your PR!
Before proceeding, could you have a look at the `metadata` key and see if it would suit your needs? It's not part of the `AbstractDataset`, but all derived datasets have it.
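For illustration, the same flag could be carried under `metadata` instead of a new attribute. The sketch below again works on a parsed catalog dict; the `flagged` helper and the `training_data` entry are assumptions made for this example, not Kedro APIs.

```python
# Hypothetical parsed catalog: the flag lives under "metadata".
CATALOG = {
    "predictions": {
        "type": "pandas.CSVDataset",
        "filepath": "data/08_reporting/predictions.csv",
        "metadata": {"save_to_mlruns": True},
    },
    # Illustrative entry with no metadata; it should not be selected.
    "training_data": {
        "type": "pandas.CSVDataset",
        "filepath": "data/05_model_input/training_data.csv",
    },
}


def flagged(catalog, key="save_to_mlruns"):
    """Return the sorted names of datasets whose metadata sets `key` to a truthy value."""
    return sorted(
        name
        for name, entry in catalog.items()
        if (entry.get("metadata") or {}).get(key)
    )


print(flagged(CATALOG))
```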
Hi @noamgoldberg, I echo what I said about `metadata` in https://github.com/kedro-org/kedro/pull/3737#issuecomment-2095343920
About custom arguments, the preferred route would be to either use `metadata` or define your own dataset.
I appreciate your pull request but I am closing it for now 🙏🏼 If you have further ideas on how to improve Kedro, please open a new Discussion in the "Discussions" tab and let's take it from there.
Thanks again!