Added 'custom_args' attribute to AbstractDataset class

Open noamgoldberg opened this issue 10 months ago • 1 comments

Description

The Inspiration

  • In my experience with Kedro, one of the most useful CI/CD practices is choosing which artifacts (datasets) to save to experiment-specific folders, and which to avoid saving repeatedly (such as large training datasets).
  • The best way I have found to do this is to create a custom dataset class, derived from AbstractDataset, that accepts 'custom_args'.
    • For example, I could create project_name.custom_datasets.csv.CustomCSVDataset and write the following catalog.yml:
      predictions:
          type: project_name.custom_datasets.csv.CustomCSVDataset
          filepath: data/08_reporting/predictions.csv
          custom_args:
              save_to_mlruns: True
      
    • I could then configure my code to dynamically save this dataset to mlruns/08_reporting/predictions.csv (and, similarly, do so for all datasets with save_to_mlruns: True in the catalog).
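
The routing described above could be sketched roughly as follows. This is a minimal, Kedro-independent illustration that works on the catalog as a plain dict (what a YAML loader would return). The helper names `mlruns_path` and `flagged_for_mlruns`, the `training_data` entry, and the rule "replace the leading `data` directory with `mlruns`" are assumptions made for the example, not code from this PR.

```python
from pathlib import PurePosixPath

# The catalog entry from above, as the dict a YAML loader would produce.
CATALOG = {
    "predictions": {
        "type": "project_name.custom_datasets.csv.CustomCSVDataset",
        "filepath": "data/08_reporting/predictions.csv",
        "custom_args": {"save_to_mlruns": True},
    },
    # Hypothetical second entry with no custom_args, so it is skipped.
    "training_data": {
        "type": "project_name.custom_datasets.csv.CustomCSVDataset",
        "filepath": "data/05_model_input/training_data.csv",
    },
}


def mlruns_path(filepath: str, root: str = "mlruns") -> str:
    """Swap the leading directory of a catalog filepath for `root`."""
    parts = PurePosixPath(filepath).parts
    return str(PurePosixPath(root, *parts[1:]))


def flagged_for_mlruns(catalog: dict) -> dict:
    """Map each dataset flagged with save_to_mlruns to its mlruns path."""
    return {
        name: mlruns_path(entry["filepath"])
        for name, entry in catalog.items()
        if entry.get("custom_args", {}).get("save_to_mlruns", False)
    }
```

Calling `flagged_for_mlruns(CATALOG)` returns only the `predictions` entry, mapped to `mlruns/08_reporting/predictions.csv`. The actual copy or save step is left out because it depends on how each dataset type writes its data.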

Personally, this is my most frequent (and favorite) application of "dataset-specific" args. Unfortunately, I find myself creating a custom class for each type of artifact (e.g. CSV, plotly, pickle), and doing so again each time I create a new Kedro project.

Broader Usage

The above is a very specific use of the proposed 'custom_args' feature, though I believe many developers would find it useful to have access to custom args without having to rewrite numerous custom classes. I know it was a popular feature among my former team members (for the dynamic saving method detailed above)!

Development notes

Given the minor extent of the change, I don't believe this merits an independent test. If I were to test it, however, I would instantiate an AbstractDataset and a child of AbstractDataset (e.g. CSVDataSet), and check that the configured custom_args are accessible on both.
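
Such a test might look roughly like the sketch below. To keep it runnable without Kedro installed, it uses stand-in classes rather than the real AbstractDataset or CSVDataSet. The class names, the constructor signature, and the choice to default `custom_args` to an empty dict are all assumptions for illustration.

```python
from typing import Optional


class StandInDataset:
    """Stand-in for the proposed base class; NOT Kedro's AbstractDataset."""

    def __init__(self, custom_args: Optional[dict] = None) -> None:
        # Copy the dict so later changes by the caller do not leak in.
        self.custom_args = dict(custom_args or {})


class StandInCSVDataset(StandInDataset):
    """Stand-in child class, playing the role of a CSV dataset."""

    def __init__(self, filepath: str, custom_args: Optional[dict] = None) -> None:
        super().__init__(custom_args=custom_args)
        self.filepath = filepath


def test_custom_args_accessible_on_child() -> None:
    dataset = StandInCSVDataset(
        "data/08_reporting/predictions.csv",
        custom_args={"save_to_mlruns": True},
    )
    assert dataset.custom_args == {"save_to_mlruns": True}


def test_custom_args_default_to_empty() -> None:
    assert StandInCSVDataset("data/01_raw/example.csv").custom_args == {}
```

Both functions follow the pytest naming convention, so a test runner would collect them as written. A real test would swap the stand-ins for the actual Kedro classes.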

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • [ ] Read the contributing guidelines
  • [ ] Signed off each commit with a Developer Certificate of Origin (DCO)
  • [ ] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • [ ] Updated the documentation to reflect the code changes
  • [ ] Added a description of this change in the RELEASE.md file
  • [ ] Added tests to cover my changes
  • [ ] Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

noamgoldberg • Mar 29 '24 00:03

Hi @noamgoldberg, thanks for your PR!

Before proceeding, could you have a look at the metadata key and see if it would suit your needs? It's not part of the AbstractDataset, but all derived datasets have it.
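
As an illustration of that suggestion, the catalog entry from the PR description could carry the flag under `metadata` instead of `custom_args`, with no custom class involved. This is a minimal sketch on a plain dict; the `pandas.CSVDataset` type string and the helper name `wants_mlruns` are assumptions for the example.

```python
# Same entry as in the PR description, with the flag moved under `metadata`.
# The built-in type name is an assumption; use whatever type the project uses.
ENTRY = {
    "type": "pandas.CSVDataset",
    "filepath": "data/08_reporting/predictions.csv",
    "metadata": {"save_to_mlruns": True},
}


def wants_mlruns(entry: dict) -> bool:
    """Return True if the entry's metadata sets save_to_mlruns."""
    metadata = entry.get("metadata") or {}
    return bool(metadata.get("save_to_mlruns", False))
```

Entries without a `metadata` key are treated as not flagged, so existing catalog entries would not need to change.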

astrojuanlu • Apr 03 '24 09:04

Hi @noamgoldberg, I echo what I said about metadata in https://github.com/kedro-org/kedro/pull/3737#issuecomment-2095343920

About custom arguments, the preferred route would be to either use metadata or define your own dataset.

I appreciate your pull request but I am closing it for now 🙏🏼 If you have further ideas on how to improve Kedro, please open a new Discussion in the "Discussions" tab and let's take it from there.

Thanks again!

astrojuanlu • May 06 '24 07:05