kedro
kedro copied to clipboard
Use user defined default dataset factory pattern over the one from the runner
Description
Fix #3720
Development notes
TODO:
- [x] Update DataCatalog code
- [x] Update CLI commands code
- [x] Docs
- [x] Test
- [ ] Release notes
Wanted to get opinions on the proposed solution before proceeding with the pending tasks mentioned above^
Proposed Solution
When DataCatalog
is created (in from_config
)
- If the last pattern in the sorted patterns list is a catch-all pattern i.e. has specificity == 0, it is popped out and stored separately as the
DataCatalog._default_pattern
(This would be the user-defined default pattern, not from the runner)
[!NOTE]
This assumes that there is only one catch-all pattern in the user's catalog. If there are more, the last one will be popped which wouldn't ever be used anyway.❓ [Question for reviewers]: We should probably error out if the user has more than one catch-all patterns in this case, does that make sense?
When a dataset is being fetched
- Try to match is against the patterns in
DataCatalog._dataset_patterns
ORDataCatalog._default_pattern
- Resolve config again first from
_dataset_patterns
OR_default_pattern
When runner.run() is called
When kedro run
is executed, the runner creates a shallow copy of the catalog object by calling the DataCatalog.shallow_copy()
- Here the runner inserts a default dataset pattern e.g.
extra_dataset_patterns = {'{default}': MemoryDataset}
- IF there is already a
_default_pattern
- it's not needed to do anything - IF there isn't a user defined catch-all pattern - add it to the
DataCatalog._dataset_patterns
(this will then not emit a warning about the catch-all pattern being used over the runner default dataset)
For the catalog CLI commands
Similar logic as above has been copied for the matching of datasets to patterns
Checklist
- [ ] Read the contributing guidelines
- [ ] Signed off each commit with a Developer Certificate of Origin (DCO)
- [ ] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
- [ ] Updated the documentation to reflect the code changes
- [ ] Added a description of this change in the
RELEASE.md
file - [ ] Added tests to cover my changes
- [ ] Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team
[Question for reviewers]: We should probably error out if the user has more than one catch-all patterns in this case, does that make sense?
Yes
Just dumping my thought here in case I am blocking the PRs. I have some questions about the shallow_copy
approach, but that was done long before this PR so I don't want to expand the discussion too much.
I think:
- There should be only 1 default dataset, validation should happen when
DataCatalog
is initialised. (I think this is consistent as theshallow_copy
will create a newDataCatalog
, this is where I find theshallow_copy
weird as it does not seem to be a reference but completely new object - multiple dataset pattern by Runner is fine. As long as only one of them is default datasets. For example, you could have different default for different namespaces which control by runner.
The most important question here is "should runner dataset patterns override DataCatalog
"? In framework setting, the way to override patterns is catalog.yml
. However, the order of how framework work is different.
-
DataCatalog
readcatalog.yml
first - Runner override default patterns (memory dataset)
This suggests that one should only use either catalog.yml
or runner
. Should it fails or just ignore the dataset pattern from runner? For me, I think I will go for raising an error instead of ignoring the pattern sliently.