Make credentials a resolver
Description
Make `credentials` an OmegaConf resolver, so that injecting credentials in the catalog would look like this:
```yaml
# current injection
my_data:
  type: pandas.CSVDataset
  filepath: /path/to/data.csv
  credentials: my_credentials  # a string which refers to a section in credentials.yml
```

```yaml
# new proposal
my_data:
  type: pandas.CSVDataset
  filepath: /path/to/data.csv
  credentials: ${credentials: my_credentials}
```
Context
This may look like a pure technical refactoring, but I actually expect it to be a real improvement for users, e.g.:
- improve readability: credentials are hard to grasp because they rely on template magic. Following our old Kedro motto *A sprinkle of magic is better than a spoonful of it* and the Zen of Python's *Explicit is better than implicit*, the above syntax will help people understand what is happening under the hood.
- define logic for empty credentials: users will be able to handle "empty" credentials (often not provided when developing locally but necessary in production, which currently forces you to keep empty keys during development); see https://github.com/kedro-org/kedro/issues/3634 for a user monkey-patching credentials in `settings.py`.
- pass credentials objects instead of strings: it is common to have complicated credentials objects which are hard to express purely in YAML; see https://github.com/kedro-org/kedro/issues/1621.
- enable custom logic to fetch credentials, which is often hardcoded in hooks right now: https://github.com/kedro-org/kedro/issues/2547.
- improve sharing of logic: a credentials resolver is just a normal Python function. Exposing it as a resolver would let people create their own credentials-fetching logic and override Kedro's simply by installing a package provided by their org.
- decouple the catalog code from config resolution: all this logic currently lives in the `DataCatalog` class (see below), which does not look right. Moving it to the config would eventually help https://github.com/kedro-org/kedro/issues/3659 if we want to go further in this direction.
Overall, this would likely be a general improvement. It overlaps somewhat with what is described in https://github.com/kedro-org/kedro/issues/1646, which has never been properly addressed despite a number of discussions around the topic.
Possible Implementation
Move the `_get_credentials` and `_resolve_credentials` logic out of the catalog:
https://github.com/kedro-org/kedro/blob/7237dc79e6b77475f6f1296f4fdd9d6aad3e1d11/kedro/io/data_catalog.py#L35-L58
https://github.com/kedro-org/kedro/blob/7237dc79e6b77475f6f1296f4fdd9d6aad3e1d11/kedro/io/data_catalog.py#L61-L83
and into a resolver.
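A minimal sketch of what the resolver body could look like, adapted from the behaviour of the current `_get_credentials` (the function name and error message here are illustrative, not the actual Kedro implementation):

```python
from typing import Any


def credentials_resolver(credentials_name: str, credentials: dict[str, Any]) -> Any:
    """Map a name from the catalog to an entry in the loaded
    credentials, mirroring the current string-based lookup."""
    try:
        return credentials[credentials_name]
    except KeyError as exc:
        raise KeyError(
            f"Unable to find credentials '{credentials_name}': check your "
            "data catalog and credentials configuration."
        ) from exc
```

In practice the loaded credentials mapping would have to be bound to the resolver before registration, e.g. with `functools.partial`, since an OmegaConf resolver only receives the arguments written in the interpolation.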
Possible Alternatives
Do nothing and keep the situation as is.
I like the idea in principle, but what would be a way for us to prevent hardcoding of credentials into the catalog directly?
Not sure what you mean: for me, the default credentials resolver provided by Kedro would do exactly what the current mechanism does (map a string in the catalog to an entry in credentials.yml), so I don't think there is a problem here.
A user who likes pushing Kedro's limits could eventually write their own resolver and take raw inputs directly in the catalog, but I don't think it's Kedro's responsibility to guard against every weird thing users can do (and, funnily enough, they can already write a custom resolver and use it in the catalog; my proposal would simply let them make their dangerous logic the default, but they would clearly be doing it on purpose). Do you have an example of a situation this proposal would enable which is not already possible and that we want to avoid at all costs?
> actually I am expecting this to be an actual improvement for users
Hard agree. Our YAML-first approach to credentials is weird, given that there are many other methods that are arguably more prevalent, like environment variables, secrets vaults, JSON token files...
Plus we have evidence that users struggle to understand the current approach https://github.com/kedro-org/kedro/issues/1646#issuecomment-1703023713
I think this is really elegant, and there is probably a way to make it non-breaking, just an extension.
@Galileo-Galilei The way OmegaConf resolvers work is that they eventually render the value into the dictionary you have provided. That means that in the end result, you get the credentials as part of the key there. The consumer (in this case Kedro) would see no difference between this:
```yaml
# new proposal
my_data:
  type: pandas.CSVDataset
  filepath: /path/to/data.csv
  credentials: ${credentials: my_credentials}
```
and this:
```yaml
# new proposal
my_data:
  type: pandas.CSVDataset
  filepath: /path/to/data.csv
  credentials: s3cre7!
```
Currently, the latter will not be accepted by Kedro and won't work, because we do the resolving after loading the config. This way we prevent users from accidentally checking in secrets. If we go down the resolver path (which I personally like), we need to make sure that my second example errors out and only your example works.
This is a totally valid concern, I did not think about this edge case.
It seems it should be possible to check whether the credentials subkey is interpolated with the [OmegaConf.is_interpolation](https://omegaconf.readthedocs.io/en/2.3_branch/usage.html#omegaconf-is-interpolation) method and raise an error if it isn't, but this will likely require #2973 to be completed first (and given the number of references to that issue, it feels like it will become a blocker at some point).
That's exactly what I always do: I use credentials as env vars. However, it would still be nice to be able to inject them into a dataset (think DB credentials for SQL datasets) in some way.
Originally posted by @MatthiasRoels in https://github.com/kedro-org/kedro/issues/1936#issuecomment-2264923282
I also think a resolver-based way of doing creds would be amazing; I had to hack something together to make things work for my SQLFrame experiment.
Hi Everyone,
As always, a lot of great thoughts & comments. 👍
If I may quickly add some poorly articulated concerns/suggestions that connect this issue to #2829:
AWS Secrets Manager bills on read ⚠️ 💵, making it crucial that only credentials that are actually needed are loaded.
Our problem is compounded by the fact that, in the `after_context_created` hook, there is unfortunately no access to tags, pipeline name, namespace, etc., i.e. data that would help filter things out and minimise AWS billing.
Looking very forward to reading your thoughts on this.
The suggestion makes sense, and we agree this is something that would be useful. At the moment we are doing some redesign of the `DataCatalog`; we will start looking at this once we finish the refactoring.
More evidence that our users are just using environment variables for credentials https://kedro.hall.community/support-lY6wDVhxGXNY/passing-credentials-to-a-node-in-kedro-DrfL3JQdGpPb
(Related to https://github.com/kedro-org/kedro/discussions/3979 cc @lordsoffallen)