Assess `OmegaConf` as replacement for `anyconfig`

Open merelcht opened this issue 3 years ago • 11 comments

Introduction

As part of the work on improving configuration in Kedro we should assess alternatives for anyconfig. anyconfig was initially chosen because it supports reading configuration of lots of different types. In practice, most users use yml files and so we might be able to use an alternative library that offers better functionality for yml configuration.

Task

Assess the following alternatives ordered by preference:

  1. OmegaConf
  2. Hydra (specifically the compose API)
  3. Dynaconf: this is the least preferred option, but it is already used in Kedro.

merelcht avatar Jun 30 '22 14:06 merelcht

@DavidRoschewitzQB it would be great to get your thoughts on Hydra here

datajoely avatar Jun 30 '22 15:06 datajoely

OmegaConf is one of Hydra's dependencies. Does the compose feature come from OmegaConf itself, or does Hydra actually add more on top of it?

Found an OmegaConf deck here.

noklam avatar Jun 30 '22 15:06 noklam

If you look at the Hydra compose API docs OmegaConf is an import: https://hydra.cc/docs/1.0/experimental/compose_api/#internaldocs-banner

datajoely avatar Jun 30 '22 15:06 datajoely

@datajoely Yes, they were created by the same author. I think there is a chance that we only need OmegaConf for the config loader.

If we want to enable hierarchical config then overriding nested config in CLI becomes relevant again.

noklam avatar Jun 30 '22 15:06 noklam

Thanks for the tag @datajoely. Happy to share some of our thoughts and considerations.

There are a few reasons why we chose to use hydra (specifically compose API) in our prototype (some of these might be pros or cons for configuration in kedro):

  • The entry point is a single .yaml file, which serves as the "root" of all other configuration. Therefore there is no need to search / loop over various files with a pattern. This combination of files is done explicitly in config.
  • The defaults list is neat syntax and allows for importing config from other files, which could then be overridden by the user if desired. This creates a hierarchical config tree.
    • It is possible to then define under which key or location the imports are placed with packaging.
  • Due to the way hydra generates a nested config dictionary, what we are treating as the namespace (essentially the location of any key in the tree) is generated by hydra and it can be extracted when, for example, passing to kedro.
  • Dependency injection is supported out of the box.

I'm certain OmegaConf would allow for most of this functionality, but potentially requiring additional logic on top of base OmegaConf. One thing we have not tried is leveraging OmegaConf functionality (e.g. custom resolvers) together with Hydra.

Some peculiarities that are good to be aware of:

  • hydra only recognises .yaml (not .yml) files
  • hydra/OmegaConf only support value interpolation, not key interpolation (stackoverflow from main contributor)

And lastly one consideration (when comparing e.g. with jinja) is that there is no support for conditionals or looping - not necessarily a dealbreaker, but potentially a limitation.

Do let me know if I can help clarify any points further. Exciting topic!

DavidRoschewitzQB avatar Jul 01 '22 09:07 DavidRoschewitzQB

Super helpful, thank you

datajoely avatar Jul 01 '22 10:07 datajoely

First: thanks for your great work Kedro team!

We are just about to migrate a project making heavy use of OmegaConf to Kedro (if successful we'll adopt kedro as a project standard) and were just testing out integration of the two. So +1 for this!

I wanted to describe our use case of OmegaConf, since we seem to have more-complex-than-average config requirements, where OmegaConf has served us well (and we have considered adopting hydra).

  1. We expose somewhat complex configuration to users, for example:

    • an arbitrary-length list of data sources of some given types, with their respective options.

    • User can select from a set of use cases, corresponding to different chunks of business logic, each with their own options (including additional datasources).

  2. For the backend we have exposed most of the parameters of the business logic as well as parts of the control flow:

    • Several chained models / tasks with their respective parameters, including different versions of these. This corresponds to namespaced sub-pipelines, with their own sub-configs.

    • Open to e.g., data engineering ad-hoc solutions or customisations, by overriding the default datasource pointers

  3. Thus far we have opted to use the "environments" directories to denote different user-setup configs (with their respective catalogs). Kedro's configuration AFAIU is a bit "one-dimensional" for lack of a better word; the order of precedence nicely covers the running environments (as in local, experiment A, B, ..., staging, production), but I have a feeling we have occupied this dimension by encoding the user-setup. Not sure how this will play with the other aspects (local, experiment) that we wish to modify independently, but I believe Hydra could handle this nicely.

The features we use in OmegaConf that we would like to keep when moving to Kedro:

  1. Can't help putting this here: attribute access is very neat :-)

  2. OmegaConf supports typed configs in a nice way (what they call structured configs). We use this much as the conf/base in Kedro, with the config schema defined as nested dataclasses. Main benefits of using the structured config:

    • Python-style typing, throwing hard errors

    • The sub-config dataclasses are used as type hints of the "orchestrating functions" (corresponding to Kedro pipelines)

    • Optional/required fields and default values are defined in the dataclass

    • OmegaConf has special types for string and value interpolation, which means that the logic of distributing config values to all places needed can be done in the dataclass using the yaml syntax.

  3. OmegaConf interpolation possibly has some additional features on top of JMESPath, but I'm not sure here:

    • relative paths ${.key_of_this_level}, ${..key_of_parent_level}

    • nesting of interpolation is allowed

  4. We have used custom resolvers, which are very powerful. However, my feeling is that these should be used with caution, since they might hide important logic in some obscure place (and before one knows it, one has a DSL without a spec...)

  5. We have looked into Hydra but not adopted it yet, mostly because it is simple to get started with overrides in OmegaConf (for e.g., experimentation and comparison of parameter values or versions), and we haven't hit the wall here yet. Also AFAIU it enforces a config directory structure which I suspect would become messy in our case.

Nice-to-haves for OmegaConf / Hydra / other

(Note that I'm new to Kedro, so the below might not appreciate existing features or the ideas behind the design choices.)

  1. Support for the omegaconf interpolation features (I guess this is a given)

  2. An interface for "structured configs", to point to base config dataclass in the config loader.

  3. Possibly with a convention or way to select between different config schemas.

  4. Some flexibility in what parts of the catalog / config that are strict, so it's not all or nothing:

    • Important production IO datasets might benefit from being part of the structured config
    • Adding / removing e.g. local intermediate datasets will probably be harder to work with if the whole catalog / config schema is strict
    • (The structured config might be out of place in the catalog however, since we have typing of the parameters in the dataset classes)
  5. (Probably a bad idea discarded long ago for good reasons) Possibly allow for single-yaml-file overrides; in development there is a lot of switching between the various yaml-files in the conf directory, and it would be nice to prototype in one place.

  6. and +1 for some iteration and branching abilities (that are not jinja), even if this is a long shot. Like if Hydra's multi-run feature was part of the config language.

pierrejeden avatar Jul 01 '22 13:07 pierrejeden

Thank you so much for your insights @DavidRoschewitzQB and @pierrejeden ! I'm personally very new to OmegaConf and Hydra, so it's great to already hear from your side which features are working well for you. Will definitely reach out if I have any questions about your needs for config in Kedro.

merelcht avatar Jul 04 '22 09:07 merelcht

After looking into OmegaConf, Hydra and Dynaconf in more detail and discussing the functionalities of those libraries, it was decided that OmegaConf will be the best replacement for anyconfig.

Some aspects and features of OmegaConf that helped us make this decision:

  • OmegaConf is much more popular than anyconfig and so we expect support and documentation for it to be better.
  • OmegaConf has built-in variable interpolation, which will allow users to use templating in their configuration files.
  • OmegaConf offers "resolvers" for more complex interpolation: both built-in resolvers and the ability to create custom ones. The resolver for environment variables will make it possible to pass credentials from environment variables to Kedro.
  • Ability to input config from the CLI
  • These and other features will also minimise the amount of config users need to write

Hydra offers things like hierarchical environments, but as a first step we'd like to introduce OmegaConf to solve as many user problems as possible. If we find additional features are needed, we can always decide to introduce Hydra because it uses OmegaConf under the hood.

Dynaconf was considered because it offers support for other file types than yml, but other than that it didn't seem to have a great benefit over OmegaConf.

The introduction of OmegaConf inside Kedro depends on a piece of user research to find out if users use other config types than yml. The outcome of that will determine whether the change will be breaking or not.

Our internal assessment is now completed, but please let us know your experiences and insights about OmegaConf or alternative configuration solutions. User perspectives are invaluable to us!

merelcht avatar Jul 12 '22 09:07 merelcht

Relevant community initiative from @heolin @patrykjarzyna https://github.com/webinterpret-ds/kedro-templar

datajoely avatar Jul 14 '22 10:07 datajoely

Dumping a thought on this: OmegaConf may also make the CLI interface and the Python API more consistent via nested dict updates.

OmegaConf.select(cfg, "foo.bar.zonk") == 10. Potentially it can accept both nested dictionaries and these nested dot keys.

noklam avatar Aug 30 '22 09:08 noklam

Closing this issue in favour of https://github.com/kedro-org/kedro/issues/1868

merelcht avatar Sep 23 '22 11:09 merelcht