dvc icon indicating copy to clipboard operation
dvc copied to clipboard

exp run: Support composing and dumping Hydra config.

Open daavoo opened this issue 3 years ago β€’ 11 comments

The feature will be used depending on whether config.hydra.output_file has been set or not.

Uses hydra.initialize_config_dir and hydra.compose (from https://hydra.cc/docs/advanced/compose_api/) to build the config and dump it to config.hydra.output_file.

config.hydra.config_dir and config.hydra.config_name can be used to customize the values passed to hydra.initialize_config_dir and hydra.compose.

Can be combined with --set-param overrides.

Closes #8082


  • [X] ❗ I have followed the Contributing to DVC checklist.

  • [ ] πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™


Depends on #8067

daavoo avatar Aug 04 '22 11:08 daavoo

@dberenbaum please take a look

daavoo avatar Aug 04 '22 11:08 daavoo

Nice work @daavoo! I really like how it achieves two important product goals:

  • Makes DVC easier and more flexible for complex configuration.
  • Makes it easier to track final parameters for Hydra users.

I came across two big questions when trying it:

  • Can we merge with the existing output file instead of overwriting? Seems confusing that the behavior differs with and without the hydra options enabled, but I guess this risks leaving in parameters that you wanted to drop πŸ€”.
  • Is this the simplest/best way to handle custom files (not params.yaml)? The relationship between hydra.output_file and --set-param seems a bit awkward and hard to wrap my head around. Seems hard to know when --set-param modifies my Hydra config vs. my params file. If I set hydra.output_file to custom.yaml, then --set-param key=val will modify params.yaml and not custom.yaml, which I think will confuse Hydra users. Maybe instead of hydra.output_file, we need an option to configure the default params file (see #7939) and always output there? I think we should at least default to using params.yaml for hydra.output_file.

More minor questions:

  • Should it work for repro πŸ€” ? If not, we should be able to explain.
  • How can I get it to work for https://hydra.cc/docs/patterns/select_multiple_configs_from_config_group/?

Interesting side effect: dvc exp run -S parameters persist to the next experiment by default but Hydra will reset to default each experiment. I prefer the Hydra behavior and wonder how we can simplify the experience enough to make it the default (separate from this PR). πŸ€”

dberenbaum avatar Aug 06 '22 19:08 dberenbaum

Can we merge with the existing output file instead of overwriting? Seems confusing that the behavior differs with and without the hydra options enabled, but I guess this risks leaving in parameters that you wanted to drop πŸ€”.

I thought about merging but found that it would be confusing when combined with -set-param to determine which arguments are meant to be passed to hydra and which are meant to override the current status of the output_file.

Perhaps it would be better to clearly establish that hydra output_file will be overwritten on each exp run invocation. Users can move the hardcoded values of output_file to hydra.config_name and the result would be the same as merging.

daavoo avatar Aug 08 '22 10:08 daavoo

Should it work for repro πŸ€” ? If not, we should be able to explain.

It should not, because of the same reason --set-param only works for exp run. We are creating/editing a file and it's required to track the changes with git before running the pipeline

daavoo avatar Aug 08 '22 11:08 daavoo

How can I get it to work for https://hydra.cc/docs/patterns/select_multiple_configs_from_config_group/?

Did you find any particular issue?

I copy-pasted the structure of the link to multiconf folder.

[hydra]
    output_file = params.yaml
    config_dir = multiconf
# dvc.yaml
stages:
  train:
    cmd: python train.py
    deps:
    - train.py
    params:
    - params.yaml:
# train.py
import dvc.api

def my_app() -> None:
    print(dvc.api.params_show())

if __name__ == "__main__":
    my_app()
$ dvc exp run
Running stage 'train':                                                
> python train.py
{'server': {'site': {'fb': {'domain': 'facebook.com'}, 'google': {'domain': 'google.com'}}, 'host': 'localhost', 'port': 443}}
$ dvc exp run -S 'server/site=[google,amazon]'
Running stage 'train':                                                
> python train.py
{'server': {'site': {'google': {'domain': 'google.com'}, 'amazon': {'domain': 'amazon.com'}}, 'host': 'localhost', 'port': 443}}

daavoo avatar Aug 08 '22 12:08 daavoo

Is this the simplest/best way to handle custom files (not params.yaml)?

My main motivation for going with hydra.output_file was to make it explicit, as I consider the composing and dumping a somehow aggressive operation. In addition, the compose and dump should also happen without --set-param, so users should have a way of toggling the behavior.

The relationship between hydra.output_file and --set-param seems a bit awkward and hard to wrap my head around. Seems hard to know when --set-param modifies my Hydra config vs. my params file. If I set hydra.output_file to custom.yaml, then --set-param key=val will modify params.yaml and not custom.yaml, which I think will confuse Hydra users.

Is there something specific that resulted obscure to you? I am not sure how much it comes from --set-param defaulting to params.yaml when no args and how much by adding the hydra.output_file.

What do you think about having a special key for --set-param named hydra (dvc exp run --set-param hydra:foo=1), which would explicitly indicate that those are args to be passed for the hydra compose and dump feature?? This could also alleviate https://github.com/iterative/dvc/pull/8093#issuecomment-1207971509 .

Maybe instead of hydra.output_file, we need an option to configure the default params file (see https://github.com/iterative/dvc/issues/7939) and always output there? I think we should at least default to using params.yaml for hydra.output_file.

Not sure I like the idea of entangling the hydra compose and dump with the DVC defaults parameters file.

daavoo avatar Aug 08 '22 13:08 daavoo

@daavoo I haven't had a chance to look at your responses yet, but @dmpetrov asked why Hydra is configured in .dvc/config instead of at the pipeline/stage level inside dvc.yaml?

dberenbaum avatar Aug 08 '22 17:08 dberenbaum

@daavoo I haven't had a chance to look at your responses yet, but @dmpetrov asked why Hydra is configured in .dvc/config instead of at the pipeline/stage level inside dvc.yaml?

No strong reasons. I think the main reason was that I didn't know how it could be configured without requiring a new section or a change in the semantics of the existing sections. Wanted to add it in a way that could reuse the existing semantics.

daavoo avatar Aug 08 '22 17:08 daavoo

@daavoo I haven't had a chance to look at your responses yet, but @dmpetrov asked why Hydra is configured in .dvc/config instead of at the pipeline/stage level inside dvc.yaml?

No strong reasons. I think the main reason was that I didn't know how it could be configured without requiring a new section or a change in the semantics of the existing sections. Wanted to add it in a way that could reuse the existing semantics.

Another issue with configuring at the .dvc level is that it can be impractical in scenarios with multiple dvc.yaml (i.e. monorepo) unless dvc init --subdir is used.

daavoo avatar Aug 09 '22 08:08 daavoo

Some ideas for enabling this at the stage level:

  • Consider hydra a reserved key inside params.

If hydra is present in params of a stage, trigger the compose and dump functionality.

stages:
  train: 
    cmd: python train.py ${train_config}
    params:
        - hydra

Q) How to configure?

Something like the following might look appealing but completely change the semantics / internal logic of params:

stages:
  train: 
    cmd: python train.py ${train_config}
    params:
        - hydra:
            output_file: params.yaml
            config_dir: conf 
  • Introduce callbacks?

There were old discussions about this. But basically, the hydra compose and dump could be implemented/understood as a stage callback:

stages:
  train: 
    cmd: python train.py ${train_config}
    callbacks:
      - hydra:
          output_file: params.yaml
          config_dir: conf
    params:
        - train_config

daavoo avatar Aug 09 '22 09:08 daavoo

Pros and cons of the dvc.yaml approach:

Pros

  • More granular control over where and how to use Hydra. For example, I could use the same hydra config_dir in two stages with different output_files for each and specify different --set-param values for each.
  • More transparent to other users that Hydra is being used.

Cons

  • If reusing the same config across stages or dvc.yaml files, it becomes repetitive (is this what you meant by "impractical in scenarios with multiple dvc.yaml? @daavoo?). For example, if I want to use the same hydra config across many stages, I need to add the hydra-specific section to each one.
  • Can't be used outside of stages (such as in foreach or global vars). This restricts its usage and makes it hard to configure for complex scenarios, which is where Hydra is needed most.
  • Harder to run without Hydra (more tightly coupled).

IMO the most likely and useful way to integrate Hydra and DVC is to have a massive Hydra config dir that gets reused throughout the project. For existing Hydra users, they are more likely to have a single Hydra config dir, and it's less work to integrate Hydra if it doesn't have to be configured at each stage in the pipeline. For that scenario, configuring at the project level in .dvc/config seems better, but we can also try to get feedback from users here.

dberenbaum avatar Aug 09 '22 20:08 dberenbaum

For future reference, I set up https://github.com/dberenbaum/complex_config_example when I was testing this out πŸ˜… (edit: main branch uses current dvc functionality, hydra branch uses this PR)

dberenbaum avatar Aug 12 '22 17:08 dberenbaum

Latest update:

  • config.hydra.config_dir is the config option that enables/disables the hydra compose and dump behavior.
  • By default, composed config will be dumped to params.yaml. config.hydra.output_file is now optional and can be used for changing the default behavior. Contents of output_file will be overwritten.

daavoo avatar Aug 17 '22 14:08 daavoo

  • By default, composed config will be dumped to params.yaml. config.hydra.output_file is now optional and can be used for changing the default behavior. Contents of output_file will be overwritten.

I'm worried this will confuse more people than it helps. Once set, it seems hard to explain how it works and clunky to have to use -S custom_params.yaml:... for all Hydra overrides. If you feel it's needed, maybe you have a better idea of how to clearly explain the behavior?

We are only supporting a single Hydra output file, so why not use the default params file? If people don't want the path params.yaml, we can fix that in #7939. Is there a use case for having a different default parameters file and Hydra parameters file?

dberenbaum avatar Aug 19 '22 15:08 dberenbaum

We are only supporting a single Hydra output file, so why not use the default params file? If people don't want the path params.yaml, we can fix that in https://github.com/iterative/dvc/issues/7939. Is there a use case for having a different default parameters file and Hydra parameters file?

I don't have a specific use case in mind. I guess I was trying to cover most scenarios. I can remove the option from the P.R. and wait for people to ask (if ever happens)

daavoo avatar Aug 19 '22 15:08 daavoo

Hi and thank you for working on this! As a user of both Hydra and DVC I thought it might be helpful to contribute a realistic use case. It may already be handled well by the current implementation, I don’t know.

Hydra’s great for modular projects where interchangeable modules may have heterogeneous configuration options. For example, I may want to use either a random forest or a multi-layer perceptron as my model architecture. Its two config files may look like:

# config_dir/model/rf.yaml

_target_: my_package.models.RandomForest
tree_depth: 3
# config_dir/model/mlp.yaml

_target_: my_package.models.MLP
n_hidden: 128

I’d like to now run a sequence of experiments.

  1. RF with tree_depth=4.
  2. MLP with n_hidden=192
  3. RF with tree_depth=4, again (I forgot about running it earlier, say)

The final experiment should be immediately recovered from cache but experiment 2 may have overwritten the value of model.n_hidden inside whatever file’s tracking DVC params. So while 1 was run with model.tree_depth=4 model.n_hidden=128, 3 will be instantiated with model.tree_depth=4 model.n_hidden=192. But the two settings describe fundamentally the same experiment, since n_hidden is not used by RF at all.

Is this case correctly handled by your parameter storage?

d-miketa avatar Aug 20 '22 09:08 d-miketa

Is this case correctly handled by your parameter storage?

@d-miketa I think so, with proper setup. I am making some assumptions here based on what you shared (like the use of hydra.utils.instantiate).

Here is the setup:

`conf` directory
$ tree conf
conf
β”œβ”€β”€ config.yaml
└── model
    β”œβ”€β”€ mlp.yaml
    └── rf.yaml
$ cat conf/config.yaml 
defaults:
  - model: rf
$ cat conf/model/mlp.yaml 
_target_: models.MLP
n_hidden: 128                                                    
$ cat conf/model/rf.yaml  
_target_: models.RandomForest
tree_depth: 3                                                       
`models/__init__.py`
@dataclass
class MLP:
    n_hidden: int

@dataclass
class RandomForest:
    tree_depth: int                                 
`train.py`
import dvc.api
from hydra.utils import instantiate

def my_app() -> None:
    model = instantiate(dvc.api.params_show()["model"])
    print(model)

if __name__ == "__main__":
    my_app()
`dvc.yaml`
stages:
  train:
    cmd: python train.py
    params:
    - params.yaml:

So, with that, I run:

dvc config hydra.config_dir conf
$ dvc exp run -S model.tree_depth=4 
...
> python train.py
RandomForest(tree_depth=4)  
...
$ dvc exp run -S model=mlp -S model.n_hidden=192 
...
> python train.py
MLP(n_hidden=192) 
...
$ dvc exp run -S model.tree_depth=5
...
> python train.py
RandomForest(tree_depth=5)  
...

And:

$ dvc exp show
 ────────────────────────────────────────────────────────────────────────────────────────────── 
  Experiment                Created    model._target_        model.tree_depth   model.n_hidden  
 ────────────────────────────────────────────────────────────────────────────────────────────── 
  workspace                 -          models.RandomForest   5                  -               
  master                    11:47 AM   models.RandomForest   3                  -               
  β”œβ”€β”€ ea452af [exp-bf727]   11:48 AM   models.RandomForest   5                  -               
  β”œβ”€β”€ bb97bdf [exp-c8cde]   11:48 AM   models.MLP            -                  192             
  └── 730ecec [exp-5b346]   11:48 AM   models.RandomForest   4                  -               
 ────────────────────────────────────────────────────────────────────────────────────────────── 

daavoo avatar Aug 22 '22 09:08 daavoo

  • config.hydra.config_dir is the config option that enables/disables the hydra compose and dump behavior.

Sorry to come back to this so late, but I'm still slightly confused.

hydra.config_dir acts as both a way to turn on the feature and to set the dir. However, hydra.config_name is optional, has a default value, and doesn't impact whether the feature is turned on.

This could lead to situations like setting only hydra.config_name and being confused about why it doesn't work. What do you think about either:

  • Making a separate flag to turn the feature on and make both of these optional with defaults (conf/config).
  • Requiring both flags to be set to turn on the behavior, and raising a helpful error if only one is set.

dberenbaum avatar Aug 27 '22 12:08 dberenbaum

This could lead to situations like setting only hydra.config_name and being confused about why it doesn't work. What do you think about either:

- Making a separate flag to turn the feature on and make both of these optional with defaults (conf/config).

I think this one option has the cleanest UX. Added config.hydra.enabled and made config_dir and config_name both optional with defaults

daavoo avatar Aug 29 '22 08:08 daavoo

@daavoo Ahhh that’s amazing! So good to see the integration’s naturally taking care of this problem - you clearly made some sound architectural choices. 😁

d-miketa avatar Aug 29 '22 11:08 d-miketa

@iterative/dvc Could someone review please?

dberenbaum avatar Aug 29 '22 12:08 dberenbaum

@dtrifiro @skshetry Could you both take a look, please?

efiop avatar Aug 29 '22 13:08 efiop

@daavoo, I have rebased it to main, and added some skips to the tests that does not work in 3.11.

skshetry avatar Aug 29 '22 14:08 skshetry