dvc
dvc copied to clipboard
exp run: Support composing and dumping Hydra config.
The feature will be used depending on whether config.hydra.output_file has been set or not.
Uses hydra.initialize_config_dir and hydra.compose (from https://hydra.cc/docs/advanced/compose_api/) to build the config and dump it to config.hydra.output_file.
config.hydra.config_dir and config.hydra.config_name can be used to customize the values passed to hydra.initialize_config_dir and hydra.compose.
Can be combined with --set-param overrides.
Closes #8082
-
[X] β I have followed the Contributing to DVC checklist.
-
[ ] π If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. π
Depends on #8067
@dberenbaum please take a look
Nice work @daavoo! I really like how it achieves two important product goals:
- Makes DVC easier and more flexible for complex configuration.
- Makes it easier to track final parameters for Hydra users.
I came across two big questions when trying it:
- Can we merge with the existing output file instead of overwriting? Seems confusing that the behavior differs with and without the
hydraoptions enabled, but I guess this risks leaving in parameters that you wanted to drop π€. - Is this the simplest/best way to handle custom files (not
params.yaml)? The relationship betweenhydra.output_fileand--set-paramseems a bit awkward and hard to wrap my head around. Seems hard to know when--set-parammodifies my Hydra config vs. my params file. If I sethydra.output_filetocustom.yaml, then--set-param key=valwill modifyparams.yamland notcustom.yaml, which I think will confuse Hydra users. Maybe instead ofhydra.output_file, we need an option to configure the default params file (see #7939) and always output there? I think we should at least default to usingparams.yamlforhydra.output_file.
More minor questions:
- Should it work for
reproπ€ ? If not, we should be able to explain. - How can I get it to work for https://hydra.cc/docs/patterns/select_multiple_configs_from_config_group/?
Interesting side effect: dvc exp run -S parameters persist to the next experiment by default but Hydra will reset to default each experiment. I prefer the Hydra behavior and wonder how we can simplify the experience enough to make it the default (separate from this PR). π€
Can we merge with the existing output file instead of overwriting? Seems confusing that the behavior differs with and without the
hydraoptions enabled, but I guess this risks leaving in parameters that you wanted to drop π€.
I thought about merging but found that it would be confusing when combined with -set-param to determine which arguments are meant to be passed to hydra and which are meant to override the current status of the output_file.
Perhaps it would be better to clearly establish that hydra output_file will be overwritten on each exp run invocation.
Users can move the hardcoded values of output_file to hydra.config_name and the result would be the same as merging.
Should it work for repro π€ ? If not, we should be able to explain.
It should not, because of the same reason --set-param only works for exp run. We are creating/editing a file and it's required to track the changes with git before running the pipeline
How can I get it to work for https://hydra.cc/docs/patterns/select_multiple_configs_from_config_group/?
Did you find any particular issue?
I copy-pasted the structure of the link to multiconf folder.
[hydra]
output_file = params.yaml
config_dir = multiconf
# dvc.yaml
stages:
train:
cmd: python train.py
deps:
- train.py
params:
- params.yaml:
# train.py
import dvc.api
def my_app() -> None:
print(dvc.api.params_show())
if __name__ == "__main__":
my_app()
$ dvc exp run
Running stage 'train':
> python train.py
{'server': {'site': {'fb': {'domain': 'facebook.com'}, 'google': {'domain': 'google.com'}}, 'host': 'localhost', 'port': 443}}
$ dvc exp run -S 'server/site=[google,amazon]'
Running stage 'train':
> python train.py
{'server': {'site': {'google': {'domain': 'google.com'}, 'amazon': {'domain': 'amazon.com'}}, 'host': 'localhost', 'port': 443}}
Is this the simplest/best way to handle custom files (not params.yaml)?
My main motivation for going with hydra.output_file was to make it explicit, as I consider the composing and dumping a somehow aggressive operation. In addition, the compose and dump should also happen without --set-param, so users should have a way of toggling the behavior.
The relationship between hydra.output_file and --set-param seems a bit awkward and hard to wrap my head around. Seems hard to know when --set-param modifies my Hydra config vs. my params file. If I set hydra.output_file to custom.yaml, then --set-param key=val will modify params.yaml and not custom.yaml, which I think will confuse Hydra users.
Is there something specific that resulted obscure to you? I am not sure how much it comes from --set-param defaulting to params.yaml when no args and how much by adding the hydra.output_file.
What do you think about having a special key for --set-param named hydra (dvc exp run --set-param hydra:foo=1), which would explicitly indicate that those are args to be passed for the hydra compose and dump feature??
This could also alleviate https://github.com/iterative/dvc/pull/8093#issuecomment-1207971509 .
Maybe instead of hydra.output_file, we need an option to configure the default params file (see https://github.com/iterative/dvc/issues/7939) and always output there? I think we should at least default to using params.yaml for hydra.output_file.
Not sure I like the idea of entangling the hydra compose and dump with the DVC defaults parameters file.
@daavoo I haven't had a chance to look at your responses yet, but @dmpetrov asked why Hydra is configured in .dvc/config instead of at the pipeline/stage level inside dvc.yaml?
@daavoo I haven't had a chance to look at your responses yet, but @dmpetrov asked why Hydra is configured in .dvc/config instead of at the pipeline/stage level inside dvc.yaml?
No strong reasons. I think the main reason was that I didn't know how it could be configured without requiring a new section or a change in the semantics of the existing sections. Wanted to add it in a way that could reuse the existing semantics.
@daavoo I haven't had a chance to look at your responses yet, but @dmpetrov asked why Hydra is configured in .dvc/config instead of at the pipeline/stage level inside dvc.yaml?
No strong reasons. I think the main reason was that I didn't know how it could be configured without requiring a new section or a change in the semantics of the existing sections. Wanted to add it in a way that could reuse the existing semantics.
Another issue with configuring at the .dvc level is that it can be impractical in scenarios with multiple dvc.yaml (i.e. monorepo) unless dvc init --subdir is used.
Some ideas for enabling this at the stage level:
- Consider
hydraa reserved key insideparams.
If hydra is present in params of a stage, trigger the compose and dump functionality.
stages:
train:
cmd: python train.py ${train_config}
params:
- hydra
Q) How to configure?
Something like the following might look appealing but completely change the semantics / internal logic of params:
stages:
train:
cmd: python train.py ${train_config}
params:
- hydra:
output_file: params.yaml
config_dir: conf
- Introduce callbacks?
There were old discussions about this. But basically, the hydra compose and dump could be implemented/understood as a stage callback:
stages:
train:
cmd: python train.py ${train_config}
callbacks:
- hydra:
output_file: params.yaml
config_dir: conf
params:
- train_config
Pros and cons of the dvc.yaml approach:
Pros
- More granular control over where and how to use Hydra. For example, I could use the same hydra config_dir in two stages with different output_files for each and specify different
--set-paramvalues for each. - More transparent to other users that Hydra is being used.
Cons
- If reusing the same config across stages or
dvc.yamlfiles, it becomes repetitive (is this what you meant by "impractical in scenarios with multiple dvc.yaml? @daavoo?). For example, if I want to use the same hydra config across many stages, I need to add the hydra-specific section to each one. - Can't be used outside of stages (such as in
foreachor globalvars). This restricts its usage and makes it hard to configure for complex scenarios, which is where Hydra is needed most. - Harder to run without Hydra (more tightly coupled).
IMO the most likely and useful way to integrate Hydra and DVC is to have a massive Hydra config dir that gets reused throughout the project. For existing Hydra users, they are more likely to have a single Hydra config dir, and it's less work to integrate Hydra if it doesn't have to be configured at each stage in the pipeline. For that scenario, configuring at the project level in .dvc/config seems better, but we can also try to get feedback from users here.
For future reference, I set up https://github.com/dberenbaum/complex_config_example when I was testing this out π (edit: main branch uses current dvc functionality, hydra branch uses this PR)
Latest update:
config.hydra.config_diris the config option that enables/disables the hydra compose and dump behavior.- By default, composed config will be dumped to
params.yaml.config.hydra.output_fileis now optional and can be used for changing the default behavior. Contents ofoutput_filewill be overwritten.
- By default, composed config will be dumped to
params.yaml.config.hydra.output_fileis now optional and can be used for changing the default behavior. Contents ofoutput_filewill be overwritten.
I'm worried this will confuse more people than it helps. Once set, it seems hard to explain how it works and clunky to have to use -S custom_params.yaml:... for all Hydra overrides. If you feel it's needed, maybe you have a better idea of how to clearly explain the behavior?
We are only supporting a single Hydra output file, so why not use the default params file? If people don't want the path params.yaml, we can fix that in #7939. Is there a use case for having a different default parameters file and Hydra parameters file?
We are only supporting a single Hydra output file, so why not use the default params file? If people don't want the path params.yaml, we can fix that in https://github.com/iterative/dvc/issues/7939. Is there a use case for having a different default parameters file and Hydra parameters file?
I don't have a specific use case in mind. I guess I was trying to cover most scenarios. I can remove the option from the P.R. and wait for people to ask (if ever happens)
Hi and thank you for working on this! As a user of both Hydra and DVC I thought it might be helpful to contribute a realistic use case. It may already be handled well by the current implementation, I donβt know.
Hydraβs great for modular projects where interchangeable modules may have heterogeneous configuration options. For example, I may want to use either a random forest or a multi-layer perceptron as my model architecture. Its two config files may look like:
# config_dir/model/rf.yaml
_target_: my_package.models.RandomForest
tree_depth: 3
# config_dir/model/mlp.yaml
_target_: my_package.models.MLP
n_hidden: 128
Iβd like to now run a sequence of experiments.
- RF with
tree_depth=4. - MLP with
n_hidden=192 - RF with
tree_depth=4, again (I forgot about running it earlier, say)
The final experiment should be immediately recovered from cache but experiment 2 may have overwritten the value of model.n_hidden inside whatever fileβs tracking DVC params. So while 1 was run with model.tree_depth=4 model.n_hidden=128, 3 will be instantiated with model.tree_depth=4 model.n_hidden=192. But the two settings describe fundamentally the same experiment, since n_hidden is not used by RF at all.
Is this case correctly handled by your parameter storage?
Is this case correctly handled by your parameter storage?
@d-miketa I think so, with proper setup. I am making some assumptions here based on what you shared (like the use of hydra.utils.instantiate).
Here is the setup:
`conf` directory
$ tree conf
conf
βββ config.yaml
βββ model
βββ mlp.yaml
βββ rf.yaml
$ cat conf/config.yaml
defaults:
- model: rf
$ cat conf/model/mlp.yaml
_target_: models.MLP
n_hidden: 128
$ cat conf/model/rf.yaml
_target_: models.RandomForest
tree_depth: 3
`models/__init__.py`
@dataclass
class MLP:
n_hidden: int
@dataclass
class RandomForest:
tree_depth: int
`train.py`
import dvc.api
from hydra.utils import instantiate
def my_app() -> None:
model = instantiate(dvc.api.params_show()["model"])
print(model)
if __name__ == "__main__":
my_app()
`dvc.yaml`
stages:
train:
cmd: python train.py
params:
- params.yaml:
So, with that, I run:
dvc config hydra.config_dir conf
$ dvc exp run -S model.tree_depth=4
...
> python train.py
RandomForest(tree_depth=4)
...
$ dvc exp run -S model=mlp -S model.n_hidden=192
...
> python train.py
MLP(n_hidden=192)
...
$ dvc exp run -S model.tree_depth=5
...
> python train.py
RandomForest(tree_depth=5)
...
And:
$ dvc exp show
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Experiment Created model._target_ model.tree_depth model.n_hidden
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
workspace - models.RandomForest 5 -
master 11:47 AM models.RandomForest 3 -
βββ ea452af [exp-bf727] 11:48 AM models.RandomForest 5 -
βββ bb97bdf [exp-c8cde] 11:48 AM models.MLP - 192
βββ 730ecec [exp-5b346] 11:48 AM models.RandomForest 4 -
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
config.hydra.config_diris the config option that enables/disables the hydra compose and dump behavior.
Sorry to come back to this so late, but I'm still slightly confused.
hydra.config_dir acts as both a way to turn on the feature and to set the dir. However, hydra.config_name is optional, has a default value, and doesn't impact whether the feature is turned on.
This could lead to situations like setting only hydra.config_name and being confused about why it doesn't work. What do you think about either:
- Making a separate flag to turn the feature on and make both of these optional with defaults (
conf/config). - Requiring both flags to be set to turn on the behavior, and raising a helpful error if only one is set.
This could lead to situations like setting only hydra.config_name and being confused about why it doesn't work. What do you think about either:
- Making a separate flag to turn the feature on and make both of these optional with defaults (conf/config).
I think this one option has the cleanest UX. Added config.hydra.enabled and made config_dir and config_name both optional with defaults
@daavoo Ahhh thatβs amazing! So good to see the integrationβs naturally taking care of this problem - you clearly made some sound architectural choices. π
@iterative/dvc Could someone review please?
@dtrifiro @skshetry Could you both take a look, please?
@daavoo, I have rebased it to main, and added some skips to the tests that does not work in 3.11.