
Simplify execution of various runs with different params

Open behrica opened this issue 3 years ago • 10 comments

As an alternative to repeated (manual) executions of:

dvc exp run -S a=1
dvc exp run -S a=2

it might be useful (and clean, I think) to allow some way to "pass several param files" (from a folder, maybe), which would then "auto-queue" the corresponding runs.

behrica avatar Jun 13 '22 17:06 behrica

This is what Facebook's Hydra does, and it really is intuitive and clean.

Houstonwp avatar Jul 23 '22 15:07 Houstonwp

We are currently exploring different ways to integrate with Hydra. This feature is part of the scope.

daavoo avatar Jul 28 '22 11:07 daavoo

https://github.com/iterative/dvc/pull/8187 is adding support for using Hydra syntax in --set-param, so:

dvc exp run -S 'a=1,2' --queue

will put 2 experiments in the queue, which can later be executed with dvc queue start.

daavoo avatar Sep 06 '22 14:09 daavoo

My main use case for the feature request would be to auto-generate such parameter files. #8187 does not allow this, because the parameters are interwoven with the rest of the exp run command.

Parameter files would allow using any algorithm to calculate the concrete parameters, without needing to include all such algorithms in dvc itself.

behrica avatar Sep 07 '22 19:09 behrica

@behrica Would you be interested in a Python API to do this?

My feeling is that if you need to auto-generate all parameter combinations, you may as well call dvc exp run --queue from your code for each parameter combination (have you tried this as a workaround?). Saving them all to different files in a folder seems sort of against DVC expectations, since it is assumed each experiment contains only its own parameters. It also doesn't seem to work for adaptive algorithms, where not every parameter combination that will be tried is known from the start.

A Python API could add more parameter combinations as you go. It also adds possibilities to do more complex operations, like randomly selecting a parameter from an interval. Maybe we could support that in a way that is broadly useful across any search algorithm?
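A minimal sketch of that workaround, for illustration only: it expands a parameter space into one dvc exp run --queue call per combination, using repeated -S overrides. The function names queue_commands and queue_all and the example space are hypothetical; only dvc exp run, --queue and -S come from the thread.

```python
import itertools
import subprocess


def queue_commands(space):
    """Return one `dvc exp run --queue` argument list per combination in `space`.

    `space` maps a parameter name to the list of values to try (hypothetical shape).
    """
    names = list(space)
    commands = []
    for values in itertools.product(*(space[name] for name in names)):
        cmd = ["dvc", "exp", "run", "--queue"]
        for name, value in zip(names, values):
            # Each parameter becomes its own -S override.
            cmd += ["-S", f"{name}={value}"]
        commands.append(cmd)
    return commands


def queue_all(space, run=subprocess.run):
    """Execute every generated command; `run` is injectable so it can be stubbed."""
    for cmd in queue_commands(space):
        run(cmd, check=True)
```

Calling queue_all({"a": [1, 2]}) inside a DVC project would issue the two commands from the original post with --queue added; the queued experiments can then be executed with dvc queue start.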

dberenbaum avatar Sep 08 '22 16:09 dberenbaum

I am very happy that DVC is language independent. I use it from Clojure. So I would favor a command line which takes a file with all the parameter combinations I want.

Then I could generate such a file from Clojure.

behrica avatar Sep 08 '22 17:09 behrica

But this makes it rather static, indeed.

behrica avatar Sep 08 '22 17:09 behrica

But the workaround you mentioned is feasible as well.

behrica avatar Sep 08 '22 17:09 behrica

I think the general question to decide is this:

Should dvc itself start to provide various algorithms to "statically calculate" concrete parameters from "a user supplied parameter space"? Yes/no.

It seems to me that #8187 is a first step in this direction: the user gives the space, and dvc calculates all combinations. Taking a random subset of these would be another algorithm, and using a Sobol sequence (https://en.wikipedia.org/wiki/Sobol_sequence) would be another optimization. Both either take only a subset of all combinations or work with continuous intervals and split them smartly.

Allowing a "parameter file" would externalize this and keep it out of dvc. But then #8187 should maybe not be merged.

This does not yet address the question of doing this non-statically, for example by using past training results.
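To make the "random subset" idea concrete, here is a small sketch that enumerates the full grid and then samples a fixed number of combinations from it. It is not part of dvc or #8187; the names grid and random_subset and the seed argument are assumptions made for illustration.

```python
import itertools
import random


def grid(space):
    """Enumerate every combination of a parameter space as a list of dicts."""
    names = list(space)
    return [dict(zip(names, values)) for values in itertools.product(*space.values())]


def random_subset(space, k, seed=0):
    """Pick at most `k` distinct combinations from the grid, reproducibly."""
    combos = grid(space)
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return rng.sample(combos, min(k, len(combos)))
```

Each selected dict could then be turned into -S overrides or written to a parameter file; a Sobol sequence would replace the sampling step and is not shown here.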

behrica avatar Sep 08 '22 18:09 behrica

I see the "user interface" as very similar to #8187:

$ dvc exp run -Sfile "param_combinations.csv" --queue     # file in some tabular format, maybe CSV
Queueing with '{'params.yaml': ['db=mysql', 'schema=warehouse']}'.
Queued experiment '5ab98b8' for future execution.
Queueing with '{'params.yaml': ['db=mysql', 'schema=school']}'.
Queued experiment '57c2fb6' for future execution.
Queueing with '{'params.yaml': ['db=postgresql', 'schema=warehouse']}'.
Queued experiment 'b9d6391' for future execution.
Queueing with '{'params.yaml': ['db=postgresql', 'schema=school']}'.
Queued experiment '145cd55' for future execution.
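The -Sfile option above is a proposal, not an existing dvc flag. The same effect can be approximated outside dvc; the sketch below assumes a CSV whose header row holds the parameter names and turns each data row into a dvc exp run --queue command with one -S override per column. The function name commands_from_csv is hypothetical.

```python
import csv
import io


def commands_from_csv(text):
    """Build one `dvc exp run --queue` argument list per CSV data row.

    Assumes the first row of `text` is a header of parameter names.
    """
    commands = []
    for row in csv.DictReader(io.StringIO(text)):
        cmd = ["dvc", "exp", "run", "--queue"]
        for name, value in row.items():
            cmd += ["-S", f"{name}={value}"]
        commands.append(cmd)
    return commands
```

For a file with the header db,schema and the row mysql,warehouse, this yields a command carrying -S db=mysql -S schema=warehouse, which can be passed to subprocess.run from inside the DVC project.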

behrica avatar Sep 08 '22 18:09 behrica

it might be useful (and clean, I think) to allow some way to "pass several param files" (from a folder maybe)

By the way, this is already possible to do with Hydra. You would save them as YAML files in your conf directory and then select each conf file like dvc exp run -S conf_file=file1,file2. There's a simple example in https://github.com/dberenbaum/hydra-dvc-multirun.

dberenbaum avatar Oct 19 '22 13:10 dberenbaum

This syntax does not work for me:

[hydra-dvc-multirun]$ dvc exp run --queue -S conf_file=one.yaml,two.yaml
ERROR: unexpected error - Could not override 'conf_file'.
To append to your config use +conf_file=one.yaml: Key 'conf_file' is not in struct
    full_key: conf_file
    object_type=dict

behrica avatar Oct 20 '22 08:10 behrica

Sorry, there is some hydra-specific syntax. You have to use group (since that's the dir inside conf where the files are stored), and you can optionally drop .yaml. See the readme of that repo:

$ dvc exp run --queue -S group=one,two
Queueing with overrides '{'params.yaml': ['group=one']}'.
Queued experiment '634a8fa' for future execution.
Queueing with overrides '{'params.yaml': ['group=two']}'.
Queued experiment '0c283dc' for future execution.

dberenbaum avatar Oct 20 '22 17:10 dberenbaum

I tried it out, and that might work. My use case would be massive grid searches, so I would maybe generate a few thousand files. I could give each of them a random name and pass them all in one very long list (probably reaching the maximum length of a command line).

I have now done it in a completely different way, which works as well, without using Hydra at all.
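One way around the command-line length concern would be to split the list of config names across several invocations. The sketch below groups names into comma-separated group=... overrides that each stay under a character budget; the function name batched_overrides and the default budget of 2000 characters are arbitrary assumptions, not limits taken from dvc or any shell.

```python
def batched_overrides(names, key="group", max_chars=2000):
    """Split `names` into `key=a,b,c` override strings of at most `max_chars` each.

    A single name longer than the budget still gets its own (oversized) batch.
    """
    batches, current = [], []
    for name in names:
        candidate = current + [name]
        if current and len(f"{key}=" + ",".join(candidate)) > max_chars:
            # Adding this name would exceed the budget: close the current batch.
            batches.append(f"{key}=" + ",".join(current))
            current = [name]
        else:
            current = candidate
    if current:
        batches.append(f"{key}=" + ",".join(current))
    return batches
```

Each returned string would be passed as the argument of one dvc exp run --queue -S call.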

Basically I loop over all my parameter combinations in code and do:

  1. write params.yaml to disk
  2. shell out and run dvc exp run --queue

This may even be good enough to close this issue.
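The two steps above can be sketched in Python as follows (the original was done from Clojure). The function name queue_from_params and the default file name params.yaml are assumptions. The file is written with json.dumps, which works because JSON is valid YAML and avoids a YAML dependency.

```python
import json
import subprocess
from pathlib import Path


def queue_from_params(combinations, params_path="params.yaml", run=subprocess.run):
    """For each parameter dict: overwrite the params file, then queue an experiment.

    `run` is injectable so the dvc call can be stubbed outside a DVC project.
    """
    path = Path(params_path)
    for params in combinations:
        # Step 1: write the parameters to disk (JSON syntax is valid YAML).
        path.write_text(json.dumps(params, indent=2) + "\n")
        # Step 2: shell out and queue an experiment from the current workspace.
        run(["dvc", "exp", "run", "--queue"], check=True)
```

Because the loop lives in user code, any search strategy can produce the combinations, which keeps those algorithms out of dvc itself.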

behrica avatar Oct 21 '22 09:10 behrica

Makes sense @behrica! Yeah, there are too many different ways to do this to have them all be "built in," but glad you found a pattern that works for you.

dberenbaum avatar Oct 21 '22 16:10 dberenbaum