guide: explain usage of multiple dvc.yaml files
In #1641, support for multiple dvc.yaml files was added. I think it would be good to give extra information on how this works and even encourage it where relevant.
Specifically, one or more of the following:

- `dvc.yaml` files can be in any subdirectory or nested subdirectory in the project structure, and DVC will find them
- DVC will process them just the same as if they were one DVC file, i.e. dependencies between stages in different `dvc.yaml` files are still respected
- Each `dvc.yaml` file will have its own `dvc.lock` file in the same directory
- Splitting a `dvc.yaml` file into multiple files is encouraged where there are clear logical groupings between stages. It avoids confusion, improves readability, and shortens commands by avoiding long paths preceding every filename
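To illustrate the cross-file dependency point, here is a minimal sketch (the stage names, scripts, and paths are made up for illustration): a stage in one `dvc.yaml` can depend on an output produced by a stage defined in another, with paths written relative to each `dvc.yaml`'s own location:

```yaml
# pipeline1/dvc.yaml (hypothetical)
stages:
  prepare:
    cmd: python prepare.py
    outs:
      - data/prepared.csv

# pipeline2/dvc.yaml (hypothetical)
stages:
  train:
    cmd: python train.py
    deps:
      - ../pipeline1/data/prepared.csv
```

DVC would then run `prepare` before `train` even though they live in different files.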
Other Details
(Added by @shcheklein)
- you need to use `--all-pipelines` or `--recursive` to find and run all pipelines
- a particular pipeline's `dvc.yaml` can be run with `dvc exp run pipeline1/dvc.yaml` or `cd pipeline1; dvc exp run` (works for `dvc repro` as well)
- each subdirectory could have its own `params.yaml` that will be used as the default params file for that particular pipeline
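The bullets above can be sketched as a tiny scaffold (all directory and parameter names here are made up; the `dvc` invocations are shown as comments since they require DVC to be installed):

```shell
# Scaffold a two-pipeline layout (hypothetical names)
mkdir -p pipeline1 pipeline2
printf 'v: 1\n' > pipeline1/params.yaml
printf 'v: 2\n' > pipeline2/params.yaml
printf 'stages:\n  p1-echo:\n    cmd: echo ${v}\n' > pipeline1/dvc.yaml
printf 'stages:\n  p2-echo:\n    cmd: echo ${v}\n' > pipeline2/dvc.yaml

# With DVC installed, either of these runs only pipeline1:
#   dvc repro pipeline1/dvc.yaml
#   cd pipeline1 && dvc repro
# And this finds and runs both pipelines:
#   dvc repro --all-pipelines
```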
Example

An artificial example. We should modify it a bit to be more realistic when we write the docs:
```
(.venv) √ Projects/test-pipelines % tree .
.
├── pipeline1
│   ├── dvc.lock
│   ├── dvc.yaml
│   └── params.yaml
└── pipeline2
    ├── dvc.lock
    ├── dvc.yaml
    └── params.yaml

2 directories, 6 files
```
```
(.venv) √ Projects/test-pipelines % cat pipeline1/dvc.yaml
stages:
  p1-echo:
    cmd: echo ${v}
(.venv) √ Projects/test-pipelines % cat pipeline1/params.yaml
v: 1
(.venv) √ Projects/test-pipelines % cat pipeline2/dvc.yaml
stages:
  p2-echo:
    cmd: echo ${v}
(.venv) √ Projects/test-pipelines % cat pipeline2/params.yaml
v: 2
```
To reference a params file in a different directory, try the explicit syntax for params files:

```yaml
params:
  - params.yaml:
```

within a stage, or globally per `dvc.yaml`.
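A slightly fuller sketch of that explicit syntax (the shared file path `../shared/params.yaml` and the key `v` are hypothetical, introduced only for illustration):

```yaml
stages:
  p1-echo:
    cmd: echo something
    params:
      # Load params from a file outside this stage's directory;
      # the path is written relative to this dvc.yaml's location.
      - ../shared/params.yaml:
          - v
```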
### Tasks
- [ ] Modify the [pipelines](https://dvc.org/doc/user-guide/project-structure/pipelines-files) page with this information
@amin-nejad this is an excellent summary!
Thanks @shcheklein!
A mention of the `--all-pipelines` argument to `dvc repro` would have helped me. It took some searching on Discuss to understand how to get the nested dvc.yaml files to work with `dvc repro --all-pipelines`. An explanation of `--recursive` could help too (I, for one, don't understand the help text).
How does one reference parameters when there are multiple dvc.yaml files? Should there be one params.yaml file in the same directory as each dvc.yaml? If so, how does one reference parameters from params files in other directories?
@JulianoLagana here is a very brief and small example that I tested:
```
(.venv) √ Projects/test-pipelines % tree .
.
├── pipeline1
│   ├── dvc.lock
│   ├── dvc.yaml
│   └── params.yaml
└── pipeline2
    ├── dvc.lock
    ├── dvc.yaml
    └── params.yaml

2 directories, 6 files
```
```
(.venv) √ Projects/test-pipelines % cat pipeline1/dvc.yaml
stages:
  p1-echo:
    cmd: echo ${v}
(.venv) √ Projects/test-pipelines % cat pipeline1/params.yaml
v: 1
(.venv) √ Projects/test-pipelines % cat pipeline2/dvc.yaml
stages:
  p2-echo:
    cmd: echo ${v}
(.venv) √ Projects/test-pipelines % cat pipeline2/params.yaml
v: 2
```
To reference a params file in a different directory, try the explicit syntax for params files:

```yaml
params:
  - params.yaml:
```

within a stage, or globally per `dvc.yaml`.
I created a directory tree like the one mentioned above. How can I choose to run only one of the pipelines? Will `dvc repro --all-pipelines` run them all? I want to select only one to run; how can I do that? Let's say I want to run pipeline1/dvc.yaml only.
@amdsobhy:

One way to run it:

```
$ cd pipeline1
$ dvc repro        # or: dvc exp run
```

Another way:

```
$ dvc repro pipeline1/dvc.yaml
# or
$ dvc exp run pipeline1/dvc.yaml
```
@shcheklein Thank you for your answer. I tried `dvc repro pipeline1/dvc.yaml` before, but it did not work for some reason. I think this might be because I moved the dvc.yaml from its original location in the root directory.
So let's say I currently have one dvc.yaml along with a dvc.lock file in the root directory of my repo (~/repo), and I want to move the files to ~/repo/pipeline1. Do I need to move the dvc.lock file as well? How should I make this transition? Also, I have already finished training while the dvc.yaml was at ~/repo/dvc.yaml, and I do not want to retrain. I just want to relocate the files for future training and to combine multiple models in the same repo.
> Do I need to move the dvc.lock file as well?
Yes, if it's a heavy pipeline and you don't want to run it again. If you need to change dvc.yaml in the process, you could run `dvc commit` at the end (assuming that you moved all the outputs, metrics, etc., and you are sure that this is exactly what should be produced) to save time and avoid running it again.
> How should I make this transition?
Moving files is fine. One thing you would need to check, and potentially change or also move, is the paths to the different dependencies, outputs, etc. You might need to update them, or move some additional files. It really depends on the dvc.yaml.
When editing paths in dvc.yaml and dvc.lock, are the paths relative to the root directory of the repo, relative to the location of the dvc.yaml file, or relative to where I execute the `dvc repro` command?
For example, I have my output in `~/repo/output/pipeline1/p1.weights` and I currently have my dvc file in `~/repo/dvc/pipeline1/dvc.yaml`. Before, I had the output as follows:

```yaml
outs:
  - output/pipeline1/p1.weights
```

Should the new output path be:

```yaml
outs:
  - dvc/../output/pipeline1/p1.weights
```

so that it is relative to the new dvc file location? Right now, when I try to run `dvc status dvc/pipeline1/dvc.yaml`, it reports that the files are deleted, because it is looking for them inside the ~/repo/dvc directory while they are one level up.
> When editing paths in dvc.yaml and dvc.lock, are the paths relative to the root directory of the repo, relative to the location of the dvc.yaml file, or relative to where I execute the `dvc repro` command?
I think they are relative to the dvc.yaml location.
> Should the new output path be:
Looks like it should be `../../output/pipeline1/p1.weights`?
Optionally, and only if it's needed, there are a few ways to manipulate this. Use `wdir` in the stage definition. You could also use the `dvc root` command to get the root of the project and then compose a stable path.
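A sketch of the `wdir` approach for the layout discussed above (the stage name and command are hypothetical; the output path is the one from the example). With `wdir` pointing at the repo root, the output path can stay the same as before the move:

```yaml
# ~/repo/dvc/pipeline1/dvc.yaml
stages:
  train:               # hypothetical stage name
    wdir: ../..        # run the command from the repo root
    cmd: python train.py
    outs:
      - output/pipeline1/p1.weights   # resolved relative to wdir
```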
I really appreciate your help, @shcheklein. Thank you so much.

Yes, I made a mistake; it's two levels up.
Is it possible for `wdir` to be set globally in dvc.yaml?
Also, what about the paths in the dvc.lock file? Do I need to manually modify them as well if I do not run the pipeline? And when modifying the dvc.lock file, is the `wdir` variable recognized in this file? I noticed the paths in the dvc.lock file are the old ones.
> Is it possible for `wdir` to be set globally in dvc.yaml?
No, not at the moment :(
> Also, what about the paths in the dvc.lock file? Do I need to manually modify them as well if I do not run the pipeline? And when modifying the dvc.lock file, is the `wdir` variable recognized in this file?
You can run `dvc commit`, I think, to forcefully recreate the lock file.
Adding this back as a p1, since it relates to general monorepo usage, which we are seeing is increasingly common.
Another topic to cover here is how to view experiment results when there are multiple pipelines or projects. From a recent email response:
With the command line and the VS Code extension, you can filter the columns to only those relevant to that pipeline. For example, to hide the columns from pipeline1, you might do something like `dvc exp show --drop 'pipeline1.*'`. In VS Code, you can duplicate the workspace so that you have a window open for each pipeline. If you use DVC Studio, you can configure a project directory and have a project for each pipeline without having to manually configure the columns.