
guide: explain usage of multiple dvc.yaml files

Open amin-nejad opened this issue 4 years ago • 15 comments

In #1641, it was added that multiple dvc.yaml files are supported. I think it would be good to give extra information on how this works and even encourage it where relevant.

Specifically one or more of the following:

  • dvc.yaml files can be in any subdirectory or nested subdirectory in the project structure and DVC will find them
  • DVC will process them just the same as if they were a single dvc.yaml file, i.e. dependencies between stages in different dvc.yaml files are still respected
  • Each dvc.yaml file will have its own dvc.lock file in the same directory
  • Splitting a dvc.yaml file into multiple files is encouraged where there are clear logical groupings between stages. It avoids confusion, improves readability and shortens commands by avoiding long paths preceding every filename

Other Details

(Added by @shcheklein)

  • you need to use --all-pipelines or --recursive to find and run all pipelines
  • a particular pipeline dvc.yaml can be run with dvc exp run pipeline1/dvc.yaml or cd pipeline1; dvc exp run (works for dvc repro as well)
  • each subdirectory could have its own params.yaml that will be used as a default params file for a particular pipeline
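
To make the layout described above concrete, here is a minimal sketch that walks a project tree and reports every directory holding a dvc.yaml, along with whether a dvc.lock and a params.yaml sit next to it. It only illustrates the directory convention; it is not how DVC itself discovers pipelines, and the names find_pipelines and demo are made up for this example.

```python
import tempfile
from pathlib import Path


def find_pipelines(project_root):
    """Return (directory, has_lock, has_params) for every dvc.yaml under project_root."""
    root = Path(project_root)
    found = []
    for dvc_yaml in sorted(root.rglob("dvc.yaml")):
        directory = dvc_yaml.parent
        found.append(
            (
                directory.relative_to(root).as_posix(),
                (directory / "dvc.lock").is_file(),
                (directory / "params.yaml").is_file(),
            )
        )
    return found


def demo():
    # Build a throwaway layout with two pipeline directories (hypothetical names).
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("pipeline1", "pipeline2"):
            directory = Path(tmp) / name
            directory.mkdir()
            for filename in ("dvc.yaml", "dvc.lock", "params.yaml"):
                (directory / filename).touch()
        return find_pipelines(tmp)


if __name__ == "__main__":
    for directory, has_lock, has_params in demo():
        print(directory, "dvc.lock:", has_lock, "params.yaml:", has_params)
```

A pipeline directory that reports no dvc.lock has most likely never been run or committed yet.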

Example

An artificial example. We should modify it a bit to be more realistic when we write docs:

(.venv) √ Projects/test-pipelines % tree .
.
├── pipeline1
│   ├── dvc.lock
│   ├── dvc.yaml
│   └── params.yaml
└── pipeline2
    ├── dvc.lock
    ├── dvc.yaml
    └── params.yaml

2 directories, 6 files
(.venv) √ Projects/test-pipelines % cat pipeline1/dvc.yaml
stages:
  p1-echo:
    cmd: echo ${v}
(.venv) √ Projects/test-pipelines % cat pipeline1/params.yaml
v: 1
(.venv) √ Projects/test-pipelines % cat pipeline2/dvc.yaml
stages:
  p2-echo:
    cmd: echo ${v}
(.venv) √ Projects/test-pipelines % cat pipeline2/params.yaml
v: 2

To reference a param file in a different directory, try an explicit syntax for param files:

      params:
        - params.yaml:

within a stage, or globally per dvc.yaml.
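
As a rough illustration of why the two identical cmd lines above behave differently, the sketch below imitates the ${v} substitution with Python's string.Template, using the values from each directory's params.yaml. This is only a stand-in for DVC's own templating, not its implementation; the pipelines dictionary and resolve_cmd are hypothetical names for this example.

```python
from string import Template

# The cmd and params values mirror the two example pipelines; in a real
# project they live in pipelineN/dvc.yaml and pipelineN/params.yaml.
pipelines = {
    "pipeline1": {"cmd": "echo ${v}", "params": {"v": 1}},
    "pipeline2": {"cmd": "echo ${v}", "params": {"v": 2}},
}


def resolve_cmd(pipeline_dir):
    """Fill ${...} placeholders using the params that sit next to that dvc.yaml."""
    entry = pipelines[pipeline_dir]
    return Template(entry["cmd"]).substitute(entry["params"])


if __name__ == "__main__":
    for name in pipelines:
        print(name, "->", resolve_cmd(name))
```

Each pipeline resolves the same template against its own default params file, so pipeline1 and pipeline2 end up with different commands.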

### Tasks
- [ ] Modify the [pipelines](https://dvc.org/doc/user-guide/project-structure/pipelines-files) page with this information

amin-nejad avatar May 21 '21 10:05 amin-nejad

@amin-nejad this is an excellent summary!

shcheklein avatar May 21 '21 18:05 shcheklein

Thanks @shcheklein!

amin-nejad avatar May 22 '21 12:05 amin-nejad

A mention of the "--all-pipelines" argument to dvc repro would have helped me. Took some Discuss searching to understand how to get the nested dvc.yaml files to go with dvc repro --all-pipelines. An explanation of "--recursive" could help too (I for one, don't understand the help).

itcarroll avatar Apr 12 '22 18:04 itcarroll

How does one reference parameters when having multiple dvc.yaml files? Should there be one params.yaml file in the same directory as each dvc.yaml? If so, how to reference parameters from parameter files in other directories?

JulianoLagana avatar Feb 22 '23 10:02 JulianoLagana

@JulianoLagana here is a very brief and small example that I tested:

(.venv) √ Projects/test-pipelines % tree .
.
├── pipeline1
│   ├── dvc.lock
│   ├── dvc.yaml
│   └── params.yaml
└── pipeline2
    ├── dvc.lock
    ├── dvc.yaml
    └── params.yaml

2 directories, 6 files
(.venv) √ Projects/test-pipelines % cat pipeline1/dvc.yaml
stages:
  p1-echo:
    cmd: echo ${v}
(.venv) √ Projects/test-pipelines % cat pipeline1/params.yaml
v: 1
(.venv) √ Projects/test-pipelines % cat pipeline2/dvc.yaml
stages:
  p2-echo:
    cmd: echo ${v}
(.venv) √ Projects/test-pipelines % cat pipeline2/params.yaml
v: 2

To reference a param file in a different directory, try an explicit syntax for param files:

      params:
        - params.yaml:

within a stage, or globally per dvc.yaml.

shcheklein avatar Feb 23 '23 01:02 shcheklein

I created a directory tree like the one mentioned above. How can I choose to run only one of the pipelines? Will dvc repro --all-pipelines run them all? Let's say I want to run only pipeline1/dvc.yaml; how can I do that?

amdsobhy avatar Feb 25 '23 01:02 amdsobhy

@amdsobhy :

One way to run it is to do:

$ cd pipeline1
$ dvc repro    # or: dvc exp run

Another way to do this:

$ dvc repro pipeline1/dvc.yaml

or

$ dvc exp run pipeline1/dvc.yaml
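
If the pipelines are driven from a script, the invocations above can be assembled programmatically. The following is a small sketch, not an official DVC API: repro_cmd and exp_run_cmd are hypothetical helpers that only build argument lists from the commands and the --all-pipelines flag mentioned in this thread.

```python
from pathlib import PurePosixPath


def repro_cmd(pipeline_dir=None):
    """Build `dvc repro` for one pipeline directory, or for all pipelines if None."""
    if pipeline_dir is None:
        return ["dvc", "repro", "--all-pipelines"]
    return ["dvc", "repro", str(PurePosixPath(pipeline_dir) / "dvc.yaml")]


def exp_run_cmd(pipeline_dir):
    """Build `dvc exp run` targeting the dvc.yaml of a single pipeline directory."""
    return ["dvc", "exp", "run", str(PurePosixPath(pipeline_dir) / "dvc.yaml")]


if __name__ == "__main__":
    print(" ".join(repro_cmd("pipeline1")))
    print(" ".join(exp_run_cmd("pipeline1")))
    print(" ".join(repro_cmd()))
```

The resulting lists can be passed to subprocess.run(cmd, check=True) from the project root, assuming dvc is installed and on the PATH.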

shcheklein avatar Feb 25 '23 02:02 shcheklein

@shcheklein Thank you for your answer. I tried

dvc repro pipeline1/dvc.yaml

before, but it did not work for some reason. I think this might be because I moved dvc.yaml from its original location in the root directory.

So let's say I currently have one dvc.yaml along with a dvc.lock file in the root directory of my repo, ~/repo, and I want to move the files to ~/repo/pipeline1. Do I need to move the dvc.lock file as well? How should I make this transition? I have already finished training while dvc.yaml was at ~/repo/dvc.yaml, and I do not want to retrain. I just want to relocate the files for future training and to combine multiple models in the same repo.

amdsobhy avatar Feb 25 '23 15:02 amdsobhy

Do I need to move the dvc.lock file as well?

Yes, if it's a heavy pipeline and you don't want to run it again. If you need to change dvc.yaml in the process, you could run dvc commit at the end (assuming that you moved all the outputs, metrics, etc. and you are sure that they are exactly what should be produced) to save time and avoid running it again.

How should I make this transition?

Moving files is fine. One thing you would need to check, and potentially change, is the paths to dependencies, outputs, etc. You might need to update them or move some additional files. It really depends on the dvc.yaml.

shcheklein avatar Feb 25 '23 17:02 shcheklein

When editing paths in dvc.yaml and dvc.lock, are the paths relative to the root directory of the repo, relative to the location of the dvc.yaml file, or relative to where I execute the dvc repro command?

for example I have my output in

~/repo/output/pipeline1/p1.weights

and I currently have my dvc file in

~/repo/dvc/pipeline1/dvc.yaml

before I had the output as following:

outs:
- output/pipeline1/p1.weights

Should the new output path be:

outs:
- dvc/../output/pipeline1/p1.weights

so that it is relative to the new dvc file location?

Right now when I try to run dvc status dvc/pipeline1/dvc.yaml, it reports back that the files are deleted because it is looking for them inside the ~/repo/dvc directory while they are one level up

amdsobhy avatar Feb 25 '23 18:02 amdsobhy

When editing paths in dvc.yaml and dvc.lock, are the paths relative to the root directory of the repo, relative to the location of the dvc.yaml file, or relative to where I execute the dvc repro command?

I think they are relative to the dvc.yaml location.

Should the new output path be:

Looks like it should be ../../output/pipeline1/p1.weights?


Optionally, and only if it's needed, there are a few ways to manipulate this. You can use wdir in the stage definition. You could also use the dvc root command to get the root of the project and then compose a stable path.
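
To double-check a rewritten path, one option is to compute it rather than count directory levels by hand. This is a minimal sketch that assumes, as suggested above, that paths in dvc.yaml are resolved relative to the directory containing that dvc.yaml and that no wdir is set; the helper name path_relative_to_dvc_yaml is made up for this example.

```python
import posixpath


def path_relative_to_dvc_yaml(path_from_root, dvc_yaml_dir_from_root):
    """Re-express a repo-root-relative path relative to the directory holding dvc.yaml."""
    return posixpath.relpath(path_from_root, start=dvc_yaml_dir_from_root)


if __name__ == "__main__":
    # Output lives at <repo>/output/pipeline1/p1.weights,
    # dvc.yaml lives at <repo>/dvc/pipeline1/dvc.yaml.
    print(path_relative_to_dvc_yaml("output/pipeline1/p1.weights", "dvc/pipeline1"))
```

For the layout in this thread the helper climbs two levels, which matches the path suggested above.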

shcheklein avatar Feb 25 '23 19:02 shcheklein

I really appreciate your help @shcheklein Thank you so much.

Yes, I made a mistake; it's two levels up.

Is it possible to set "wdir" globally in dvc.yaml?

Also, what about the paths in the dvc.lock file? Do I need to manually modify them as well if I do not run the pipeline? And when modifying the dvc.lock file, is the wdir variable recognized in this file?

I noticed the paths in the dvc.lock file are the old ones.

amdsobhy avatar Feb 25 '23 19:02 amdsobhy

Is it possible to set "wdir" globally in dvc.yaml?

No, not at the moment :(

Also, what about the paths in the dvc.lock file? Do I need to manually modify them as well if I do not run the pipeline? And when modifying the dvc.lock file, is the wdir variable recognized in this file?

I think you can run dvc commit to forcefully recreate the lock file.

shcheklein avatar Feb 25 '23 19:02 shcheklein

Adding back as a p1 since it relates to general monorepo usage, which we are seeing is increasingly common

dberenbaum avatar Aug 16 '23 20:08 dberenbaum

Another topic to cover here is how to view experiment results when there are multiple pipelines or projects. From a recent email response:

With the command line and VS Code extension, you can filter the columns to only those relevant to that pipeline. For example, to only show pipeline1, you might hide the other pipelines' columns with something like dvc exp show --drop 'pipeline2.*' (--drop removes the columns that match the pattern). In VS Code, you can duplicate the workspace so that you have a window open for each pipeline.

If you use DVC Studio, you can configure a project directory and have a project for each pipeline without having to manually configure the columns.
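
To get a feel for what a --drop pattern does to the experiments table, here is a rough sketch that applies a regex to a list of column names. The column names are hypothetical, and matching with re.search is an assumption about how the pattern is applied, so treat this as an illustration of the idea rather than DVC's behavior.

```python
import re


def visible_columns(columns, drop=None):
    """Return the columns that remain after hiding those matching the drop regex."""
    if drop is None:
        return list(columns)
    return [col for col in columns if not re.search(drop, col)]


if __name__ == "__main__":
    # Hypothetical column names for the two example pipelines.
    columns = ["Experiment", "pipeline1/params.yaml:v", "pipeline2/params.yaml:v"]
    print(visible_columns(columns, drop="pipeline2.*"))
```

Dropping the columns of every other pipeline is what leaves a single pipeline's columns visible.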

dberenbaum avatar Oct 27 '23 13:10 dberenbaum