
dvc repro: Pipelines (templating) and parameters - dvc looks in different directories for them?

Open philipp-kohl opened this issue 3 years ago • 7 comments

Bug Report

Description

Hi, I have a question regarding dvc parameters and pipelines (templating). We have a project where we split the pipeline into several files (pipelines/pipeA/dvc.yaml, pipelines/pipeB/dvc.yaml, ...) and wanted to use params in the pipelines. We ran into several error messages saying that the parameters could not be found, so we reproduced the issue with this example project: https://github.com/iterative/example-get-started

The main question is: Why is dvc searching for the template parameters in the same directory as the dvc.yaml (e.g. pipelines/pipeA/) and for other parameters in the root? Why not both in root? Does the wdir in the stages affect something in a way that is unintentional or unexpected for us? I do not want to keep an additional params.yaml file in each subfolder for all the split pipelines.

Reproduce

  1. We cloned the repo (https://github.com/iterative/example-get-started) and followed the instructions. Everything works fine.
  2. Now we want to use templating (https://dvc.org/doc/user-guide/project-structure/pipelines-files#templating) to use parameters from params.yaml in our commands. We tested it with the following modification:

     dvc.yaml:

         prepare:
           cmd: python src/prepare.py ${paths.test}  # This param is new

     params.yaml:

         ... # Rest is untouched, just added following lines
         paths:
           test: data/data.xml

     This works fine as well.
  3. Now we want to move the pipeline to a subfolder but keep the params.yaml in the root folder. Thus we move dvc.yaml to a new folder "pipelines". Running dvc repro results in ERROR: failed to parse 'stages.prepare.cmd' in 'pipelines/dvc.yaml': Could not find 'paths.test'
  4. Okay, so we move params.yaml next to dvc.yaml, i.e. into the pipelines folder. Running dvc repro results in ERROR: failed to reproduce 'pipelines/dvc.yaml': [Errno 2] No such file or directory: 'xyz/example-get-started/pipelines/data/data.xml'. We assume that the template parameters can be resolved now, but of course the working directory is wrong.
  5. So, let's change the wdir in each stage to "../". Run dvc repro: ERROR: failed to reproduce 'pipelines/dvc.yaml': Parameters 'prepare.split, prepare.seed' are missing from 'params.yaml'. DVC seems to find the paths.test but not the other parameters?
  6. Moving params.yaml back to root folder results in not finding paths.test again.
  7. So what seems to work is splitting the params.yaml file:
    1. In the root folder: params.yaml with all parameters as in the repo, i.e. without our newly introduced paths.test.
    2. In the pipelines folder: params.yaml with just our paths.test parameter.

     With this split, it works.

Why is dvc searching for the template parameters (${paths.test}) in the same directory as the dvc.yaml and for other parameters in the root? Why not both in root? Does the wdir in the stages affect something in a way that is unintentional or unexpected for us? I do not want to keep an additional params.yaml file with the same content in each subfolder for all the split pipelines.

Expected

We would expect that we can move the dvc.yaml files to an arbitrary folder in the project, and that the first location where dvc searches for params.yaml is the root folder of the project. In particular, we would expect dvc to search for params.yaml in the same location for both parameters and templating.

Environment information

Output of dvc doctor:

DVC version: 2.9.3 (pip)
---------------------------------
Platform: Python 3.8.12 on Linux-5.4.0-96-generic-x86_64-with-glibc2.17
Supports:
        webhdfs (fsspec = 2022.1.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p3
Caches: local
Remotes: https
Workspace directory: ext4 on /dev/nvme0n1p3
Repo: dvc, git

philipp-kohl avatar Jan 28 '22 14:01 philipp-kohl

Hi, I had a little bit of time and did some debugging to track down the cause. Maybe it helps. First of all, there seems to be a difference between how templating (e.g. using ${wdir} in dvc.yaml) and parameters (e.g. params: -params.seed in dvc.yaml) are handled.

  1. Templating: The DataResolver (https://github.com/iterative/dvc/blob/6c8673b1582fcf1a92264b90885e8c4f8df4f63d/dvc/parsing/__init__.py#L147-L148) tries to fill the context from the params.yaml in the working directory, which is set to the location of the pipeline file. In my constructed example that is example-get-started/pipelines. Thus, if params.yaml exists only in the root folder, we get the following info:

         2022-02-02 08:42:42,958 TRACE: /example-get-started/pipelines/params.yaml does not exist, it won't be used in parametrization
         2022-02-02 08:42:43,174 TRACE: Context during resolution of stage prepare: {}

     If we add the params.yaml to the pipelines folder, the context is filled:

         2022-02-02 08:44:02,669 TRACE: Context during resolution of stage prepare: {'paths': {'test': 'data/data.xml'}}

     We see that the context only contains the data from the /example-get-started/pipelines/params.yaml file and nothing from the root folder.

  2. Parameters: I deleted the params.yaml in the root folder and got the MissingParamsError in ParamsDependency. If you take a look at the ParamsDependency constructor (https://github.com/iterative/dvc/blob/6c8673b1582fcf1a92264b90885e8c4f8df4f63d/dvc/dependency/param.py#L38-L42), path is just filled with params.yaml. Thus, dvc searches only in the root folder. Because of this, dvc would not notice the params even if they are present in the pipeline subfolder, since templating and parameters are handled separately.
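To make the templating half concrete, here is a minimal sketch of `${...}` substitution against a context dictionary like the one in the trace output above. It is not DVC's resolver: `interpolate`, `lookup` and `MissingKeyError` are hypothetical names, and the error text only imitates the "Could not find" message quoted in this thread.

```python
import re
from typing import Any, Mapping

# Matches placeholders such as ${paths.test}
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


class MissingKeyError(Exception):
    """Raised when a placeholder cannot be resolved (hypothetical name)."""


def lookup(context: Mapping[str, Any], dotted_key: str) -> Any:
    """Walk a nested mapping along a dotted key such as 'paths.test'."""
    node: Any = context
    for part in dotted_key.split("."):
        if not isinstance(node, Mapping) or part not in node:
            raise MissingKeyError(f"Could not find '{dotted_key}'")
        node = node[part]
    return node


def interpolate(template: str, context: Mapping[str, Any]) -> str:
    """Replace every ${a.b} placeholder with its value from the context."""
    return _PLACEHOLDER.sub(
        lambda match: str(lookup(context, match.group(1))), template
    )


cmd = "python src/prepare.py ${paths.test}"

# Context filled from a params.yaml next to dvc.yaml
print(interpolate(cmd, {"paths": {"test": "data/data.xml"}}))

# Empty context, as when no params.yaml sits next to dvc.yaml
try:
    interpolate(cmd, {})
except MissingKeyError as exc:
    print(exc)
```

With an empty context the placeholder cannot be resolved, which matches the failure in step 3 of the report.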

Conclusion

My conclusion is that templating searches for the params file in the directory of the pipeline file (because of the wdir method argument), whereas parameters are searched for in the root folder.
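The two lookups can be modelled as plain path computations. This is a toy model of the behaviour reported in this thread, not DVC's implementation. The function names are made up for illustration, and the assumption that `params:` entries follow the stage's wdir (itself relative to the dvc.yaml) is inferred from steps 4 and 5 of the report.

```python
import posixpath
from pathlib import PurePosixPath


def templating_params_path(dvc_yaml: str) -> str:
    """Templating: params.yaml is looked up next to the dvc.yaml file."""
    return str(PurePosixPath(dvc_yaml).parent / "params.yaml")


def dependency_params_path(dvc_yaml: str, wdir: str = ".") -> str:
    """params: entries: params.yaml is looked up in the stage's wdir,
    where wdir is taken relative to the directory of dvc.yaml (assumption)."""
    candidate = PurePosixPath(dvc_yaml).parent / wdir / "params.yaml"
    return posixpath.normpath(str(candidate))


# Step 5 of the report: dvc.yaml in pipelines/, every stage with wdir "../"
print(templating_params_path("pipelines/dvc.yaml"))         # pipelines/params.yaml
print(dependency_params_path("pipelines/dvc.yaml", "../"))  # params.yaml
```

Under this model the two paths only coincide when wdir is left at its default, which is why a single params.yaml stops being enough once wdir points elsewhere.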

Proposal

A possibly easy solution would be for the templating mechanism (in the DataResolver) to look first in the same directory as the dvc.yaml and, if that fails, to fall back to the root folder. The behaviour for the parameters seems okay to me. Or could there be a need to split the parameters across different files? In that case I think the parameters would need different names so that dvc can track them properly.
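The proposed fallback could look roughly like this. It is only a sketch of the idea, not a patch against DVC: `find_params_file` and its arguments are hypothetical and do not exist in DVC's code base.

```python
from pathlib import Path
from typing import Optional


def find_params_file(
    dvc_yaml_dir: Path, repo_root: Path, name: str = "params.yaml"
) -> Optional[Path]:
    """Return the first existing params file, preferring the directory of
    dvc.yaml and falling back to the repository root."""
    for candidate_dir in (dvc_yaml_dir, repo_root):
        candidate = candidate_dir / name
        if candidate.is_file():
            return candidate
    return None  # the caller decides how to report a missing file


# Usage (paths are illustrative):
# find_params_file(Path("pipelines"), Path("."))
```

A local params.yaml would still win over the one in the root, so existing projects that keep the file next to dvc.yaml would behave as before.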

philipp-kohl avatar Feb 02 '22 08:02 philipp-kohl

@philipp-kohl, the parameters are loaded relative to the dvc.yaml file, so they are not loaded from the root of the repo.

For other locations, you may need to explicitly specify the location of that parameters file:

vars:
  - ../params.yaml
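For deeper layouts such as pipelines/pipeA/dvc.yaml, the relative path to put under vars can be computed rather than counted by hand. A small sketch, assuming the vars entry is resolved relative to the dvc.yaml as described above; `vars_entry` is a hypothetical helper, not part of DVC.

```python
import posixpath


def vars_entry(dvc_yaml: str, params_file: str) -> str:
    """Path of params_file relative to the directory containing dvc_yaml.
    Both arguments are given relative to the repository root."""
    dvc_dir = posixpath.dirname(dvc_yaml) or "."
    return posixpath.relpath(params_file, start=dvc_dir)


print(vars_entry("pipelines/dvc.yaml", "params.yaml"))        # ../params.yaml
print(vars_entry("pipelines/pipeA/dvc.yaml", "params.yaml"))  # ../../params.yaml
```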

skshetry avatar Feb 11 '22 08:02 skshetry

@skshetry My understanding of the issue is that wdir is inconsistent between params and templating substitutions.

Following the example from @philipp-kohl, I end up with a pipelines/dvc.yaml that looks like:

stages:
  prepare:
    wdir: "../"
    cmd: python src/prepare.py ${paths.data}
    deps:
    - data/data.xml
    - src/prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared
...

If params.yaml is in the wdir (which happens to be the root of the repo), I still get ERROR: failed to parse 'stages.prepare.cmd' in 'pipelines/dvc.yaml': Could not find 'paths.data'. Is this expected? I don't see it noted as a limitation in https://dvc.org/doc/user-guide/project-structure/pipelines-files#templating.

dberenbaum avatar Feb 11 '22 15:02 dberenbaum

Hey everyone, we would also like to put the params.yaml and dvc.yaml files into a config subfolder. What is the status of this issue?

haimat avatar Apr 18 '23 11:04 haimat

@skshetry Thoughts on this?

dberenbaum avatar Apr 21 '23 21:04 dberenbaum

Hey, guys, my team also would find it useful to have params.yaml and dvc.yaml in the same folder. Something like

contexts
└── <scope>
    └── dvc.yaml

where we could run dvc exp run -s contexts/<scope>/dvc.yaml:<stage> so that dvc.lock is created alongside dvc.yaml and params.yaml.

We use hydra composition too.

Right now we have only one dvc.yaml and params.yaml at root, but this complicates things when we need to change context. For example, we usually use the output of one context with another, but the dvc.lock being at the root prohibits this.

What is the status here? Is there an alternative we can use until this is resolved? Thanks!

vitalwarley avatar Nov 27 '23 17:11 vitalwarley

Hey, guys, my team also would find it useful to have params.yaml and dvc.yaml in the same folder. Something like

contexts
└── <scope>
    └── dvc.yaml

@vitalwarley Do you also want params.yaml in that same subfolder as dvc.yaml? That should work fine already. The issue here is specific to combining wdir with params.yaml and templating.

dberenbaum avatar Dec 05 '23 18:12 dberenbaum