dvc repro: Pipelines (templating) and parameters - dvc looks in different directories for them?
Bug Report
Description
Hi, I have a question regarding dvc parameters and pipelines (templating). We have a project where we split the pipeline into several files (pipelines/pipeA/dvc.yaml, pipelines/pipeB/dvc.yaml, ...) and wanted to use params in the pipelines. We ran into several error messages saying that the parameters could not be found, so we reproduced the issue with this example project: https://github.com/iterative/example-get-started
The main question is: Why is dvc searching for the template parameters in the directory of the dvc.yaml (e.g. pipelines/pipeA/) and for other parameters in the root? Why not both in root? Does the wdir in the stages have an effect that we did not intend or expect? I do not want to have an additional params.yaml file in each subfolder for all the split pipelines.
Reproduce
- We cloned the repo (https://github.com/iterative/example-get-started) and followed the instructions. Everything works fine.
- Now we want to use templating (https://dvc.org/doc/user-guide/project-structure/pipelines-files#templating) to use parameters from params.yaml in our commands.
  We just tested it with the following modification:
  dvc.yaml:
      prepare:
        cmd: python src/prepare.py ${paths.test}  # This param is new
  params.yaml:
      # Rest is untouched, just added following lines
      paths:
        test: data/data.xml
  This works fine as well.
- Now we want to move the pipeline to a subfolder but keep the params.yaml in the root folder. Thus we move dvc.yaml to a new folder "pipelines". Running dvc repro results in:
  ERROR: failed to parse 'stages.prepare.cmd' in 'pipelines/dvc.yaml': Could not find 'paths.test'
- Okay, then we try moving params.yaml next to dvc.yaml, i.e. into the pipelines folder. Running dvc repro results in:
  ERROR: failed to reproduce 'pipelines/dvc.yaml': [Errno 2] No such file or directory: 'xyz/example-get-started/pipelines/data/data.xml'
  We assume that the parameters can be resolved now, but of course the working directory is wrong.
- So, let's change the wdir in each stage to "../". Running dvc repro results in:
  ERROR: failed to reproduce 'pipelines/dvc.yaml': Parameters 'prepare.split, prepare.seed' are missing from 'params.yaml'.
  DVC seems to find paths.test but not the other parameters?
- Moving params.yaml back to the root folder results in paths.test not being found again.
- So, now what seems to work is to split the params.yaml file:
  - In the root folder: params.yaml with all params as in the repo, i.e. without our newly introduced paths.test.
  - In the pipelines folder: params.yaml with just our paths.test parameter.
  And it works.
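The split that ended up working can be sketched as two files. This is only an illustration: the `prepare` values below are placeholders standing in for whatever the example repo's params.yaml already contains, and only `paths.test` is taken from the steps above.

```yaml
# params.yaml (repo root): read by the `params:` entries of each stage,
# because the stages use wdir "../". Values are illustrative placeholders.
prepare:
  split: 0.20
  seed: 20170428
---
# pipelines/params.yaml: read by templating, i.e. ${paths.test} in
# pipelines/dvc.yaml. This is the only key the templating needs here.
paths:
  test: data/data.xml
```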
Why is dvc searching for the template parameters (${paths.test}) in the directory of the dvc.yaml and for other parameters in the root? Why not both in root? Does the wdir in the stages have an effect that we did not intend or expect? I do not want to have an additional params.yaml file with the same content in each subfolder for all the split pipelines.
Expected
We would expect that we can move the dvc.yaml files to an arbitrary folder in the project, but that the first location where dvc searches for params.yaml is the root folder of the project. In particular, we would expect dvc to search for params.yaml in the same location for both parameters and templating.
Environment information
Output of dvc doctor:
DVC version: 2.9.3 (pip)
---------------------------------
Platform: Python 3.8.12 on Linux-5.4.0-96-generic-x86_64-with-glibc2.17
Supports:
webhdfs (fsspec = 2022.1.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p3
Caches: local
Remotes: https
Workspace directory: ext4 on /dev/nvme0n1p3
Repo: dvc, git
Hi, I had a little bit of time and did some debugging to track down the cause. Maybe it helps. First, there seems to be a difference between how templating (e.g. using ${wdir} in dvc.yaml) and parameters (e.g. params: - params.seed in dvc.yaml) are handled.
- Templating: The DataResolver (https://github.com/iterative/dvc/blob/6c8673b1582fcf1a92264b90885e8c4f8df4f63d/dvc/parsing/__init__.py#L147-L148) tries to fill the context from the params.yaml in the working directory, which is set to the location of the pipeline file. In my constructed example that would be example-get-started/pipelines. Thus, if I have params.yaml only in the root folder, we get the following trace output:
  2022-02-02 08:42:42,958 TRACE: /example-get-started/pipelines/params.yaml does not exist, it won't be used in parametrization
  2022-02-02 08:42:43,174 TRACE: Context during resolution of stage prepare: {}
  If we add the params.yaml to the pipelines folder, the context is filled:
  2022-02-02 08:44:02,669 TRACE: Context during resolution of stage prepare: {'paths': {'test': 'data/data.xml'}}
  We see that we only have the data from the /example-get-started/pipelines/params.yaml file and not from the root folder.
- Parameters: I just deleted the params.yaml in the root folder and got the MissingParamsError in ParamsDependency. If you take a look at the constructor of ParamsDependency (https://github.com/iterative/dvc/blob/6c8673b1582fcf1a92264b90885e8c4f8df4f63d/dvc/dependency/param.py#L38-L42), path is just filled with params.yaml. Thus, dvc searches only in the root folder. Because of this, dvc would not notice the params even if they are present in the pipeline subfolder, since templating and parameters are handled separately.
Conclusion
My conclusion would be that templating searches for the params file in the same directory as the pipeline file (due to the wdir method argument), while parameters are looked up in the root folder.
Proposal
A possibly easy solution would be for the templating mechanism (in the DataResolver) to look first in the same directory as the dvc.yaml and, if that fails, to fall back to the root folder. The behaviour for the parameters seems okay to me, or could there be a need to split the parameters into different files? In that case I think the parameters would need different names so that dvc can track them properly.
@philipp-kohl, the parameters are loaded relative to the dvc.yaml file, so they are not loaded from the root of the repo.
For other locations, you may need to explicitly specify the location of the parameters file:
vars:
  - ../params.yaml
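Put together with the stage from the report, a pipelines/dvc.yaml using this might look roughly like the sketch below. It assumes that entries in the top-level vars are resolved relative to the directory of the dvc.yaml, and that the params entries are read from the params.yaml in the stage's wdir. It has not been verified against DVC 2.9.3.

```yaml
# pipelines/dvc.yaml (sketch, not verified)
vars:
  - ../params.yaml        # assumed to resolve relative to this file's directory

stages:
  prepare:
    wdir: ..              # run the command from the repo root
    cmd: python src/prepare.py ${paths.test}
    deps:
      - data/data.xml
      - src/prepare.py
    params:               # assumed to be read from <wdir>/params.yaml
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
```

With this layout a single params.yaml in the repo root would serve both templating and the params entries, if the assumptions above hold.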
@skshetry My understanding of the issue is that wdir is inconsistent between params and templating substitutions.
Following the example from @philipp-kohl, I end up with a pipelines/dvc.yaml that looks like:
stages:
  prepare:
    wdir: "../"
    cmd: python src/prepare.py ${paths.data}
    deps:
      - data/data.xml
      - src/prepare.py
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
...
If params.yaml is in the wdir (which happens to be the root of the repo), I still get ERROR: failed to parse 'stages.prepare.cmd' in 'pipelines/dvc.yaml': Could not find 'paths.data'. Is this expected? I don't see it noted as a limitation in https://dvc.org/doc/user-guide/project-structure/pipelines-files#templating.
Hey everyone, we would also like to put the params.yaml and dvc.yaml files into a config subfolder.
What is the status of this issue?
@skshetry Thoughts on this?
Hey, guys, my team also would find it useful to have params.yaml and dvc.yaml in the same folder. Something like
contexts
└── <scope>
    └── dvc.yaml
where we could run dvc exp run -s contexts/<scope>/dvc.yaml:<stage> so that dvc.lock is created alongside dvc.yaml and params.yaml.
We use hydra composition too.
Right now we have only one dvc.yaml and params.yaml at root, but this complicates things when we need to change context. For example, we usually use the output of one context with another, but the dvc.lock being at the root prohibits this.
What is the status here? If it isn't completed yet, is there an alternative in the meantime? Thanks!
@vitalwarley Do you also want params.yaml in that same subfolder as dvc.yaml? That should work fine already. The issue here is specific to combining wdir with params.yaml and templating.
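As a rough sketch of the layout that should already work: with no wdir set, both templating and the params entries would look for params.yaml next to the dvc.yaml. The stage name, script, and keys below are hypothetical and only for illustration.

```yaml
# contexts/<scope>/params.yaml (hypothetical keys)
paths:
  input: data/input.csv
train:
  seed: 42
---
# contexts/<scope>/dvc.yaml
# No wdir is set, so all paths are relative to contexts/<scope>/.
stages:
  train:
    cmd: python train.py ${paths.input}
    deps:
      - train.py
      - data/input.csv
    params:
      - train.seed
    outs:
      - model.pkl
```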