dvc
dvc copied to clipboard
flatten `deps` to facilitate more flexible templating
The templating feature currently admits little flexibility for stage keys (e.g. deps
) that expect lists. As a result, foreach
stages cannot be given a variable number of dependencies. Causing lists to be flattened would make templating more powerful. For example, you could give extra dependencies to certain templated stages:
stages:
build:
foreach:
us:
deps:
uk:
deps:
- uk_data
do:
cmd: myscript ${key}
deps:
- myscript
- ${item.deps}
I have implemented the change at https://github.com/itcarroll/dvc/blob/feature/flattened-deps/dvc/dependency/init.py#L52, by (non-recursive) list flattening in dvc.dependency.loads_from
. I am sure it's not the right way to do it, and probably has unintended consequences, but it demonstrates the concept (and no tests failed).
I don't know how templating responds to missing keys, but not having to include the empty list for us
would be even more better.
In ancient discussions on foreach
I found a suggestion (by @jorgeorpinel):
I wonder actually if you can just do deps: ${list} (a list from params.yaml)
This would add the same flexibility I'm interested in with the above, but it fails YAML validation, which appears to require deps
to get a list before ${item}
is expanded. So, some change in the validation process might be an alternate route to this feature.
Hi, I'm also interested in this feature https://discord.com/channels/485586884165107732/485596304961962003/946817686220984350
data:
foreach:
librispeech:
script: libri.sh
links:
- https://www.openslr.org/resources/12/dev-clean.tar.gz
- https://www.openslr.org/resources/12/dev-other.tar.gz
- https://www.openslr.org/resources/12/test-clean.tar.gz
- https://www.openslr.org/resources/12/test-other.tar.gz
- https://www.openslr.org/resources/12/train-clean-100.tar.gz
- https://www.openslr.org/resources/12/train-clean-360.tar.gz
- https://www.openslr.org/resources/12/train-other-500.tar.gz
tedlium:
script: tedlium.sh
links:
- https://www.openslr.org/resources/51/TEDLIUM_release-3.tgz
do:
cmd:
- ./proc/${item.script} --output ${nas_data_dir}/${key} --tmp ${nas_data_dir} ${item.links}
deps:
- ${item.script}
- ${item.links}
outs:
- data/${key}/${key}_text
- data/${key}/${key}_utt2spk
- data/${key}/${key}_wav.scp
- data/${key}/${key}_waves
I have stages that have different list lengths
So, some change in the validation process might be an alternate route to this feature.
This relates to other requests where templating is being limited by the YAML validation which occurs before the interpolation, for example https://github.com/iterative/dvc/issues/6989
@daavoo Do you know if these scenarios work fine if not for YAML validation failures?
@daavoo Do you know if these scenarios work fine if not for YAML validation failures?
For #6989 yes, for this issue additional work would be needed
If all the deps are a single list (deps: ${those}
, as suggested in #8171), is additional work needed?
Big up-vote for this one 👍
It would be really, really helpful.
Is there any progress on this issue? It really is a much-needed feature.
@bezbahen0 We have not made any progress yet.
@skshetry @daavoo Can we resolve the templating before doing validation? If that's not straightforward, can we skip validation errors triggered by unresolved interpolation?
@skshetry @daavoo Can we resolve the templating before doing validation? If that's not straightforward, can we skip validation errors triggered by unresolved interpolation?
Will take a look, it has been a while 😅
This feature would greatly simplify our current pipeline specification!
We don't have a good way to validate after templating, say, for example, to determine that deps
has to be a list of strings/file-paths after the interpolation.
That's why we limited templated dvc.yaml and the resolved dvc.yaml to follow the same schema.
Will take a look, it has been a while 😅
I already ended up taking a look and might open a PR for discussion.