Allow skipping certain stages when dependencies are missing
If we add a new dataset, we want to run all the data processing steps, but for example skip the evaluation as we do not have labels yet. We still want to utilize the foreach functionality to iterate through our different datasets.
For this scenario, it would be helpful to have a kind of deps which specifies "if not present, skip the stage instead of throwing an error, but behave just as normal deps otherwise". Currently, we have to add dummy files manually in order to be able to run dvc repro.
We can use dvc repro --keep-going for now, but this does not differentiate between missing dependencies and other errors that might occur. Also, in cases like my example above, we want to treat the current state of the pipeline as clean.
How about --allow-missing? Can it be applied in this case, @Luux? https://dvc.org/doc/command-reference/repro#example-only-pull-pipeline-data-as-needed
Why is it not enough to specify a target like dvc repro train so that the pipeline stops before the eval stage?
@dberenbaum We want the entire pipeline to be clean. The idea of our pipeline is that we want to define everything relevant in our pipeline configuration (dvc.yaml), so that we just need to run dvc repro and do not have to think about anything else to get our data up-to-date from a user perspective. If dvc repro or dvc repro --dry is clean, this means we know that everything is fine.
If our datasets variable consists of a list of 4 datasets, and we have a stage eval with a foreach loop, this would result in
eval@dataset1
eval@dataset2
eval@dataset3
eval@dataset4
But dataset4 might not have the required labels.yaml yet, so it fails. Of course we could ignore errors, but dvc repro would still mark eval@dataset4 as dirty. To change that, we want to add a flag to the labels.yaml dependency that leads dvc to simply not consider the stage if it is missing. This means, in this scenario, dvc repro should only consider
eval@dataset1
eval@dataset2
eval@dataset3
as well as the corresponding downstream stages (maybe somehow mark/log them as not considered). Nevertheless, dvc repro should be clean afterwards unless the file is added later on. Basically, we'd need a separation between "dependency does not exist" and "dependency exists, but is just not pulled to our local machine because we do not need it right now"; the latter seems to be more the purpose of --allow-missing.
If the former is not feasible (as you'd need to check whether some file is actually referenced somewhere and should therefore exist at least on the remote), another option would be to allow subtractive for-loops/variables, like "foreach $datasets except dataset4".
Currently, the ways to mimic this are either to create empty files and handle this case within our data handling package/script/program all the way down, or to define a separate variable datasets_for_eval which just consists of the first three datasets. Neither is very elegant in my eyes.
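For reference, the datasets_for_eval workaround could be generated rather than maintained by hand. This is only a rough sketch under assumptions that are not part of DVC: the data/<dataset>/labels.yaml layout, the function names, and the eval_datasets.json file name are all hypothetical.

```python
import json
from pathlib import Path


def labelled_datasets(data_root, datasets, label_name="labels.yaml"):
    """Return the datasets that have a label file, preserving input order.

    Assumes a hypothetical layout of <data_root>/<dataset>/<label_name>.
    """
    root = Path(data_root)
    return [d for d in datasets if (root / d / label_name).is_file()]


def write_eval_datasets(data_root, datasets, out_file="eval_datasets.json"):
    """Write the filtered list to out_file under the key 'datasets_for_eval'.

    The file name and key are illustrative; nothing here is a DVC API.
    """
    selected = labelled_datasets(data_root, datasets)
    payload = {"datasets_for_eval": selected}
    Path(out_file).write_text(json.dumps(payload, indent=2) + "\n")
    return selected
```

The eval stage would then loop over datasets_for_eval instead of datasets, assuming dvc.yaml lists the generated file under vars (worth checking that your DVC version accepts JSON there). It still has to be run before dvc repro, so it does not remove the extra step; it only avoids editing the second list manually.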