New command: dvc verify - check that the pipeline is up to date without having to pull or run it
Bringing this over from Discord.
What I'd like is a command to use in a CI/CD process to check that the DVC pipeline is in a valid state. By "valid", I mean that:
- All deps of all stages are either in the remote cache or match what's in the workspace, using the hashes in dvc.lock
- All outs of all stages are either in the remote cache or match what's in the workspace, using the hashes in dvc.lock
- If any stage has an out that is the dep of any other stage, they have the same hash in dvc.lock
- All params match what's in dvc.lock
Essentially, this is the same as running dvc pull, then dvc repro --dry, and checking that "Data and pipelines are up to date", except I don't want to have to run dvc pull. This is especially important if you're working with large datasets, as pulling them every time on a CI machine could be quite costly in time and/or actual dollar bills :moneybag:
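For reference, the pull-then-dry-run check described above could be sketched roughly like this. The helper name `pipeline_up_to_date` is made up for illustration, and matching on the message text assumes DVC prints it to stdout and keeps the wording unchanged.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the text on stdin contains the
# "up to date" message quoted above.
pipeline_up_to_date() {
    grep -q "Data and pipelines are up to date"
}

# Intended CI usage (needs dvc and a configured remote, so it is left commented):
#   dvc pull && dvc repro --dry | pipeline_up_to_date
```

The expensive part is the dvc pull in the usage line, which is exactly the step this request wants to drop.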
I played around a bit to see if there's a workaround with only the current functionality. Here's what I found.
dvc status by itself will tell you if any non-cached deps (e.g. source code) don't match dvc.lock. That'll look like this:
    train:
        changed deps:
            deleted:    data/processed/data_train.npz
            deleted:    data/processed/data_val.npz
            modified:   src/train    <-- This one
dvc status -c will tell you if any outputs of any stages listed in dvc.lock aren't in remote storage. That'll look like this:
    missing:    data/processed/data_test.npz    <-- This one
    deleted:    data/processed/data_train.npz
    deleted:    data/processed/data_val.npz
dvc params diff will obviously catch param changes. I don't think anything covers point 3 above, and even if I stitched these all together, it would be very brittle as it relies on the outputs of all these commands not changing.
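To make the brittleness concrete, here is a minimal sketch of the kind of output filter such a stitched-together check would need. The function name `status_is_clean` is hypothetical, and the rule it applies (flag "modified:" and "missing:" lines, ignore "deleted:" lines) is only inferred from the two sample outputs above, not from any documented DVC contract.

```shell
#!/bin/sh
# Hypothetical filter: reads `dvc status` or `dvc status -c` output on stdin.
# Fails if any entry is reported as "modified:" or "missing:".
# "deleted:" entries are ignored, since they are expected when data
# has not been pulled into the workspace.
status_is_clean() {
    ! grep -Eq '^[[:space:]]*(modified|missing):'
}

# Intended usage (needs dvc, so it is left commented):
#   dvc status | status_is_clean && dvc status -c | status_is_clean
```

Any change to how DVC labels or lays out these lines would silently break the filter, which is the concern raised above.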
As always, I'm happy to contribute the change if you all think it would be valuable. I know I would use it right away in several projects. Or let me know if I'm simply overlooking some existing functionality that would serve the same purpose.
Hi @sjawhar !
As always, great suggestion! I think this feature has been requested before in different shapes and forms (can't pinpoint a particular ticket right now, sorry) and is generally related to dvc being able to operate in a "virtual environment" with remote and run-cache in mind, so that missing local cache is not a problem as long as stuff is available on remote. I think the functionality you've requested still makes sense for dvc status (maybe as a flag) instead of a separate verify command, unless that new separate command is more tailored to the pipelines (we've had this idea of something like dvc pipeline/stage/e.g. status for a long time).
Maybe this might fit into dvc exp better these days? :thinking: (CC @dberenbaum @pmrowla) E.g. there will be remote executors and stuff, so maybe this global state will somehow fit better into that paradigm. This is just a random thought though, no more than a rant, so never mind.
@efiop Although I haven't used status extensively, I agree that it intuitively makes sense as a flag for that command. I also agree with your final analysis that it probably doesn't make sense as part of dvc exp :smile:
+1 for dvc verify! It is very useful for CI tests that prevent you from going forward with a stale pipeline.
+1 I think this would be a very nice feature for CI/CD tests that confirm your branch is ready to merge. I opened https://github.com/iterative/cml/issues/681 because I thought it might be a cml feature, but actually I think this would fit the bill.
Hi, I consider this feature very useful in my workflow as well. For the time being, if there is a workaround script available, would anyone share it?
Any updates on this feature? I would find it quite useful for my CI pipeline process.
tl;dr It's on our roadmap but not close enough to give an estimate for it yet.
A lot of the functionality today is part of dvc status. Our plan is to separate the data management aspects of dvc status from this pipeline-focused dvc verify functionality (like dvc data status and dvc stage status; actual command names TBD) and make each more robust and useful on their own. We are currently focused on addressing gaps in data management generally, including updating dvc status, and those improvements should be coming soon. Once that's complete, we should have more capacity to focus on pipeline improvements, and this feature request will be a high priority.
hello, do you have any updates on it? :)
(it seems like the 2nd most upvoted issue 👀)
Sorry, I still don't have an estimate. It remains on our backlog because we have been focusing on overhauling a lot of the data management and haven't been able to put focus on pipelines. To that effect, we at least have dvc data status now, so there is a clear path to starting on dvc stage status once we have the capacity for it.
Hey, any news on it? Is there any way we can help?
It would be highly beneficial to have such a feature for our CI pipelines.
Oh, this would be immensely useful!
Sorry for the repeated delays here. We are finally planning this for the quarter.
Thanks to @sjawhar for opening this and listing clear requirements.
What I'd like is a command to use in a CI/CD process to check that the DVC pipeline is in a valid state. By "valid", I mean that:
1. All deps of all stages are either in the remote cache or match what's in the workspace, using the hashes in dvc.lock
2. All outs of all stages are either in the remote cache or match what's in the workspace, using the hashes in dvc.lock
3. If any stage has an out that is the dep of any other stage, they have the same hash in dvc.lock
4. All params match what's in dvc.lock
Some thoughts on these:
3. If any stage has an out that is the dep of any other stage, they have the same hash in dvc.lock
Maybe I'm misunderstanding, but I think this is covered already by dvc status, since otherwise it should report that either the dependency or output has changed, depending on which matches what's in your workspace. Since another requirement is that the data may be missing from the workspace, we can report on what would happen if you did dvc pull (it would pull the output version of the data).
4. All params match what's in dvc.lock
Again, this should already be reported by dvc status unless I'm misunderstanding.
In other words, taking a closer look here, the issue is covered by dvc status except that the command needs to work without pulling all the data locally, which we are finally ready to do.
Since dvc status is already overloaded and we now have dvc data status, let's stick with the plan to put this in a new dvc stage status command that only reports on the status of the pipeline and has some option (--remote, --allow-missing, or something similar) to check missing outputs against what's in the remote. The command shouldn't need any of the other features of dvc status like comparing against other commits.
Some nuance on point 3
Maybe I'm misunderstanding, but I think this is covered already by dvc status, since otherwise it should report that either the dependency or output has changed, depending on which matches what's in your workspace. Since another requirement is that the data may be missing from the workspace, we can report on what would happen if you did dvc pull (it would pull the output version of the data).
I don't think that quite covers it. The out of one stage might exist in the remote, as might the dep of the next stage in the DAG, but they don't agree with each other. So both pass a check that only asks whether they exist in the remote, yet they aren't the same artifact. We would indeed need to do that collision detection that you hint at ("if you did dvc pull").
As a user, I want to be able to check whether, e.g., my colleague did in fact run dvc push. Currently I would need to dvc pull, which takes a long time, and then check with dvc status that the remote hashes and the pulled ones in dvc.lock are the same. So the idea of verify (as I understand it, and agree with) is to skip the dvc pull step, as it takes a long time.
And in a CI pipeline, dvc pull is an unwanted necessity, as the data is not kept anyway. So one would "just" run verify once to check the hashes in the current dvc.lock against the remote: whether those hashes exist there and could be pulled, or in other words, were pushed.
So the idea of verify (as I understand it, and agree with) is to skip the dvc pull step, as it takes a long time.
Yup, sorry if the previous message wasn't clear. tldr we need to do a status check as if we had pulled the data but without actually pulling it. I think this also addresses the point from @sjawhar, but let me know if I'm missing something.
We haven't documented the behavior yet, but with the latest release, you should be able to use the --allow-missing flag from #9437 (thanks @daavoo) to cover most of this functionality.
For example, run dvc [exp run/repro] --allow-missing [--dry] and it will skip any stages that have missing data but are otherwise unchanged. If your pipeline is up to date other than having to pull, the command will succeed without running any of your stages. Otherwise, it will fail.
The only part I see missing from the requests above is checking whether all the data exists on the remote. This can already be achieved with dvc data status --json, but maybe we can think of ways to make this simpler since you would have to parse the output.
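A rough sketch of how the two checks could be combined in a CI step follows. The wrapper name `verify_pipeline` and its return codes are hypothetical. It uses the plain-text output of dvc data status --not-in-remote rather than --json to avoid a JSON parser. It also assumes that a stage which would run is announced with a "Running stage" line and that missing data is listed under a "Not in remote" heading; neither string is guaranteed by anything in this thread.

```shell
#!/bin/sh
# Hypothetical wrapper: 0 = up to date, 1 = stale pipeline, 2 = data not pushed.
verify_pipeline() {
    # Dry-run the pipeline, tolerating data that has not been pulled.
    repro_out=$(dvc repro --allow-missing --dry 2>&1) || return 1
    # Assumption: a stage that would be executed prints "Running stage".
    if printf '%s\n' "$repro_out" | grep -q "Running stage"; then
        return 1
    fi
    # Assumption: data absent from the remote is listed under "Not in remote".
    if dvc data status --not-in-remote | grep -q "Not in remote"; then
        return 2
    fi
    return 0
}

# Intended CI usage (needs dvc and a configured remote):
#   verify_pipeline || exit 1
```

Checking both the exit status and the output of the dry run is deliberately defensive, since relying on either alone may not be enough across DVC versions.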
Hi @dberenbaum, I just tried to reproduce what you said.
dvc repro --allow-missing --dry allows me to check whether Git-tracked objects are up to date in the dvc.lock.
I'm trying to check whether all objects referenced in the dvc.lock are in the remote cache, but when I try your command, dvc data status --json --not-in-remote, I do not see any difference in the output between when the data is in the remote and when it is not. So I do not understand how these 2 commands can cover the dvc verify functionality.
EDIT: It comes from the bug referenced in #9541
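My dvc verify dirty code:
    # Latest release, for dvc repro --allow-missing
    pip install -q 'dvc[s3]'
    dvc repro --allow-missing -q
    # Switch to 2.57 for the remote check (see the bug in #9541)
    pip install -q 'dvc[s3]==2.57'
    # Fail the job if any tracked data is missing from the remote
    if dvc data status --not-in-remote | grep "Not in remote"
    then
        exit 1
    fi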
EDIT: Works with 3.0.0
    pip install -q 'dvc[s3]==3.0.0'
    dvc repro --allow-missing -q
    # Fail the job if any tracked data is missing from the remote
    if dvc data status --not-in-remote | grep "Not in remote"
    then
        exit 1
    fi