`dvc repro --dry` does not work with run-cache
Bug Report
Description
I was looking for a way to determine whether a stage is cached and would just be loaded, or would actually be executed. To test this, I created a repository and initialized a very simple stage.
dvc stage add -n write_file -o file.txt 'echo Hello World > file.txt'
Then I cloned the repo and shared the cache:
dvc cache dir <new> <old>/.dvc/cache
# just using dvc cache <new> <old> could be a feature request here
Then in the original repository I ran dvc repro to cache the results.
In the new one I tried dvc status, which did not tell me whether the result was available in the cache.
So I tried dvc repro --dry.
This gave me:
Running stage 'write_file':
> echo Hello World > file.txt
Use `dvc push` to send your updates to remote storage.
whereas dvc repro produced different output:
Stage 'write_file' is cached - skipping run, checking out outputs
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add .gitignore dvc.lock
To enable auto staging, run:
dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
Is it possible to check if a stage would be loaded or executed without actually executing the stage?
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 2.18.1 (pip)
---------------------------------
Platform: Python 3.10.4 on Windows-10-10.0.19044-SP0
Supports:
http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
webhdfs (fsspec = 2022.7.1)
Cache types: hardlink
Cache directory: NTFS on D:\
Caches: local
Remotes: None
Workspace directory: NTFS on D:\
Repo: dvc, git
Seems like a bug in --dry.
Reproduction script:
#!/bin/bash
set -exu
pushd $TMPDIR
wsp=test_wspace
rep=test_repo
rm -rf $wsp && mkdir $wsp && pushd $wsp
main=$(pwd)
mkdir $rep && pushd $rep
orig=$(pwd)
git init
dvc init
echo data >> data
dvc add data
dvc run -d data -o out -n train "cp data out"
git add -A
git commit -am "initial"
popd
git clone test_repo new_repo
pushd new_repo
dvc cache dir $orig/.dvc/cache
dvc checkout data.dvc
dvc repro --dry
dvc repro
Last steps result:
+ dvc repro --dry
'data.dvc' didn't change, skipping
Running stage 'train':
> cp data out
Use `dvc push` to send your updates to remote storage.
+ dvc repro
'data.dvc' didn't change, skipping
Stage 'train' is cached - skipping run, checking out outputs
Use `dvc push` to send your updates to remote storage.
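The two runs above differ only in the per-stage message, so a script can tell a cached stage from an executed one by matching those lines. A minimal sketch, assuming the message wording stays exactly as shown in this report; `classify_repro_log` is a hypothetical helper, not part of DVC. It is only meaningful on the output of a real `dvc repro`, since `--dry` reports "Running stage" even when the run-cache has the result.

```shell
#!/bin/bash
# Hypothetical helper (not a DVC command): reads `dvc repro` output on stdin
# and prints "<stage> cached" or "<stage> executed" for each stage.
# Assumption: the log wording matches the messages quoted in this report
# and may differ in other DVC versions.
classify_repro_log() {
    sed -n \
        -e "s/^Stage '\(.*\)' is cached - skipping run.*/\1 cached/p" \
        -e "s/^Running stage '\(.*\)':.*/\1 executed/p"
}
```

It could be used as `dvc repro 2>&1 | classify_repro_log`. Note that this still performs the reproduction, so it reports what happened rather than predicting it.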
We should inform the user when a stage does not have to be reproduced.
This is not a bug. We cannot use the stage cache to know statically whether a stage will be re-run; we only find out when we try to run that particular stage. Also, restoring from the run-cache involves checking out outputs, removing outputs, pulling outputs, etc., so --dry is not the correct flag to use here.
@skshetry maybe "bug" is too strong, but I guess it could be a feature request?
From the user's point of view, it doesn't matter whether we run repro or repro --dry.