`dvc repro --dry` does not work with run-cache

Open PythonFZ opened this issue 3 years ago • 3 comments

Bug Report

Description

I was looking for a way to determine whether a stage is cached and would just be loaded, or would actually be executed. To test this, I created a repository and initialized a very simple stage.

dvc stage add -n write_file -o file.txt 'echo Hello World > file.txt'

Then I cloned the repo and shared the cache:

dvc cache dir <new> <old>/.dvc/cache
# just using dvc cache <new> <old> could be a feature request here

Then, in the original repository, I ran `dvc repro` to cache the results. In the new one, I tried `dvc status`, which did not tell me whether the result was available in the cache. So I tried `dvc repro --dry`, which gave me:

Running stage 'write_file':
> echo Hello World > file.txt
Use `dvc push` to send your updates to remote storage.

whereas `dvc repro` produced different output:

Stage 'write_file' is cached - skipping run, checking out outputs
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

        git add .gitignore dvc.lock

To enable auto staging, run:

        dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.

Is it possible to check if a stage would be loaded or executed without actually executing the stage?

Environment information

Output of dvc doctor:

$ dvc doctor

DVC version: 2.18.1 (pip)
---------------------------------
Platform: Python 3.10.4 on Windows-10-10.0.19044-SP0
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        webhdfs (fsspec = 2022.7.1)
Cache types: hardlink
Cache directory: NTFS on D:\
Caches: local
Remotes: None
Workspace directory: NTFS on D:\
Repo: dvc, git

PythonFZ, Sep 21 '22 11:09

Seems like a bug in `--dry`.

Reproduction script:

#!/bin/bash

set -exu
pushd $TMPDIR

wsp=test_wspace
rep=test_repo

rm -rf $wsp && mkdir $wsp && pushd $wsp
main=$(pwd)

mkdir $rep && pushd $rep

orig=$(pwd)

git init
dvc init

echo data >> data

dvc add data
dvc run -d data -o out -n train "cp data out"

git add -A
git commit -am "initial"

popd

git clone test_repo new_repo
pushd new_repo

dvc cache dir $orig/.dvc/cache

dvc checkout data.dvc

dvc repro --dry

dvc repro

The last two steps produce:

+ dvc repro --dry
'data.dvc' didn't change, skipping                                    
Running stage 'train':
> cp data out
Use `dvc push` to send your updates to remote storage.
+ dvc repro
'data.dvc' didn't change, skipping                                    
Stage 'train' is cached - skipping run, checking out outputs
Use `dvc push` to send your updates to remote storage.       

We should inform the user when a stage doesn't have to be reproduced.

pared, Sep 21 '22 13:09

This is not a bug. We cannot use the stage cache to know whether a stage will be re-run; we only know that when we try to run that particular stage. Statically, we may not know. Also, using the run-cache involves checking out outputs, removing outputs, pulling outputs, etc., so `--dry` is not the correct flag to use here.
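Since the outcome is only known once the stage is actually attempted, one workaround is to inspect the output of a real `dvc repro` afterwards. The sketch below is only an illustration: `classify_stage` is a hypothetical helper, not part of DVC, and it assumes the two log messages quoted earlier in this thread keep their exact wording, which is not guaranteed across DVC versions.

```shell
#!/bin/bash
# Hypothetical helper (not part of DVC): reads `dvc repro` log text on stdin
# and reports what happened to the named stage, based solely on the two
# messages quoted in this thread. The message wording is an assumption.
classify_stage() {
    local stage="$1" log
    log=$(cat)
    if grep -qF "Stage '${stage}' is cached - skipping run" <<<"$log"; then
        # Outputs were checked out from the run-cache; the command did not run.
        echo "restored-from-run-cache"
    elif grep -qF "Running stage '${stage}':" <<<"$log"; then
        # The stage command was executed.
        echo "executed"
    else
        echo "unknown"
    fi
}
```

It could be used as `dvc repro 2>&1 | classify_stage train`. Note that this is not a dry check: if the stage is not in the run-cache, `dvc repro` will execute it.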

skshetry, Sep 21 '22 14:09

@skshetry maybe "bug" is too strong, but I guess it could be a feature request? From the user's point of view, it shouldn't matter whether we run with or without `--dry`.

pared, Sep 21 '22 14:09