Granular pipeline dependency status
Feature Request
When I run dvc repro, DVC detects which dependencies have changed and therefore which stages need to be reproduced. I would like to access the granular changes in all of a stage's dependencies since that stage was last reproduced.
Example usage
Changes in the dependencies of stage preprocess (this could be a new command or an addition to "dvc status" or "dvc data status"):
$ dvc stage status preprocess --granular --json
{
  "new": [
    "path/to/new/dependency/file",
    ...
  ],
  "modified": [
    "path/to/modified/dependency/file",
    ...
  ],
  "deleted": [
    "path/to/deleted/dependency/file",
    ...
  ]
}
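To make the three categories concrete, here is a minimal sketch of how such a report could be derived by comparing two snapshots of a stage's dependencies. It is not how DVC works internally. The function name `granular_status` and the `{path: content_hash}` snapshot format are assumptions made for illustration only.

```python
import json


def granular_status(previous, current):
    """Classify dependency paths by comparing two {path: content_hash} snapshots.

    Hypothetical helper: `previous` is the snapshot taken when the stage was
    last reproduced, `current` is the snapshot of the workspace now.
    """
    new = sorted(p for p in current if p not in previous)
    deleted = sorted(p for p in previous if p not in current)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return {"new": new, "modified": modified, "deleted": deleted}


if __name__ == "__main__":
    # Made-up hashes, only to exercise the three categories.
    previous = {"data/raw/sample_001.jpeg": "aaa", "data/raw/sample_002.jpeg": "bbb"}
    current = {"data/raw/sample_001.jpeg": "ccc", "data/raw/sample_003.jpeg": "ddd"}
    print(json.dumps(granular_status(previous, current), indent=2))
```

Unchanged files appear in none of the three lists, which keeps the report small when only a few samples change.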
Motivation
This feature would be very useful for pipelines which process many independent samples and take a long time to run.
Imagine the following simple data setup where samples get preprocessed and stored in a new folder.
data
├── raw
│   ├── sample_001.jpeg
│   └── sample_002.jpeg
└── preprocessed
    ├── sample_001.jpeg
    └── sample_002.jpeg
And the corresponding simple pipeline.
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - preprocess.py
      - data/raw/
    outs:
      - data/preprocessed:
          persist: true
With this feature the pipeline stage code could check which samples have changed (new/modified/deleted) and only process those. It could also detect that the code has changed and reprocess all samples.
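As a rough sketch of what the stage code could do with such a report: the function below takes the proposed JSON structure (already parsed into a dict) and decides which raw samples to process and which to clean up. The requested command does not exist yet, so the input format is an assumption, as are the names `plan_work`, `RAW_DIR` and `CODE_DEPS`.

```python
RAW_DIR = "data/raw/"            # data dependency from the pipeline above
CODE_DEPS = {"preprocess.py"}    # code dependency from the pipeline above


def plan_work(status, all_raw_samples):
    """Return (samples_to_process, raw_samples_deleted) from a granular status.

    `status` is assumed to follow the proposed format with the keys
    "new", "modified" and "deleted", each holding a list of paths.
    """
    changed = set(status.get("new", [])) | set(status.get("modified", []))
    deleted = sorted(p for p in status.get("deleted", []) if p.startswith(RAW_DIR))

    # If the code itself changed, every sample must be reprocessed.
    if changed & CODE_DEPS:
        return sorted(all_raw_samples), deleted

    # Otherwise only new or modified raw samples need work.
    to_process = sorted(p for p in changed if p.startswith(RAW_DIR))
    return to_process, deleted
```

The second return value lists raw samples that no longer exist, so the stage can remove their counterparts from data/preprocessed, which persist between runs because of persist: true.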
This would save me a lot of time since we have a long and slow pipeline where the raw data gets updated quite often.
Link to extended Discord discussion: https://discord.com/channels/485586884165107732/1093361005754585109
Link to another discussion of the same problem: https://github.com/iterative/dvc/discussions/5917
Thanks @johan-sightic!
I think this will be part of #5369. Closing in favor of that.
Hey @skshetry, I think it's possible they are part of the same command, but I don't want to conflate the two since the use cases are pretty different. For example, the changes @daavoo has been working on in dvc repro may be enough that a new status command isn't that important for #5369. I also don't see anything in #5369 asking about granularity.
Is there any progress on this or any alternative solutions?
The latest is what was discussed in #10042. Nothing further at the moment unfortunately.