Granular pipeline dependency status

Open johan-sightic opened this issue 2 years ago • 5 comments

Feature Request

When I run dvc repro, DVC detects which dependencies have changed and therefore which stages need to be reproduced. I would like to access the granular changes to all the dependencies of a stage since it was last reproduced.

Example usage

Change in dependencies for stage preprocess:

# Proposed: a new command, or an addition to "dvc status" or "dvc data status"
$ dvc stage status preprocess --granular --json
{
    "new": [
        "path/to/new/dependency/file",
        ...
    ],
    "modified": [
        "path/to/modified/dependency/file",
        ...
    ],
    "deleted": [
        "path/to/deleted/dependency/file",
        ...
    ]
}

Motivation

This feature would be very useful for pipelines which process many independent samples and take a long time to run.

Imagine the following simple data setup where samples get preprocessed and stored in a new folder.

data
├── raw
│   ├── sample_001.jpeg
│   └── sample_002.jpeg
└── preprocessed
    ├── sample_001.jpeg
    └── sample_002.jpeg

And the corresponding simple pipeline.

stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - preprocess.py
      - data/raw/
    outs:
      - data/preprocessed:
          persist: true

With this feature the pipeline stage code could check which samples have changed (new/modified/deleted) and only process those. It could also detect that the code has changed and reprocess all samples.
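To make the idea concrete, here is a minimal sketch of how preprocess.py might consume such a report. It assumes the JSON shape proposed above ("new", "modified", "deleted" lists of paths); the command that would produce it does not exist today, and `plan_work` and its parameters are hypothetical names used only for illustration.

```python
from pathlib import PurePosixPath


def plan_work(status, all_raw_samples, code_path="preprocess.py", raw_dir="data/raw"):
    """Decide which raw samples to (re)process and which to clean up.

    `status` is assumed to be the parsed JSON from the proposed (not yet
    existing) granular status command. Returns (to_process, to_remove),
    both sorted lists of raw sample paths.
    """
    new_or_modified = set(status.get("new", [])) | set(status.get("modified", []))
    deleted = set(status.get("deleted", []))

    def is_raw(path):
        # True if `path` lives somewhere under the raw data directory.
        return PurePosixPath(raw_dir) in PurePosixPath(path).parents

    if code_path in new_or_modified:
        # The stage code itself changed, so every sample must be reprocessed.
        to_process = sorted(all_raw_samples)
    else:
        # Only the samples that were added or modified need processing.
        to_process = sorted(p for p in new_or_modified if is_raw(p))

    # Outputs of deleted raw samples should be removed by the caller.
    to_remove = sorted(p for p in deleted if is_raw(p))
    return to_process, to_remove


if __name__ == "__main__":
    example_status = {
        "new": ["data/raw/sample_003.jpeg"],
        "modified": ["data/raw/sample_001.jpeg"],
        "deleted": ["data/raw/sample_002.jpeg"],
    }
    raw_now = ["data/raw/sample_001.jpeg", "data/raw/sample_003.jpeg"]
    print(plan_work(example_status, raw_now))
```

The caller would still need to map each returned raw path to its counterpart under data/preprocessed. Because that output is declared with persist: true, untouched results from earlier runs stay in place.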

This would save me a lot of time since we have a long and slow pipeline where the raw data gets updated quite often.

Link to extended Discord discussion: https://discord.com/channels/485586884165107732/1093361005754585109

Link to another discussion of the same problem: https://github.com/iterative/dvc/discussions/5917

johan-sightic avatar May 10 '23 06:05 johan-sightic

Thanks @johan-sightic!

dberenbaum avatar May 10 '23 17:05 dberenbaum

I think this will be part of #5369. Closing in favor of that.

skshetry avatar May 11 '23 03:05 skshetry

Hey @skshetry, I think it's possible they are part of the same command, but I don't want to conflate the two since the use cases are pretty different. For example, the changes @daavoo has been working on in dvc repro may be enough that a new status command isn't that important for #5369. I also don't see anything in #5369 asking about granularity.

dberenbaum avatar May 11 '23 12:05 dberenbaum

Is there any progress on this or any alternative solutions?

johan-sightic avatar Jan 03 '24 08:01 johan-sightic

The latest is what was discussed in #10042. Nothing further at the moment unfortunately.

dberenbaum avatar Jan 04 '24 17:01 dberenbaum