dvc feat: ignore files not in remote when push is false

Fixes #10317

Adds an opt-in flag that removes stage outputs with push: false from the not_in_remote results of data status.

Notable changes:

Add outs_no_push to the dvc.stage.utils.fill_stage_outputs keys, to make it easier to create outputs with push: false.
In status, when the flag is enabled, go through the files reported as not_in_remote and remove those that are not can_push.
Add a corresponding --respect-no-push flag to the CLI.
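
A rough sketch of the filtering idea described above. This is not the code from the PR: the Output class and the filter_not_in_remote helper are hypothetical stand-ins, and only the can_push attribute name and the opt-in behaviour are taken from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    """Hypothetical stand-in for a stage output; not DVC's real class."""

    path: str
    can_push: bool = True


def filter_not_in_remote(not_in_remote, outputs, respect_no_push=False):
    """Drop paths whose output has push disabled, but only when opted in."""
    if not respect_no_push:
        # Default behaviour: report everything that is missing from the remote.
        return list(not_in_remote)
    no_push = {out.path for out in outputs if not out.can_push}
    return [path for path in not_in_remote if path not in no_push]


outputs = [Output("foo.txt"), Output("bar.txt", can_push=False)]
reported = ["foo.txt", "bar.txt"]

default_result = filter_not_in_remote(reported, outputs)
filtered_result = filter_not_in_remote(reported, outputs, respect_no_push=True)
print(default_result, filtered_result)
```

Keeping the filter behind a flag means existing users of data status see no change unless they ask for it.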

Open to suggestions on how to make the flag names more intuitive!

Corresponding PR for the docs: https://github.com/iterative/dvc.org/pull/5373

[x] ❗ I have followed the Contributing to DVC checklist.
[x] 📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

May 22 '25 11:05 Northo

Codecov Report

Attention: Patch coverage is 88.57143% with 4 lines in your changes missing coverage. Please review.

Project coverage is 91.06%. Comparing base (2431ec6) to head (b6b18ef). Report is 68 commits behind head on main.

Files with missing lines	Patch %	Lines
dvc/repo/data.py	78.94%	3 Missing and 1 partial :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #10749      +/-   ##
==========================================
+ Coverage   90.68%   91.06%   +0.38%     
==========================================
  Files         504      504              
  Lines       39795    40040     +245     
  Branches     3141     3164      +23     
==========================================
+ Hits        36087    36462     +375     
+ Misses       3042     2950      -92     
+ Partials      666      628      -38

May 22 '25 11:05 codecov[bot]

@skshetry, have you had time to look at this? This feature would be really great for our team!

Jun 05 '25 09:06 Northo

@skshetry, thanks for the thorough review, and sorry for the very late reply from me.

While reading up on your suggestions, especially the change from diffing index against workspace to inspecting the index directly, I realized I am somewhat confused about the intended/expected behavior of dvc data status.

What is the difference between dvc data status and dvc status? In the docs, it says:

dvc status

Show changes in the project pipelines, as well as file mismatches either between the cache and workspace, or between the cache and remote storage. For the status of tracked data, see dvc data status (similar to git status).
dvc data status

Show changes to the files and directories tracked by DVC in the workspace. For the status of data pipelines, see dvc status.

I first thought this meant there was a difference between files generated with pipelines (dvc.yaml) and directly tracked (<filename>.dvc). However, after some investigation, I believe I have misunderstood.

I made a simple example to investigate: foo.txt is added directly, and bar.txt is created by a pipeline. I then ran dvc status -c, dvc status, and dvc data status --not-in-remote at different states. The results are below.

❯ source ./run_demo.sh
==================================
## Before repro
==================================
>>> dvc status -c --json
{
  "bar.txt": "missing",
  "foo.txt": "missing"
}

>>> dvc status --json
{
  "create-bar": [
    {
      "changed outs": {
        "bar.txt": "not in cache"
      }
    }
  ],
  "foo.txt.dvc": [
    {
      "changed outs": {
        "foo.txt": "not in cache"
      }
    }
  ]
}

>>> dvc data status --not-in-remote --json
{
  "not_in_cache": [
    "foo.txt",
    "bar.txt"
  ],
  "not_in_remote": [
    "foo.txt",
    "bar.txt"
  ],
  "committed": {
    "not_in_remote": [
      "foo.txt",
      "bar.txt"
    ]
  }
}


==================================
## Running repro
==================================
>>> dvc repro
Running stage 'create-bar':
> echo bar > bar.txt
Use `dvc push` to send your updates to remote storage.


==================================
## After repro
==================================
>>> dvc status -c --json
{
  "bar.txt": "new",
  "foo.txt": "missing"
}

>>> dvc status --json
{
  "foo.txt.dvc": [
    {
      "changed outs": {
        "foo.txt": "not in cache"
      }
    }
  ]
}

>>> dvc data status --not-in-remote --json
{
  "not_in_cache": [
    "foo.txt"
  ],
  "not_in_remote": [
    "foo.txt",
    "bar.txt"
  ],
  "committed": {
    "not_in_remote": [
      "foo.txt",
      "bar.txt"
    ]
  }
}


==================================
## Add and commit foo
==================================
echo foo > foo.txt
>>> dvc commit foo.txt


==================================
## After commit
==================================
>>> dvc status -c --json
{
  "bar.txt": "new",
  "foo.txt": "new"
}

>>> dvc status --json
{}

>>> dvc data status --not-in-remote --json
{
  "not_in_remote": [
    "bar.txt",
    "foo.txt"
  ],
  "committed": {
    "not_in_remote": [
      "bar.txt",
      "foo.txt"
    ]
  }
}


==================================
## Running push
==================================
>>> dvc push
Collecting                                                                                                                                                    |2.00 [00:00,  521entry/s]
Pushing
2 files pushed


==================================
## After push
==================================
>>> dvc status -c --json
{}

>>> dvc status --json
{}

>>> dvc data status --not-in-remote --json
{
  "committed": {
    "not_in_remote": [
      "bar.txt",
      "foo.txt"
    ]
  }
}

The commands do seem to report the same information, just structured slightly differently. Is there a subtle difference here that I am missing, or do they have overlapping functionality? Thanks for any help in clarifying this.
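
The overlap can be checked mechanically on the "After commit" output above. The snippet below is only a sketch: the JSON is copied from that step, the variable names are made up, and it assumes that "new" in dvc status -c marks a file that is in the cache but not in the remote.

```python
import json

# Output copied from the "After commit" step of the demo.
status_cloud = json.loads('{"bar.txt": "new", "foo.txt": "new"}')
data_status = json.loads(
    '{"not_in_remote": ["bar.txt", "foo.txt"],'
    ' "committed": {"not_in_remote": ["bar.txt", "foo.txt"]}}'
)

# Assumption: "new" means the file is cached locally but absent from the remote.
new_in_cache = {path for path, state in status_cloud.items() if state == "new"}
not_in_remote = set(data_status.get("not_in_remote", []))

same_files = new_in_cache == not_in_remote
print(same_files)
```

For this state both commands name the same two files, which is what prompted the question.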

Also: the committed.not_in_remote entries seem a bit strange?

Repro of example

>>> .dvc/.gitignore
/config.local
/tmp
/cache
>>> .dvc/config
[core]
    remote = localremote
['remote "localremote"']
    url = ../localremote
>>> .dvcignore
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore
>>> .gitignore
/bar.txt
/foo.txt
localremote
>>> dvc.lock
schema: '2.0'
stages:
  create-bar:
    cmd: echo bar > bar.txt
    outs:
    - path: bar.txt
      hash: md5
      md5: c157a79031e1c40f85931829bc5fc552
      size: 4
>>> dvc.yaml
stages:
  create-bar:
    cmd: echo bar > bar.txt
    outs:
    - bar.txt
>>> foo.txt.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 4
  hash: md5
  path: foo.txt
>>> run_demo.sh
#!/bin/bash

## Clean
rm -rf localremote .dvc/cache .dvc/tmp

run() {
    echo ">>> $@"
    "$@"
    echo
}

status() {
  run dvc status -c --json
  run dvc status --json
  run dvc data status --not-in-remote --json
}

echo "=================================="
echo "## Before repro"
echo "=================================="
status

echo
echo "=================================="
echo "## Running repro"
echo "=================================="
run dvc repro

echo
echo "=================================="
echo "## After repro"
echo "=================================="
status


echo
echo "=================================="
echo "## Add and commit foo"
echo "=================================="
echo "echo foo > foo.txt"
echo "foo" > foo.txt
run dvc commit foo.txt

echo
echo "=================================="
echo "## After commit"
echo "=================================="
status

echo
echo "=================================="
echo "## Running push"
echo "=================================="
run dvc push


echo
echo "=================================="
echo "## After push"
echo "=================================="
status

Jun 18 '25 08:06 Northo

What is the difference between dvc data status and dvc status? In the docs, it says:

The dvc status command shows the state of your pipelines by detecting changes in tracked outputs, dependencies, and the commands. However, its scope is limited, and it can only indicate whether the tracked dependency/output/command has changed or not. It does not show you how the data changed. For example, there is no way to see granular changes within a tracked directory, which was an often requested feature:

https://github.com/iterative/dvc/issues/2180

dvc data status supports showing granular changes with --granular (ideally this should be the default if we fix performance issues with it). It is also custom-built as a data(set) management command, to show you the current state of your tracked datasets, based on users' feedback asking for a tool to understand the state of tracked data.

The data from outputs are still "data" tracked by DVC, so they are shown by default. (Unlike dvc status, it ignores dependencies unless they are also part of an output somewhere in the pipeline.) If there's a demand for filtering those out, dvc data status would support it, but dvc status is unlikely to support that.

dvc data status command is focused on data, while dvc status is focused on pipelines.

dvc data status also powers the file-tree view and decorations in the "DVC Extension for VSCode". So some requirements also came from there.

Jun 18 '25 09:06 skshetry

If you are using dvc for data management, use data status. If you are using it to check changes to your pipelines, use dvc status. data status is a new command, so any new features related to data/data-management are likely going to be implemented there rather than in status.

Jun 18 '25 09:06 skshetry

I see, thank you, that helps. I just got a bit confused by the not-in-remote being computed inside the _diff. I'll respond directly in the review comments for specific questions. I'll try to have a revised PR ready by the end of the week.

Jun 18 '25 14:06 Northo

I just got a bit confused by the not-in-remote being computed inside the _diff.

Note that _diff(...) here also returns "unchanged" items. That's because we set with_unchanged=True.

https://github.com/iterative/dvc/blob/c7c7ba69fbe093b0889228119964f357653e6973/dvc/repo/data.py#L70-L73

So while it's called a _diff(), it effectively yields a full list of items from both sides of the index: items that may have been added, removed, modified, or left unchanged. In that sense, it behaves more like a complete listing, similar to index.iteritems(), as discussed earlier: https://github.com/iterative/dvc/pull/10749#discussion_r2135760239. (not_in_remote is only applied to change.old, which under _diff_index_to_wtree() corresponds to items from repo.index; change.new comes from the worktree index.)

The unchanged items are always computed unconditionally here, but only shown if --unchanged is explicitly passed in the CLI.

(The --unchanged flag is used by the DVC Extension for VSCode to render the file tree.)
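
To illustrate why a diff with with_unchanged=True acts like a full listing, here is a toy version. It is not DVC's _diff(): the Change tuple, the diff function, and the sample hashes are all hypothetical, and only the with_unchanged name and the old/new sides come from the explanation above.

```python
from typing import NamedTuple, Optional


class Change(NamedTuple):
    """Hypothetical change record; DVC's real objects differ."""

    key: str
    typ: str
    old: Optional[str]
    new: Optional[str]


def diff(old, new, with_unchanged=False):
    """Yield one Change per key found on either side."""
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before is None:
            typ = "added"
        elif after is None:
            typ = "deleted"
        elif before != after:
            typ = "modified"
        elif with_unchanged:
            typ = "unchanged"
        else:
            continue  # unchanged items are skipped unless requested
        yield Change(key, typ, before, after)


index = {"foo.txt": "d3b0", "bar.txt": "c157"}
worktree = {"foo.txt": "d3b0", "bar.txt": "ffff", "baz.txt": "0000"}

changes_only = [c.key for c in diff(index, worktree)]
full_listing = [c.key for c in diff(index, worktree, with_unchanged=True)]
# A remote check would only look at entries that exist on the old (index) side.
old_side = [c.key for c in diff(index, worktree, with_unchanged=True) if c.old is not None]
print(changes_only, full_listing, old_side)
```

With the flag set, every key from either side appears exactly once, so iterating the result is equivalent to walking both indexes together.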

Jun 18 '25 15:06 skshetry

@skshetry, updated the PR now, to use the worktree_view based approach.

Bit premature there... Need to sort out some issues before review.

Jun 19 '25 12:06 Northo

It is not clear to me why the failing tests are failing, or if it is related to these changes (main succeeds, so I assume so). Any help appreciated.

Jun 19 '25 14:06 Northo

Looks unrelated; maybe a new pytest release is to blame. Please ignore it. That would fail on main too, but maybe we were lucky there. I'll investigate separately.

Jun 19 '25 15:06 skshetry

Thanks for the guidance!

P.S. I also took the liberty of swapping out the kwargs in status for explicit arguments.
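
A small sketch of what that kind of change buys. Both functions and their parameter names are hypothetical and do not reflect the real signature of status; the point is only that explicit keyword-only arguments reject typos that **kwargs silently accepts.

```python
def status_with_kwargs(**kwargs):
    """Hypothetical 'before': any keyword is accepted, including typos."""
    return {
        "granular": kwargs.get("granular", False),
        "not_in_remote": kwargs.get("not_in_remote", False),
    }


def status_explicit(*, granular=False, not_in_remote=False):
    """Hypothetical 'after': the accepted options are part of the signature."""
    return {"granular": granular, "not_in_remote": not_in_remote}


# The misspelled option is silently ignored by the kwargs version...
silently_ignored = status_with_kwargs(not_in_remotee=True)

# ...but raises immediately with explicit arguments.
try:
    status_explicit(not_in_remotee=True)
    typo_rejected = False
except TypeError:
    typo_rejected = True

print(silently_ignored, typo_rejected)
```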

Jun 20 '25 08:06 Northo

@skshetry, what is your release cycle/policy? Really looking forward to getting this into our CI 🤩

Jul 02 '25 07:07 Northo

I am planning to release by early next week.

Jul 02 '25 10:07 skshetry

@Northo, I have created a new release. :)

Jul 08 '25 11:07 skshetry
