dvc
feat: ignore files not in remote when push is false
Fixes #10317
Adds an opt-in flag that removes `push: false` stage outputs from the `not_in_remote` results of `dvc data status`.
Notable changes:
- Add `outs_no_push` to the `dvc.stage.utils.fill_stage_outputs` keys, to facilitate making outputs with `push: false`.
- In `status`, when the flag is enabled, filter the files reported as `not_in_remote` and remove those that are not `can_push`.
- Add the corresponding `--respect-no-push` flag to the CLI.
Open to suggestions on how to make the flag names more intuitive!
Corresponding PR for the docs: https://github.com/iterative/dvc.org/pull/5373
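For readers skimming the thread, the filtering idea described in the notable changes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Output` class and the `filter_not_in_remote` helper are hypothetical stand-ins, not DVC's actual classes or the code in this PR; only the names `can_push`, `not_in_remote` and the opt-in behavior come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    """Hypothetical stand-in for a stage output (not DVC's real class)."""

    path: str
    can_push: bool = True  # False models an output declared with `push: false`


def filter_not_in_remote(not_in_remote, outputs, respect_no_push=False):
    """Return the reported paths, dropping non-pushable ones if opted in."""
    if not respect_no_push:
        # Default behavior: report everything that is missing from the remote.
        return list(not_in_remote)
    no_push = {out.path for out in outputs if not out.can_push}
    return [path for path in not_in_remote if path not in no_push]


outputs = [Output("foo.txt"), Output("bar.txt", can_push=False)]
reported = ["foo.txt", "bar.txt"]

print(filter_not_in_remote(reported, outputs))
print(filter_not_in_remote(reported, outputs, respect_no_push=True))
```

With the flag off, both paths are still reported; with it on, only `foo.txt` remains, because `bar.txt` can never be pushed in this toy setup.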
- [x] ❗ I have followed the Contributing to DVC checklist.
- [x] 📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏
Codecov Report
Attention: Patch coverage is 88.57143% with 4 lines in your changes missing coverage. Please review.
Project coverage is 91.06%. Comparing base (2431ec6) to head (b6b18ef). Report is 68 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| dvc/repo/data.py | 78.94% | 3 Missing and 1 partial :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #10749 +/- ##
==========================================
+ Coverage 90.68% 91.06% +0.38%
==========================================
Files 504 504
Lines 39795 40040 +245
Branches 3141 3164 +23
==========================================
+ Hits 36087 36462 +375
+ Misses 3042 2950 -92
+ Partials 666 628 -38
@skshetry, have you had time to look at this? This feature would be really great for our team!
@skshetry, thanks for the thorough review, and sorry for the very late reply from me.
When reading up on your suggestions, especially on the change from diffing index - workspace to inspecting index directly, I realized I am somewhat confused about the intended/expected behavior of dvc data status.
What is the difference between dvc data status and dvc status? In the docs, it says:
- dvc status: Show changes in the project pipelines, as well as file mismatches either between the cache and workspace, or between the cache and remote storage. For the status of tracked data, see dvc data status (similar to git status).
- dvc data status: Show changes to the files and directories tracked by DVC in the workspace. For the status of data pipelines, see dvc status.
I first thought this meant there was a difference between files generated with pipelines (dvc.yaml) and directly tracked (<filename>.dvc).
However, after some investigation, I believe I have misunderstood.
I made a simple example to investigate: foo.txt is added directly, and bar.txt is created with a pipeline. I then ran dvc status -c, dvc status, and dvc data status --not-in-remote at different stages. The results are below.
❯ source ./run_demo.sh
==================================
## Before repro
==================================
>>> dvc status -c --json
{
"bar.txt": "missing",
"foo.txt": "missing"
}
>>> dvc status --json
{
"create-bar": [
{
"changed outs": {
"bar.txt": "not in cache"
}
}
],
"foo.txt.dvc": [
{
"changed outs": {
"foo.txt": "not in cache"
}
}
]
}
>>> dvc data status --not-in-remote --json
{
"not_in_cache": [
"foo.txt",
"bar.txt"
],
"not_in_remote": [
"foo.txt",
"bar.txt"
],
"committed": {
"not_in_remote": [
"foo.txt",
"bar.txt"
]
}
}
==================================
## Running repro
==================================
>>> dvc repro
Running stage 'create-bar':
> echo bar > bar.txt
Use `dvc push` to send your updates to remote storage.
==================================
## After repro
==================================
>>> dvc status -c --json
{
"bar.txt": "new",
"foo.txt": "missing"
}
>>> dvc status --json
{
"foo.txt.dvc": [
{
"changed outs": {
"foo.txt": "not in cache"
}
}
]
}
>>> dvc data status --not-in-remote --json
{
"not_in_cache": [
"foo.txt"
],
"not_in_remote": [
"foo.txt",
"bar.txt"
],
"committed": {
"not_in_remote": [
"foo.txt",
"bar.txt"
]
}
}
==================================
## Add and commit foo
==================================
echo foo > foo.txt
>>> dvc commit foo.txt
==================================
## After commit
==================================
>>> dvc status -c --json
{
"bar.txt": "new",
"foo.txt": "new"
}
>>> dvc status --json
{}
>>> dvc data status --not-in-remote --json
{
"not_in_remote": [
"bar.txt",
"foo.txt"
],
"committed": {
"not_in_remote": [
"bar.txt",
"foo.txt"
]
}
}
==================================
## Running push
==================================
>>> dvc push
Collecting |2.00 [00:00, 521entry/s]
Pushing
2 files pushed
==================================
## After push
==================================
>>> dvc status -c --json
{}
>>> dvc status --json
{}
>>> dvc data status --not-in-remote --json
{
"committed": {
"not_in_remote": [
"bar.txt",
"foo.txt"
]
}
}
It does seem to contain the same information, structured slightly differently. Is there a subtle difference here I am missing, or do they have overlapping functionality? Thanks for any help in clarifying this.
Also: the committed.not_in_remote entries seem a bit strange? They are still listed after the push.
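To make the overlap concrete, here is a small Python sketch that compares the two JSON payloads from the "After commit" state above. The helper names are made up for illustration, and it assumes that "new" in the dvc status -c output means "in the cache but not in the remote", which is how the transcript behaves.

```python
import json

# Payloads copied from the "After commit" state in the transcript above.
cloud_status = json.loads('{"bar.txt": "new", "foo.txt": "new"}')
data_status = json.loads(
    """
    {
      "not_in_remote": ["bar.txt", "foo.txt"],
      "committed": {"not_in_remote": ["bar.txt", "foo.txt"]}
    }
    """
)


def missing_from_remote_cloud(status):
    """Paths that `dvc status -c` reports as cached but not yet pushed."""
    return {path for path, state in status.items() if state == "new"}


def missing_from_remote_data(status):
    """Paths listed under the top-level `not_in_remote` key."""
    return set(status.get("not_in_remote", []))


same = missing_from_remote_cloud(cloud_status) == missing_from_remote_data(data_status)
print("Same set of files:", same)
```

For this state both commands point at the same two files; the difference is only in how the result is structured.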
Repro of example
>>> .dvc/.gitignore
/config.local
/tmp
/cache
>>> .dvc/config
[core]
    remote = localremote
['remote "localremote"']
    url = ../localremote
>>> .dvcignore
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore
>>> .gitignore
/bar.txt
/foo.txt
localremote
>>> dvc.lock
schema: '2.0'
stages:
  create-bar:
    cmd: echo bar > bar.txt
    outs:
    - path: bar.txt
      hash: md5
      md5: c157a79031e1c40f85931829bc5fc552
      size: 4
>>> dvc.yaml
stages:
  create-bar:
    cmd: echo bar > bar.txt
    outs:
    - bar.txt
>>> foo.txt.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 4
  hash: md5
  path: foo.txt
>>> run_demo.sh
#!/bin/bash
## Clean
rm -rf localremote .dvc/cache .dvc/tmp
# Print a command, then execute it
run() {
    echo ">>> $*"
    "$@"
    echo
}
# Show all three status views for the current state
status() {
    run dvc status -c --json
    run dvc status --json
    run dvc data status --not-in-remote --json
}
echo "=================================="
echo "## Before repro"
echo "=================================="
status
echo
echo "=================================="
echo "## Running repro"
echo "=================================="
run dvc repro
echo
echo "=================================="
echo "## After repro"
echo "=================================="
status
echo
echo "=================================="
echo "## Add and commit foo"
echo "=================================="
echo "echo foo > foo.txt"
echo "foo" > foo.txt
run dvc commit foo.txt
echo
echo "=================================="
echo "## After commit"
echo "=================================="
status
echo
echo "=================================="
echo "## Running push"
echo "=================================="
run dvc push
echo
echo "=================================="
echo "## After push"
echo "=================================="
status
> What is the difference between dvc data status and dvc status? In the docs, it says:
The dvc status command shows the state of your pipelines by detecting changes in tracked outputs, dependencies, and the commands. However, its scope is limited, and it can only indicate whether the tracked dependency/output/command has changed or not. It does not show you how the data changed. For example, there is no way to see granular changes within a tracked directory, which was an often requested feature:
- https://github.com/iterative/dvc/issues/2180
dvc data status supports showing granular changes with --granular (ideally this should be the default if we fix performance issues with it). It is also custom built as a data(set) management command, to show you the current state of your tracked datasets, based on users' feedback asking for a tool to understand the state of tracked data.
Data from pipeline outputs is still "data" tracked by DVC, so it is shown by default. (Unlike dvc status, dvc data status ignores dependencies unless they are also an output somewhere in the pipeline.) If there's demand for filtering those out, dvc data status would support it, but dvc status is unlikely to support that.
The dvc data status command is focused on data, while dvc status is focused on pipelines.
dvc data status also powers the file-tree view and decorations in the "DVC Extension for VSCode". So some requirements also came from there.
If you are using dvc for data management, use dvc data status. If you are using it to check changes to your pipelines, use dvc status. dvc data status is a new command, so any new features related to data or data management are more likely to be implemented there than in dvc status.
I see, thank you, that helps. I just got a bit confused by the not-in-remote being computed inside the _diff. I'll respond directly in the review comments for specific questions. Will try to have a reviewed PR ready for end of week.
> I just got a bit confused by the not-in-remote being computed inside the _diff.
Note that _diff(...) here also returns "unchanged" items. That's because we set with_unchanged=True.
https://github.com/iterative/dvc/blob/c7c7ba69fbe093b0889228119964f357653e6973/dvc/repo/data.py#L70-L73
So while it's called a _diff(), it effectively yields a full list of items from both sides of the index - items that may have been added, removed, modified, or left unchanged. In that sense, it's behaving more like a complete listing, similar to index.iteritems(), as discussed earlier: https://github.com/iterative/dvc/pull/10749#discussion_r2135760239.
(not_in_remote is only applied to change.old, which under _diff_index_to_wtree() corresponds to items from repo.index; change.new comes from the worktree index.)
The unchanged items are always computed unconditionally here, but only shown if --unchanged is explicitly passed in the CLI.
(The --unchanged flag is used by the DVC Extension for VSCode to render the file tree.)
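A toy illustration of that point, as a hedged sketch rather than DVC's implementation: the `Change` tuple and `diff` function below are hypothetical and work on plain dicts of path to hash. They only show why a diff that also yields unchanged entries ends up listing every item from both sides, and that the old side holds exactly the index entries.

```python
from typing import Dict, Iterator, NamedTuple, Optional


class Change(NamedTuple):
    typ: str  # "added", "deleted", "modified" or "unchanged"
    key: str
    old: Optional[str]  # hash on the index side, None if absent
    new: Optional[str]  # hash on the worktree side, None if absent


def diff(
    old: Dict[str, str], new: Dict[str, str], with_unchanged: bool = False
) -> Iterator[Change]:
    """Yield one Change per key found on either side (toy model)."""
    for key in sorted(old.keys() | new.keys()):
        o, n = old.get(key), new.get(key)
        if o is None:
            yield Change("added", key, None, n)
        elif n is None:
            yield Change("deleted", key, o, None)
        elif o != n:
            yield Change("modified", key, o, n)
        elif with_unchanged:
            yield Change("unchanged", key, o, n)


index = {"foo.txt": "aaa", "bar.txt": "bbb"}
worktree = {"foo.txt": "aaa", "bar.txt": "ccc", "baz.txt": "ddd"}

changes = list(diff(index, worktree, with_unchanged=True))
all_keys = [c.key for c in changes]
# A remote check would only look at the old (index) side of each change.
old_side = [c.key for c in changes if c.old is not None]
print(all_keys)
print(old_side)
```

With `with_unchanged=True` every key from either side appears once, so iterating the diff and looking only at `old` visits the same entries as iterating the index directly.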
@skshetry, updated the PR now, to use the worktree_view based approach.
Bit premature there... Need to sort out some issues before review.
It is not clear to me why the failing tests are failing, or if it is related to these changes (main succeeds, so I assume so). Any help appreciated.
> It is not clear to me why the failing tests are failing, or if it is related to these changes (main succeeds, so I assume so). Any help appreciated.
Looks unrelated; maybe the new pytest release is to blame. Please ignore it. That would fail on main too, but maybe we were lucky there. I'll investigate separately.
Thanks for the guidance!
PS: I also took the liberty of swapping out the kwargs in status for explicit arguments.
@skshetry, what is your release cycle/policy? Really looking forward to getting this into our CI 🤩
> @skshetry, what is your release cycle/policy? Really looking forward to getting this into our CI 🤩
I am planning to release by early next week.
@Northo, I have created a new release. :)