purge: Add dvc purge command
This PR introduces a new dvc purge command to remove DVC-tracked outputs and their cache, while leaving stage metadata (.dvc files, dvc.yaml) intact. It's intended as a safer/faster alternative to manually deleting files and cache when cleaning up a workspace.
CLI
dvc purge [targets...] [--recursive] [--dry-run] [-f|--force] [-y]
- targets...: optional list of specific files/directories to purge. If omitted, the entire repo is considered.
- --recursive, -r: recurse into directories.
- --dry-run: show what would be removed, without deleting anything.
- --force, -f: bypass safety checks (dirty outputs, remote backup).
- --yes, -y: skip confirmation prompt.
Behaviour
- Collect outputs (outs) from .dvc files and dvc.yaml.
- For each output:
- Remove workspace copies (files/dirs).
- Remove corresponding objects from the local cache.
- Stage metadata remains intact.
- Non-DVC files are never touched.
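The behaviour listed above can be illustrated with a small, self-contained sketch. It does not use DVC's internal API; the `purge_outputs` helper, the flat cache layout, and the file names are all hypothetical and exist only to show which files are expected to survive a purge.

```python
import shutil
import tempfile
from pathlib import Path


def purge_outputs(outs, cache_dir, dry_run=False):
    """Delete workspace copies and cache objects for the given outputs.

    `outs` maps each tracked workspace path to the file name of its cache
    object (a simplified, hypothetical layout). Returns the paths that were
    removed, or that would be removed when `dry_run` is set.
    """
    removed = []
    for path, cache_name in outs.items():
        removed.append(path)
        if dry_run:
            continue  # report only, touch nothing
        if path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink(missing_ok=True)
        (cache_dir / cache_name).unlink(missing_ok=True)
    return removed


def demo():
    """Purge one tracked file and return the names of the surviving files."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        cache.mkdir()
        (root / "data.bin").write_bytes(b"payload")      # tracked output
        (cache / "abc123").write_bytes(b"payload")       # its cache object
        (root / "data.bin.dvc").write_text("outs: ...")  # stage metadata
        (root / "notes.txt").write_text("untracked")     # non-DVC file
        purge_outputs({root / "data.bin": "abc123"}, cache)
        return sorted(p.name for p in root.rglob("*") if p.is_file())


print(demo())
```

In this toy layout only the stage metadata and the untracked file remain after the call, which mirrors the last two bullets above.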
Safety Checks
Before purging, DVC performs two safety checks:
- Dirty outputs – if an output has been modified in the workspace and differs from cache:
  - Abort with PurgeError unless --force is used.
- Remote backup – if a default remote is configured, verify that all outputs are present remotely:
  - If missing -> abort unless --force.
  - If no remote is configured -> abort unless --force.
  - With --force, purge proceeds but logs a warning that data may be permanently lost.
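As a rough illustration of the two checks, the decision logic might look like the sketch below. Only the name `PurgeError` comes from this PR; the `check_safety` function, its parameters, and the message wording are assumptions made for the example and are not taken from dvc/repo/purge.py.

```python
class PurgeError(Exception):
    """Raised when a safety check fails and --force was not given."""


def check_safety(dirty, remote_configured, missing_in_remote, force=False):
    """Run both safety checks and return the warnings to log.

    dirty: outputs modified in the workspace (hypothetical input)
    remote_configured: whether a default remote exists
    missing_in_remote: outputs not found in the remote
    """
    warnings = []

    # Check 1: dirty outputs abort unless forced.
    if dirty and not force:
        raise PurgeError("Some tracked outputs have uncommitted changes.")

    # Check 2: remote backup must be verifiable unless forced.
    if not remote_configured:
        if not force:
            raise PurgeError("No default remote configured.")
        warnings.append("No default remote configured. Outputs may be permanently lost.")
    elif missing_in_remote:
        if not force:
            raise PurgeError("Some outputs are not present in the remote.")
        warnings.append("Some outputs are missing remotely and may be permanently lost.")

    return warnings
```

A call such as `check_safety([], False, [], force=True)` returns a single warning instead of raising, which corresponds to the `dvc purge --force -y` run shown in the example output.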
Example
$ dvc purge --dry-run
WARNING: This will permanently remove local DVC-tracked outputs for the entire workspace.
(dry-run: showing what would be removed, no changes).
ERROR: No default remote configured. Cannot safely purge outputs without verifying remote backup.
Use `--force` to purge anyway.
$ dvc purge --force -y
WARNING: This will permanently remove local DVC-tracked outputs for the entire workspace.
WARNING: No default remote configured. Proceeding with purge due to --force. Outputs may be permanently lost.
Removed 5 outputs (workspace + cache).
Tests
- ✅ Purge removes both workspace + cache copies, leaves .dvc metadata.
- ✅ Purge with targets removes only matching outs.
- ✅ Recursive purge works on nested dirs.
- ✅ Dry-run lists removals without making changes
- ✅ Dirty outs raise error unless --force
- ✅ Missing remote / missing objects raise error unless --force
- ✅ CLI tests for confirmation, -y, and force behavior.
Fixes #10874
Docs will be added in https://github.com/iterative/dvc.org/pull/5464
Codecov Report
:x: Patch coverage is 95.20958% with 16 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 91.04%. Comparing base (2431ec6) to head (8b073f8).
:warning: Report is 163 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| dvc/repo/purge.py | 87.71% | 9 Missing and 5 partials :warning: |
| dvc/commands/purge.py | 94.28% | 1 Missing and 1 partial :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #10880 +/- ##
==========================================
+ Coverage 90.68% 91.04% +0.36%
==========================================
Files 504 508 +4
Lines 39795 41301 +1506
Branches 3141 3276 +135
==========================================
+ Hits 36087 37603 +1516
- Misses 3042 3056 +14
+ Partials 666 642 -24
Specific question for reviewers: are there any parts of the code that duplicate existing helpers in the codebase I don't know about? (I haven't done much dev in DVC.)
Note: reviewers can get a feel for the tool by running:
# 1. Initialize a Git repo
git init dvc-repo
cd dvc-repo
# 2. Initialize DVC
dvc init
# 3. Create a few 1MB junk files
# (POSIX sh/bash loop; the original snippet used fish syntax)
for i in $(seq 1 5); do
  head -c 1M </dev/urandom > "file_$i.bin"
done
# 4. Add the files to DVC
dvc add file_*.bin
# 5. Commit changes to Git
git add .
git commit -m "Initialize DVC repo with 1MB junk files"
(be sure to have the DVC version from this PR installed).
1. Preview what files would be deleted
$ dvc purge --dry-run
WARNING: This will show what local DVC-tracked outputs would be removed for the entire workspace.
(dry-run: showing what would be removed, no changes).
ERROR: No default remote configured. Cannot safely purge outputs without verifying remote backup.
Use `--force` to purge anyway.
2. Preview what files would be deleted (with --force)
$ dvc purge --dry-run --force
WARNING: This will show what local DVC-tracked outputs would be removed for the entire workspace.
(dry-run: showing what would be removed, no changes).
WARNING: No default remote configured. Proceeding with purge due to --force. Outputs may be permanently lost.
[dry-run] Would remove file_4.bin
[dry-run] Would remove file_5.bin
[dry-run] Would remove file_1.bin
[dry-run] Would remove file_3.bin
[dry-run] Would remove file_2.bin
Nothing to purge.
3. Try and purge files that aren't backed up
$ dvc purge
WARNING: This will permanently remove local DVC-tracked outputs for the entire workspace.
Are you sure you want to proceed? [y/n]: y
ERROR: Some outputs are not present in the remote cache and would be permanently lost if purged:
- file_4.bin
- file_5.bin
- file_1.bin
- file_3.bin
- file_2.bin
Use `--force` to purge anyway.
4. Change a file, preview warnings
# append 10 random bytes at the end
$ dd if=/dev/urandom bs=1 count=10 >> file_1.bin
$ dvc purge --dry-run
WARNING: This will show what local DVC-tracked outputs would be removed for the entire workspace.
(dry-run: showing what would be removed, no changes).
ERROR: Some tracked outputs have uncommitted changes. Use `--force` to purge anyway.
- file_1.bin
5. Set up remote, preview what would be removed
$ mkdir -p /tmp/dvc-remote
$ dvc remote add -d local_remote /tmp/dvc-remote
$ dvc push
$ dvc purge --dry-run
WARNING: This will show what local DVC-tracked outputs would be removed for the entire workspace.
(dry-run: showing what would be removed, no changes).
[dry-run] Would remove file_4.bin
[dry-run] Would remove file_5.bin
[dry-run] Would remove file_1.bin
[dry-run] Would remove file_3.bin
[dry-run] Would remove file_2.bin
Nothing to purge.
6. Purge files that are confirmed to be backed up
$ dvc purge -y
WARNING: This will permanently remove local DVC-tracked outputs for the entire workspace.
Removed 5 outputs (workspace + cache).
Hi, thank you for creating the pull request. I am OOO, so please give me a few days to review this (and the problem statement/issue itself).
@rgoya does this fit your needs? Are there any features that you need that aren't represented?
Thanks for setting this up @Wheest ! Description of functionality looks pretty good. Maybe one more flag for your consideration: purging cache to match workspace.
In cases in which rm or other non-dvc purge mechanism was used to remove files/dirs from the workspace, an option to clear the cache of anything that is not currently represented in the workspace (but is in the remote) would be quite useful.
--not-checked-out (or similar): safely remove cached files for files that are not checked out
- User runs dvc pull/dvc checkout, which instantiates N files across the workspace
- User deletes M files from the workspace
- dvc purge --not-checked-out would:
  - Collect outputs (outs) from .dvc files and dvc.yaml. (N)
  - Identify outputs without copy in workspace (M)
  - For each output not in workspace (M):
    - Remove corresponding objects from the local cache.
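The selection step proposed here can be sketched as plain set logic. This is an illustration only, not DVC code: the `unused_cache_objects` function and its three inputs are hypothetical, and restricting removal to hashes known to be in the remote is one reading of "safely". The hashes are the ones for bar.txt and qux.txt from the reproduction script later in this thread.

```python
def unused_cache_objects(outs, workspace_paths, remote_hashes):
    """Return cache hashes that are safe to drop.

    outs: mapping of tracked path -> content hash (the N outputs)
    workspace_paths: set of tracked paths that currently exist on disk
    remote_hashes: set of hashes known to be backed up in the remote
    """
    # Hashes still needed by some output that is checked out.
    in_use = {h for path, h in outs.items() if path in workspace_paths}
    # Hashes of outputs missing from the workspace (the M outputs).
    missing = {h for path, h in outs.items() if path not in workspace_paths}
    # Keep anything still in use or not backed up remotely.
    return sorted((missing - in_use) & remote_hashes)


outs = {
    "bar.txt": "d3b07384d113edec49eaa6238ad5ff00",
    "qux.txt": "258622b1688250cb619f3c9ccaefb7eb",
}
# bar.txt was removed with `rm`; qux.txt is still checked out.
print(unused_cache_objects(outs, {"qux.txt"}, set(outs.values())))
```

Subtracting `in_use` matters when two outputs share identical content: the cache object stays as long as one of them is still checked out.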
Maybe one more flag for your consideration: purging cache to match workspace.
I think that might already be covered by gc, e.g., with
dvc gc --workspace
I'd say that gc is focussed on cleaning up the cache, whereas purge is about cleaning up the workspace (it also has some cache cleaning properties, but it's not the main focus).
Unfortunately dvc gc --workspace considers "things you have in your workspace" as "any object currently referenced by a *.dvc or dvc.lock file in your repo", regardless of whether you have it instantiated in your workspace or not.
# Setup repo
dvc init --subdir .
dvc remote add -d --project test s3://$BUCKET/purge_test/
git add .dvc/config
# Add two files
echo foo > bar.txt
echo baz > qux.txt
dvc add bar.txt qux.txt
dvc push
git add bar.txt.dvc qux.txt.dvc
git commit -m "dvc test files" -n
# Check remote status
dvc status --cloud
# Cache and remote 'test' are in sync.
cat bar.txt.dvc | grep "md5:"
#- md5: d3b07384d113edec49eaa6238ad5ff00
cat qux.txt.dvc | grep "md5:"
#- md5: 258622b1688250cb619f3c9ccaefb7eb
find . | egrep "b07384d|bar.txt|8622b|qux.txt"
#./.dvc/cache/files/md5/d3/b07384d113edec49eaa6238ad5ff00
#./.dvc/cache/files/md5/25/8622b1688250cb619f3c9ccaefb7eb
#./qux.txt
#./bar.txt.dvc
#./bar.txt
#./qux.txt.dvc
find . | egrep "b07384d|bar.txt|8622b|qux.txt" | wc -l
# 6
# This should do nothing
dvc gc --workspace
find . | egrep "b07384d|bar.txt|8622b|qux.txt" | wc -l
# 6
# Goal: remove bar.txt in shell, purge cache of unreferenced data
# Desired behaviour: command should clear bar.txt's entry in cache, but leave qux.txt's entry
# Let's try removing bar.txt
rm bar.txt
dvc gc --workspace
find . | egrep "b07384d|bar.txt|8622b|qux.txt"
#./.dvc/cache/files/md5/d3/b07384d113edec49eaa6238ad5ff00
#./.dvc/cache/files/md5/25/8622b1688250cb619f3c9ccaefb7eb
#./qux.txt
#./bar.txt.dvc
#./qux.txt.dvc
# This doesn't remove the cache entry of bar.txt
# Let's try removing bar.txt.dvc
rm bar.txt.dvc
dvc gc --workspace
find . | egrep "b07384d|bar.txt|8622b|qux.txt"
#./.dvc/cache/files/md5/25/8622b1688250cb619f3c9ccaefb7eb
#./qux.txt
#./qux.txt.dvc
# This removes the cache entry of bar.txt
#
# Conclusion: lack of bar.txt.dvc is what is interpreted as "not in the workspace"
# What about --not-in-remote?
#
# Restore workspace
git restore bar.txt.dvc
dvc pull
# This removes cache entries of things in remote, but leaves workspace instances
dvc pull bar.txt
dvc gc --workspace --not-in-remote
find . | egrep "b07384d|bar.txt|8622b|qux.txt"
#./qux.txt
#./bar.txt.dvc
#./bar.txt
#./qux.txt.dvc
Thanks @rgoya, I have added this feature as an exclusive flag, unused-cache. It will remove all items in the cache that are not currently checked out.
I made it exclusive because otherwise the table of interactions with the other purge arguments gets confusing to reason about (e.g., users might expect the order of arguments to influence when things are removed).
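For readers unfamiliar with the pattern, one way to enforce an exclusive flag with argparse is sketched below. This is not the code from this PR: the `--unused-cache` spelling, the `parse` helper, and the choice of which arguments it conflicts with (targets, --recursive, --force) are all assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a stand-in parser with the options described in this PR."""
    parser = argparse.ArgumentParser(prog="dvc purge")
    parser.add_argument("targets", nargs="*")
    parser.add_argument("-r", "--recursive", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("-f", "--force", action="store_true")
    parser.add_argument("-y", "--yes", action="store_true")
    # Hypothetical spelling of the new flag.
    parser.add_argument("--unused-cache", action="store_true")
    return parser


def parse(argv):
    """Parse argv and reject combinations that mix modes."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed conflict set; the real command may differ.
    if args.unused_cache and (args.targets or args.recursive or args.force):
        # parser.error prints usage to stderr and exits with status 2.
        parser.error("--unused-cache cannot be combined with targets, -r, or -f")
    return args
```

Checking after parsing, rather than relying on argument order, keeps the result the same however the user arranges the flags.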