Remove all locally downloaded data
It could be that this feature already exists, but I've not been able to find it.
My use case is that I want to remove all locally downloaded data in a git repo (i.e., everything referenced by .dvc files, and optionally, the .dvc/cache).
dvc destroy does not do what I want, as it also deletes the .dvc files.
My use-case is I want to do a weekend CI run on my repo, making sure that I'm downloading all DVC data fresh from my cloud bucket (just in case there are any issues).
@Wheest would dvc gc work for you - https://dvc.org/doc/command-reference/gc#gc ?
@shcheklein Thanks! From what I understand, dvc gc only ever removes cache objects.
Even with flags like -w / --workspace, --all-commits, etc., it still only prunes objects in .dvc/cache - it never deletes the corresponding copies in the workspace.
What I was looking for is slightly different:
- delete all items from the cache that are referenced by the current commit
- delete all copies of those items in the current workspace (since they are reflinks/hardlinks, just deleting cache isn’t enough)
In a sense, I'd like to be able to get to the state where I've just done a git clone of the repo, but no dvc pull yet. This tests that my DVC repo and related pipelines still work from a fresh start.
dvc destroy is too destructive (removes .dvc files/metadata).
dvc gc is too conservative (only cache).
I hacked together a script that parses .dvc files and removes the declared outs paths, plus optionally .dvc/cache. It works for CI (where I want to make sure all data is re-pulled fresh from the remote), but it feels like something DVC could handle natively.
Bash script
#!/usr/bin/env bash
# Remove all DVC-tracked files in the current directory and its subdirectories
#
# Optionally take --remove-cache to also remove the .dvc/cache directory
# Usage: ./dvc_nuke.sh [--remove-cache]
set -euo pipefail

if [[ "${1:-}" == "--remove-cache" ]]; then
    echo "Removing .dvc/cache directory"
    rm -rf .dvc/cache
fi

# Use find -print0 to safely handle weird filenames
find . -type f -name '*.dvc' -print0 |
while IFS= read -r -d '' f; do
    dir=$(dirname "$f")
    # A .dvc file may declare several outs, so handle one path per line
    # (assumes a yq that prints raw, unquoted strings)
    while IFS= read -r target; do
        [[ -n "$target" ]] || continue
        fullpath="$dir/$target"
        # safety guard: bail if fullpath is empty or root
        if [[ -z "$fullpath" || "$fullpath" == "/" ]]; then
            echo "Refusing to delete suspicious path: '$fullpath'" >&2
            exit 1
        fi
        echo "Removing $fullpath"
        rm -rf -- "$fullpath"
    done < <(yq '.outs[].path' "$f" 2>/dev/null || true)
done
Maybe a dvc gc --workspace-data (or a new dvc purge-local-data) command would make sense for this use case?
I have a very similar use case. I work in a large mono repo and we have dvc repos in multiple folders. When working on an analysis I'll be pulling dvc data, creating new data, dvc pushing, etc.
When I'm done, I would like to reclaim my disk space and clear the data in a safe manner. That is, for each file in the workspace, or in a list of files, if the current workspace file is already in the remote, delete the cache copy and the workspace copy.
Maybe a dvc gc --workspace-data (or a new dvc purge-local-data) command would make sense for this use case?
These sound about right.
Scoping out what this might look like, let's say CLI is:
dvc purge [--recursive] [--dry-run] [--force] [targets...]
Options:
- --recursive -> descend into directories when purging.
- targets... -> list of specific files/directories to purge, e.g. ., data/, etc.
- --dry-run -> show what would be removed, without deleting anything.
- --force -> bypass the safety check and purge regardless.
Behaviour
- Identify all outputs (outs) referenced in .dvc files for the given scope.
- Remove:
  - Cache objects in .dvc/cache corresponding to those outputs.
  - Workspace copies (regular files, reflinks, or hardlinks).
- Ensure:
  - Metadata remains intact.
  - Non-DVC files are untouched.
  - Partial purging (via targets) works safely.
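To make the "identify outputs for the given scope" step and the --dry-run idea concrete, here is a rough bash sketch. It is not how DVC or the PoC does it: the function names outs_of and purge_dry_run are made up for illustration, and the awk parser is deliberately naive (it assumes simple, unquoted path: values under a top-level outs: key). It only lists what a purge would remove and deletes nothing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the workspace paths a purge would remove,
# without deleting anything (a "dry run").
set -euo pipefail

# Print every outs path declared in one .dvc file, resolved relative to
# that file's directory. Naive parser: only looks inside the top-level
# `outs:` section and assumes simple, unquoted paths.
outs_of() {
    local dvcfile=$1 dir
    dir=$(dirname "$dvcfile")
    awk '
        /^outs:/   { in_outs = 1; next }
        /^[^ \t-]/ { in_outs = 0 }
        in_outs && /^[ \t-]*path:/ { sub(/^[ \t-]*path:[ \t]*/, ""); print }
    ' "$dvcfile" |
    while IFS= read -r p; do
        printf '%s/%s\n' "$dir" "$p"
    done
}

# List purge candidates under the given targets (default: current directory).
# Targets are treated as directories (or .dvc files) to search.
purge_dry_run() {
    [[ $# -gt 0 ]] || set -- .
    find "$@" -type f -name '*.dvc' -print0 |
    while IFS= read -r -d '' f; do
        outs_of "$f"
    done
}
```

Calling purge_dry_run with no arguments scans the whole working directory, while purge_dry_run data/ restricts the scan to .dvc files under data/. Restricting the parser to the outs: section matters because imported .dvc files also carry path: entries under deps:, which should not be deleted.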
Safety
Before deleting, dvc purge should check whether any DVC-tracked outs differ from cache.
If differences exist:
- Abort purge by default.
- Print a message like:
ERROR: Some tracked outputs have uncommitted changes.
Use `--force` to purge anyway.
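As a rough illustration of that check, the sketch below compares each tracked file against the md5 recorded in its .dvc file before anything is purged. This is only an approximation under stated assumptions: out_hashes and outs_unchanged are hypothetical names, md5sum (GNU coreutils) is assumed to be available, the recorded md5 is assumed to be a plain content hash, and directory outs (md5 ending in .dir) are skipped rather than verified.

```shell
#!/usr/bin/env bash
# Hypothetical pre-purge guard: refuse to purge if a tracked file no longer
# matches the md5 recorded in its .dvc file.
set -euo pipefail

# Print "md5<TAB>path" for each out in a .dvc file. Naive parser: assumes
# md5 appears before path within each outs entry.
out_hashes() {
    awk '
        /^outs:/   { in_outs = 1; next }
        /^[^ \t-]/ { in_outs = 0 }
        in_outs && /^[ \t-]*md5:/  { sub(/^[ \t-]*md5:[ \t]*/, "");  md5 = $0 }
        in_outs && /^[ \t-]*path:/ { sub(/^[ \t-]*path:[ \t]*/, ""); print md5 "\t" $0 }
    ' "$1"
}

# Return 0 if every single-file out matches its recorded md5, 1 otherwise.
# Directory outs (md5 ending in .dir) and missing files are skipped.
outs_unchanged() {
    local dvcfile=$1 dir md5 path actual
    dir=$(dirname "$dvcfile")
    while IFS=$'\t' read -r md5 path; do
        if [[ "$md5" == *.dir || ! -f "$dir/$path" ]]; then
            continue
        fi
        actual=$(md5sum "$dir/$path" | cut -d' ' -f1)
        if [[ "$actual" != "$md5" ]]; then
            echo "ERROR: Some tracked outputs have uncommitted changes." >&2
            echo "Use \`--force\` to purge anyway." >&2
            return 1
        fi
    done < <(out_hashes "$dvcfile")
}
```

A purge script could call outs_unchanged on each .dvc file in scope and only proceed to delete when it returns 0, or when a force option was given. A native implementation would presumably reuse DVC's own status machinery instead of re-hashing files by hand.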
I've sketched out a PoC of this feature in #10880