
Remove all locally downloaded data

Open Wheest opened this issue 3 months ago • 5 comments

It could be that this feature already exists, but I've not been able to find it.

My use case is to remove all locally downloaded data in a Git repo (i.e., everything referenced by .dvc files and, optionally, .dvc/cache).

dvc destroy does not do what I want, as it also deletes the .dvc files.

Specifically, I want a weekend CI run on my repo that downloads all DVC data fresh from my cloud bucket (just in case there are any issues).

Wheest avatar Sep 26 '25 14:09 Wheest

@Wheest would dvc gc work for you - https://dvc.org/doc/command-reference/gc#gc ?

shcheklein avatar Sep 26 '25 18:09 shcheklein

@shcheklein Thanks! From what I understand, dvc gc only ever removes cache objects.

Even with flags like -w / --workspace, --all-commits, etc., it still only prunes objects in .dvc/cache - it never deletes the corresponding copies in the workspace.

What I was looking for is slightly different:

  • delete all items from the cache that are referenced by the current commit
  • delete all copies of those items in the current workspace (since they are reflinks/hardlinks, just deleting cache isn’t enough)

In a sense, I'd like to be able to get to the state where I've just done a git clone of the repo, but no dvc pull yet. This tests that my DVC repo and related pipelines still work from a fresh start.

dvc destroy is too destructive (removes .dvc files/metadata).
dvc gc is too conservative (only cache).

I hacked together a script that parses .dvc files and removes the declared outs paths, plus optionally .dvc/cache. It works for CI (where I want to make sure all data is re-pulled fresh from the remote), but it feels like something DVC could handle natively.

Bash script

#!/usr/bin/env bash
# Remove all DVC-tracked files in the current directory and its subdirectories
#
# Optionally take --remove-cache to also remove the .dvc/cache directory
# Usage: ./dvc_nuke.sh [--remove-cache]

set -euo pipefail

if [[ "${1:-}" == "--remove-cache" ]]; then
  echo "Removing .dvc/cache directory"
  rm -rf .dvc/cache
fi

# Use find -print0 to safely handle weird filenames
find . -type f -name '*.dvc' -print0 |
while IFS= read -r -d '' f; do
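  dir=$(dirname "$f")
  # A .dvc file can declare several outs; yq -r prints one raw path per line
  targets=$(yq -r '.outs[].path' "$f" 2>/dev/null || true)
  while IFS= read -r target; do
    # skip empty or null entries so we never run rm -rf on "$dir/" itself
    if [[ -z "$target" || "$target" == "null" ]]; then
      continue
    fi
    fullpath="$dir/$target"

    # safety guard: bail if the path is absolute or climbs out of $dir
    if [[ "$target" == /* || "/$target/" == *"/../"* ]]; then
      echo "Refusing to delete suspicious path: '$fullpath'" >&2
      exit 1
    fi

    echo "Removing $fullpath"
    rm -rf -- "$fullpath"
  done <<< "$targets"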
done

Maybe a dvc gc --workspace-data (or a new dvc purge-local-data) command would make sense for this use case?

Wheest avatar Sep 29 '25 09:09 Wheest

I have a very similar use case. I work in a large monorepo with DVC repos in multiple folders. When working on an analysis, I pull DVC data, create new data, run dvc push, and so on.

When I'm done, I would like to reclaim my disk space and clear the data in a safe manner. That is, for each file in the workspace, or in a list of files, if the current workspace file is already in the remote, delete the cache copy and the workspace copy.
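A minimal sketch of that rule, under loud assumptions: the remote is a plain local directory, both the cache and the remote store objects as files/md5/<first two hex chars>/<rest> (the DVC 3.x-style layout), and objects are plain MD5 digests of single files (directory outs are not handled). The names file_md5, object_relpath, and reclaim_if_pushed are hypothetical helpers, not DVC commands.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; none of these are part of DVC.

# file_md5 PATH: print the MD5 hex digest of a file (md5sum on Linux, md5 on macOS).
file_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum < "$1" | cut -d' ' -f1
  else
    md5 -q < "$1"
  fi
}

# object_relpath MD5: object path relative to a cache or remote root.
# ASSUMPTION: layout is files/md5/<first two hex chars>/<remaining chars>.
object_relpath() {
  printf 'files/md5/%s/%s\n' \
    "$(printf '%s' "$1" | cut -c1-2)" \
    "$(printf '%s' "$1" | cut -c3-)"
}

# reclaim_if_pushed REMOTE_DIR CACHE_DIR PATH...
# For each workspace file, hash its current content and delete the workspace
# copy and the cache object only if that exact content exists in the remote.
reclaim_if_pushed() {
  local remote_dir=$1 cache_dir=$2 path rel
  shift 2
  for path in "$@"; do
    rel=$(object_relpath "$(file_md5 "$path")")
    if [ -f "$remote_dir/$rel" ]; then
      echo "Removing $path (content found in remote)"
      rm -f -- "$path" "$cache_dir/$rel"
    else
      echo "Keeping $path (content not found in remote)" >&2
    fi
  done
}
```

Hashing the current workspace content, rather than trusting the hash recorded in the .dvc file, means a file that was modified after the last push is kept instead of deleted.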

"Maybe a dvc gc --workspace-data (or a new dvc purge-local-data) command would make sense for this use case?"

These sound about right.

rgoya avatar Oct 02 '25 04:10 rgoya

Scoping out what this might look like, let's say the CLI is:

dvc purge [--recursive] [--dry-run] [--force] [targets...]

Options:

  • --recursive -> descend into directories when purging.
  • targets... -> list of specific files/directories to purge, e.g. ., data/, etc.
  • --dry-run -> show what would be removed, without deleting anything.
  • --force -> bypass the safety check and purge regardless.

Behaviour

  1. Identify all outputs (outs) referenced in .dvc files for the given scope.

  2. Remove:

     • Cache objects in .dvc/cache corresponding to those outputs.

     • Workspace copies (regular files, reflinks, or hardlinks).

  3. Ensure:

     • Metadata remains intact.

     • Non-DVC files are untouched.

     • Partial purging (via targets) works safely.

Safety

Before deleting anything, dvc purge should check whether any DVC-tracked outs differ from the cache.

If differences exist:

  • Abort purge by default.

  • Print a message like:

ERROR: Some tracked outputs have uncommitted changes.
Use `--force` to purge anyway.
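
To make the proposed safety check concrete, here is a rough shell sketch of the two-phase behaviour: verify every output first, then delete. It is not how the PoC or DVC implements anything. purge_outs, cache_object, and file_md5 are hypothetical helpers, each output is passed as an MD5:PATH pair (the hash as it would be recorded in the .dvc file), only single-file outs are handled, and the cache layout files/md5/<first two hex chars>/<rest> is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the proposed purge semantics; not a DVC API.

# file_md5 PATH: print the MD5 hex digest of a file (md5sum on Linux, md5 on macOS).
file_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum < "$1" | cut -d' ' -f1
  else
    md5 -q < "$1"
  fi
}

# cache_object CACHE_DIR MD5: path of the cache object for a hash.
# ASSUMPTION: layout is <cache>/files/md5/<first two hex chars>/<remaining chars>.
cache_object() {
  printf '%s/files/md5/%s/%s\n' "$1" \
    "$(printf '%s' "$2" | cut -c1-2)" \
    "$(printf '%s' "$2" | cut -c3-)"
}

# purge_outs CACHE_DIR [--dry-run] [--force] MD5:PATH...
purge_outs() {
  local cache_dir=$1 dry_run=0 force=0 pair md5 path object
  shift
  while [ $# -gt 0 ]; do
    case $1 in
      --dry-run) dry_run=1; shift ;;
      --force)   force=1; shift ;;
      *)         break ;;
    esac
  done

  # Phase 1: abort before deleting anything if any workspace file was modified.
  if [ "$force" -eq 0 ]; then
    for pair in "$@"; do
      md5=${pair%%:*}
      path=${pair#*:}
      if [ -f "$path" ] && [ "$(file_md5 "$path")" != "$md5" ]; then
        echo "ERROR: Some tracked outputs have uncommitted changes." >&2
        echo 'Use `--force` to purge anyway.' >&2
        return 1
      fi
    done
  fi

  # Phase 2: remove (or just report) the workspace copy and the cache object.
  for pair in "$@"; do
    md5=${pair%%:*}
    path=${pair#*:}
    object=$(cache_object "$cache_dir" "$md5")
    if [ "$dry_run" -eq 1 ]; then
      echo "Would remove $path"
      echo "Would remove $object"
    else
      rm -f -- "$path" "$object"
    fi
  done
}
```

Running the check over all outputs before the first rm means a single modified file leaves the whole workspace untouched, which matches the "abort purge by default" behaviour described above.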

Wheest avatar Oct 02 '25 20:10 Wheest

I've sketched out a PoC of this feature in #10880

Wheest avatar Oct 03 '25 09:10 Wheest