dvc icon indicating copy to clipboard operation
dvc copied to clipboard

`exp run`: git ignored files committed to experiments

Open mattseddon opened this issue 3 years ago • 9 comments

Bug Report

Description

I was working on a new command in the extension where you can "share an experiment". I got some surprising results in that the experiment's input data that is tracked by DVC and listed in .gitignore files was committed into Git. I discovered this when I sent the branch to the remote and opened a PR.

Reproduce

  1. run several checkpoint experiments in the workspace.
  2. pick one of the experiments.
  3. dvc exp branch [exp-name] [branch-name]
  4. dvc exp apply [exp-name]
  5. dvc push
  6. git push origin [branch-name]
  7. create a PR for the branch pushed to origin. Tracked data will be shown in the PR. e.g https://github.com/iterative/vscode-dvc/pull/2156

Expected

exp branch does not commit DVC tracked/git ignored data.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.16.0 (pip)
---------------------------------
Platform: Python 3.8.9 on macOS-12.5-arm64-arm-64bit
Supports:
        webhdfs (fsspec = 2022.5.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
        s3 (s3fs = 2022.5.0, boto3 = 1.21.21)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc (subdir), git

Additional Information (if any):

Could definitely be related to the way a branch is created for checkpoint experiments. I have not tested to see if the same problem occurs with non-checkpoint experiments.

mattseddon avatar Aug 09 '22 06:08 mattseddon

To clarify, I don't think the issue here is with exp branch. The problem here is that DVC-tracked/git-ignored files are being included in the commits generated by exp run. exp branch does not generate git commits, it just creates a new git branch ref (refs/heads/... that points to the existing experiment commit)

This bug just became more apparent due to using exp branch and pushing to github (and the unexpected files showing up in the github PR UI)

pmrowla avatar Aug 09 '22 09:08 pmrowla

This issue was also noted by a discord user, but it was unclear on how to reproduce it and not followed up into a separate bug report

https://discord.com/channels/485586884165107732/485596304961962003/1003959217826300025

pmrowla avatar Aug 09 '22 09:08 pmrowla

Thanks @pmrowla

mattseddon avatar Aug 09 '22 12:08 mattseddon

still bisecting, but it looks like this bug was introduced in 2.11.0

Using git checkout <exp-commit> from example-get-started runs:

2.10.2

example-get-started git:e5e08a2  py:example-get-started ❯ git ls-files
.dvc/.gitignore
.dvc/config
.dvc/plots/confusion.json
.dvc/plots/confusion_normalized.json
.dvc/plots/default.json
.dvc/plots/linear.json
.dvc/plots/scatter.json
.dvc/plots/smooth.json
.dvcignore
.github/workflows/cml.yaml
.gitignore
README.md
data/.gitignore
data/data.xml.dvc
dvc.lock
dvc.yaml
params.yaml
prc.json
roc.json
scores.json
src/evaluate.py
src/featurization.py
src/prepare.py
src/requirements.txt
src/train.py

2.11.0 (data/data.xml, data/features/, data/prepared/) have all been committed in git

example-get-started git:48306b2  py:example-get-started ❯ git ls-files
.dvc/.gitignore
.dvc/config
.dvc/plots/confusion.json
.dvc/plots/confusion_normalized.json
.dvc/plots/default.json
.dvc/plots/linear.json
.dvc/plots/scatter.json
.dvc/plots/smooth.json
.dvcignore
.github/workflows/cml.yaml
.gitignore
README.md
data/.gitignore
data/data.xml
data/data.xml.dvc
data/features/test.pkl
data/features/train.pkl
data/prepared/test.tsv
data/prepared/train.tsv
dvc.lock
dvc.yaml
model.pkl
params.yaml
prc.json
roc.json
scores.json
src/evaluate.py
src/featurization.py
src/prepare.py
src/requirements.txt
src/train.py

pmrowla avatar Aug 10 '22 04:08 pmrowla

Bug was introduced with https://github.com/iterative/dvc/pull/7712

When we reproduce the exp pipeline, we ensure that all deps and outputs are tracked in either DVC or git. So the end result is that all non-DVC-tracked deps are staged in git.

The check to determine whether or not a dep is already tracked by DVC uses dep.fs_path, which is no longer valid after the repo root change. The bug happens because the fs path will not be identified as a DVC file, so we end up force staging all pipeline deps in git.

https://github.com/iterative/dvc/blob/c134d4ab8e9fb2a56ffefeabbc5b636fb91a7b51/dvc/repo/reproduce.py#L61-L69

After the root fs change, we need to checking the dep path relative to repo root and not the local filesystem path

pmrowla avatar Aug 10 '22 05:08 pmrowla

@pmrowla seems like this is still happening for checkpoint experiments on the latest version: https://github.com/iterative/vscode-dvc/pull/2210/commits/02256ae7eff7b5022754f0b74e1a24bed76cdc96

$ dvc doctor
DVC version: 2.18.1 (pip)
---------------------------------
Platform: Python 3.8.9 on macOS-12.5-arm64-arm-64bit
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
        s3 (s3fs = 2022.5.0, boto3 = 1.21.21),
        webhdfs (fsspec = 2022.5.0)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc (subdir), git

Or is there something that I need to do to stop this from happening?

Let me know if you need any more info.

Checkout and remove
~/projects/vscode-dvc/demo share-experiment *1 ❯ git checkout 02256ae7eff7b5022754f0b74e1a24bed76cdc96
Note: switching to '02256ae7eff7b5022754f0b74e1a24bed76cdc96'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 02256ae7 dvc: commit experiment 69f54b895641ce4f3adf5ec15cde550e451167acde6c9467aeb6d3202906b727
~/projects/vscode-dvc/demo @02256ae7 *1 ❯ git rm -r --cached 'data/MNIST/raw'
rm 'demo/data/MNIST/raw/t10k-images-idx3-ubyte'
rm 'demo/data/MNIST/raw/t10k-images-idx3-ubyte.gz'
rm 'demo/data/MNIST/raw/t10k-labels-idx1-ubyte'
rm 'demo/data/MNIST/raw/t10k-labels-idx1-ubyte.gz'
rm 'demo/data/MNIST/raw/train-images-idx3-ubyte'
rm 'demo/data/MNIST/raw/train-images-idx3-ubyte.gz'
rm 'demo/data/MNIST/raw/train-labels-idx1-ubyte'
rm 'demo/data/MNIST/raw/train-labels-idx1-ubyte.gz'

mattseddon avatar Aug 18 '22 05:08 mattseddon

I can reproduce it, the issue here is that the demo stage dependency is

    deps:
    - data/MNIST

but the DVC (data) output is data/MNIST/raw, so we still end up explicitly git-adding data/MNIST without doing any further gitignore checks.

If I change the pipeline stage to use

    deps:
    - data/MNIST/raw

I get the expected behavior, with no DVC-tracked data being added in git.


I think for the demo project the pipeline stage dependency should be data/MNIST/raw, since the dependency is supposed to be on that data directory (data/MNIST/raw/) and not the data directory plus anything else in data/MNIST (which currently includes raw.dvc and .gitignore).

We do still have a bug in how exps handle combined git + DVC directory deps, but it probably needs to be addressed in scmrepo (and I don't think needs to be p0). The issue is that scm.add() has always been git add --force, but we are definitely at the point where we need a proper distinction between scm.add(force=True) and force=False.

e: opened https://github.com/iterative/scmrepo/issues/123

pmrowla avatar Aug 18 '22 06:08 pmrowla

I found this same issue and I'm running with similar workflow as described in the origin report.

DVC version: 2.22.0 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.4.0-122-generic-x86_64-with-glibc2.29
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        webdav (webdav4 = 0.9.7),
        webdavs (webdav4 = 0.9.7)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/ubuntu--vg-root
Caches: local
Remotes: webdavs, webdavs
Workspace directory: ext4 on /dev/mapper/ubuntu--vg-root
Repo: dvc (subdir), git

Is there a fix, i.e., workaround for this?

EDIT

I have a dvc stage ("train") with the following outs:

outs:
      - results/train/checkpoints
      - results/train/config.py:
          cache: false
      - results/train/train.log:
          cache: false

when I run dvc exp run and turn this into a branch (dvc exp branch) the checkpoints folder is committed to git. The push fails because checkpoints are over 100 MB. That's how I noticed.

The .gitignore is added with correct path and when I remove the checkpoints from git and use git status --ignored I see the checkpoints folder appear in the list. This happens for a file as well (not specific to directories).

I'm currently first creating a branch, change the parameters, and run the experiment. After which I commit it to the branch...

rick-van-veen avatar Sep 02 '22 07:09 rick-van-veen

@rick-van-veen Can you describe more of your issue or how to reproduce it?

dberenbaum avatar Sep 09 '22 17:09 dberenbaum

No longer an issue for me.

mattseddon avatar Jan 27 '23 03:01 mattseddon