dvc
dvc copied to clipboard
`exp run`: git ignored files committed to experiments
Bug Report
Description
I was working on a new command in the extension where you can "share an experiment". I got some surprising results in that the experiment's input data that is tracked by DVC and listed in .gitignore files was committed into Git. I discovered this when I sent the branch to the remote and opened a PR.
Reproduce
- run several checkpoint experiments in the workspace.
- pick one of the experiments.
- dvc exp branch [exp-name] [branch-name]
- dvc exp apply [exp-name]
- dvc push
- git push origin [branch-name]
- create a PR for the branch pushed to origin. Tracked data will be shown in the PR. e.g https://github.com/iterative/vscode-dvc/pull/2156
Expected
exp branch does not commit DVC tracked/git ignored data.
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 2.16.0 (pip)
---------------------------------
Platform: Python 3.8.9 on macOS-12.5-arm64-arm-64bit
Supports:
webhdfs (fsspec = 2022.5.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
https (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
s3 (s3fs = 2022.5.0, boto3 = 1.21.21)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc (subdir), git
Additional Information (if any):
Could definitely be related to the way a branch is created for checkpoint experiments. I have not tested to see if the same problem occurs with non-checkpoint experiments.
To clarify, I don't think the issue here is with exp branch. The problem here is that DVC-tracked/git-ignored files are being included in the commits generated by exp run. exp branch does not generate git commits, it just creates a new git branch ref (refs/heads/... that points to the existing experiment commit)
This bug just became more apparent due to using exp branch and pushing to github (and the unexpected files showing up in the github PR UI)
This issue was also noted by a discord user, but it was unclear on how to reproduce it and not followed up into a separate bug report
https://discord.com/channels/485586884165107732/485596304961962003/1003959217826300025
Thanks @pmrowla
still bisecting, but it looks like this bug was introduced in 2.11.0
Using git checkout <exp-commit> from example-get-started runs:
2.10.2
example-get-started git:e5e08a2 py:example-get-started ❯ git ls-files
.dvc/.gitignore
.dvc/config
.dvc/plots/confusion.json
.dvc/plots/confusion_normalized.json
.dvc/plots/default.json
.dvc/plots/linear.json
.dvc/plots/scatter.json
.dvc/plots/smooth.json
.dvcignore
.github/workflows/cml.yaml
.gitignore
README.md
data/.gitignore
data/data.xml.dvc
dvc.lock
dvc.yaml
params.yaml
prc.json
roc.json
scores.json
src/evaluate.py
src/featurization.py
src/prepare.py
src/requirements.txt
src/train.py
2.11.0 (data/data.xml, data/features/, data/prepared/) have all been committed in git
example-get-started git:48306b2 py:example-get-started ❯ git ls-files
.dvc/.gitignore
.dvc/config
.dvc/plots/confusion.json
.dvc/plots/confusion_normalized.json
.dvc/plots/default.json
.dvc/plots/linear.json
.dvc/plots/scatter.json
.dvc/plots/smooth.json
.dvcignore
.github/workflows/cml.yaml
.gitignore
README.md
data/.gitignore
data/data.xml
data/data.xml.dvc
data/features/test.pkl
data/features/train.pkl
data/prepared/test.tsv
data/prepared/train.tsv
dvc.lock
dvc.yaml
model.pkl
params.yaml
prc.json
roc.json
scores.json
src/evaluate.py
src/featurization.py
src/prepare.py
src/requirements.txt
src/train.py
Bug was introduced with https://github.com/iterative/dvc/pull/7712
When we reproduce the exp pipeline, we ensure that all deps and outputs are tracked in either DVC or git. So the end result is that all non-DVC-tracked deps are staged in git.
The check to determine whether or not a dep is already tracked by DVC uses dep.fs_path, which is no longer valid after the repo root change. The bug happens because the fs path will not be identified as a DVC file, so we end up force staging all pipeline deps in git.
https://github.com/iterative/dvc/blob/c134d4ab8e9fb2a56ffefeabbc5b636fb91a7b51/dvc/repo/reproduce.py#L61-L69
After the root fs change, we need to checking the dep path relative to repo root and not the local filesystem path
@pmrowla seems like this is still happening for checkpoint experiments on the latest version: https://github.com/iterative/vscode-dvc/pull/2210/commits/02256ae7eff7b5022754f0b74e1a24bed76cdc96
$ dvc doctor
DVC version: 2.18.1 (pip)
---------------------------------
Platform: Python 3.8.9 on macOS-12.5-arm64-arm-64bit
Supports:
http (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
https (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
s3 (s3fs = 2022.5.0, boto3 = 1.21.21),
webhdfs (fsspec = 2022.5.0)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc (subdir), git
Or is there something that I need to do to stop this from happening?
Let me know if you need any more info.
Checkout and remove
~/projects/vscode-dvc/demo share-experiment *1 ❯ git checkout 02256ae7eff7b5022754f0b74e1a24bed76cdc96
Note: switching to '02256ae7eff7b5022754f0b74e1a24bed76cdc96'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
git switch -c <new-branch-name>
Or undo this operation with:
git switch -
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at 02256ae7 dvc: commit experiment 69f54b895641ce4f3adf5ec15cde550e451167acde6c9467aeb6d3202906b727
~/projects/vscode-dvc/demo @02256ae7 *1 ❯ git rm -r --cached 'data/MNIST/raw'
rm 'demo/data/MNIST/raw/t10k-images-idx3-ubyte'
rm 'demo/data/MNIST/raw/t10k-images-idx3-ubyte.gz'
rm 'demo/data/MNIST/raw/t10k-labels-idx1-ubyte'
rm 'demo/data/MNIST/raw/t10k-labels-idx1-ubyte.gz'
rm 'demo/data/MNIST/raw/train-images-idx3-ubyte'
rm 'demo/data/MNIST/raw/train-images-idx3-ubyte.gz'
rm 'demo/data/MNIST/raw/train-labels-idx1-ubyte'
rm 'demo/data/MNIST/raw/train-labels-idx1-ubyte.gz'
I can reproduce it, the issue here is that the demo stage dependency is
deps:
- data/MNIST
but the DVC (data) output is data/MNIST/raw, so we still end up explicitly git-adding data/MNIST without doing any further gitignore checks.
If I change the pipeline stage to use
deps:
- data/MNIST/raw
I get the expected behavior, with no DVC-tracked data being added in git.
I think for the demo project the pipeline stage dependency should be data/MNIST/raw, since the dependency is supposed to be on that data directory (data/MNIST/raw/) and not the data directory plus anything else in data/MNIST (which currently includes raw.dvc and .gitignore).
We do still have a bug in how exps handle combined git + DVC directory deps, but it probably needs to be addressed in scmrepo (and I don't think needs to be p0). The issue is that scm.add() has always been git add --force, but we are definitely at the point where we need a proper distinction between scm.add(force=True) and force=False.
e: opened https://github.com/iterative/scmrepo/issues/123
I found this same issue and I'm running with similar workflow as described in the origin report.
DVC version: 2.22.0 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.4.0-122-generic-x86_64-with-glibc2.29
Supports:
http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
webdav (webdav4 = 0.9.7),
webdavs (webdav4 = 0.9.7)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/ubuntu--vg-root
Caches: local
Remotes: webdavs, webdavs
Workspace directory: ext4 on /dev/mapper/ubuntu--vg-root
Repo: dvc (subdir), git
Is there a fix, i.e., workaround for this?
EDIT
I have a dvc stage ("train") with the following outs:
outs:
- results/train/checkpoints
- results/train/config.py:
cache: false
- results/train/train.log:
cache: false
when I run dvc exp run and turn this into a branch (dvc exp branch) the checkpoints folder is committed to git. The push fails because checkpoints are over 100 MB. That's how I noticed.
The .gitignore is added with correct path and when I remove the checkpoints from git and use git status --ignored I see the checkpoints folder appear in the list. This happens for a file as well (not specific to directories).
I'm currently first creating a branch, change the parameters, and run the experiment. After which I commit it to the branch...
@rick-van-veen Can you describe more of your issue or how to reproduce it?
No longer an issue for me.