dvc and git do not behave the same with "!" and "**"
Consider the following project structure:
- data
  - data1
    - file1
    - file1.dvc
  - data2
    - file2
    - file2.dvc
- data1
- .gitignore
.gitignore is as follows:
data/**
!data/*/
!*.dvc
git status gives:

while dvc push gives:

I expect git and dvc to behave the same with .gitignore.
- dvc: 2.8.3
- python: 3.7
Your .dvc files are actually still ignored by those .gitignore patterns. The difference here is that you have explicitly staged them in git (presumably with git add -f). You can verify this yourself by running:
git check-ignore --no-index data/data1/file.dvc
DVC only checks the actual .gitignore patterns, and does not currently check if ignored files have been explicitly (force) staged in git (see https://github.com/iterative/dvc/issues/6291)
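The difference between the two checks can be sketched in Python by shelling out to git. This is only an illustration, assuming git is on PATH; check_ignore is a hypothetical helper name and is not part of DVC. The -q and --no-index flags are existing git check-ignore options.

```python
import subprocess


def check_ignore(repo_dir, path, no_index=False):
    """Return True if `git check-ignore` reports `path` as ignored.

    Hypothetical helper for illustration only (not DVC code).
    With no_index=True the index is not consulted, so a force-added
    file is still reported if a pattern matches it.
    """
    cmd = ["git", "check-ignore", "-q"]
    if no_index:
        cmd.append("--no-index")
    cmd.append(path)
    result = subprocess.run(cmd, cwd=repo_dir)
    # Exit code 0: ignored, 1: not ignored, anything else: an error.
    if result.returncode not in (0, 1):
        raise RuntimeError(f"git check-ignore failed with code {result.returncode}")
    return result.returncode == 0
```

If the two calls disagree for a path, the file matches an ignore pattern but is already tracked, which is the case described above.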
To make DVC work properly here, your .gitignore should contain:
/data/**
!/data/**/
!/data/**/*.dvc
Thanks for your quick response.
That .gitignore worked.
Thanks a lot.
However, I didn't add the files to git with the --force option.
I ran git check-ignore --no-index data/data1/file1.dvc and it gives no output.
$ cat .gitignore
data/**
!data/*/
!*.dvc
$ git check-ignore -v data/data1/a.dvc
.gitignore:3:!*.dvc data/data1/a.dvc
And for dulwich
$ dulwich check-ignore data/data1
data/data1
$ dulwich check-ignore data/data1/file1.dvc
$
I think they both work correctly for this .gitignore file
If I have understood correctly, .dvc files are ignored in neither my format nor dulwich's format, but I don't get why dvc does not see the .dvc files with my format.
With the dulwich format I get this:

The folder is not ignored.
There is a problem with the dulwich format: file1 is not ignored by git.

I want all files in the data folder to be ignored by git, except .dvc files.
There is a problem with dulwich format which the file1 is not ignored by git.
I want all files in data folder to be ignored by git, except .dvc files.
It is ignored on my computer:
$ git check-ignore -v data/data1/a.dvc
.gitignore:3:!/data/**/*.dvc data/data1/a.dvc
$ git check-ignore -v data/data1/a
.gitignore:1:/data/** data/data1/a
Could you please share your .gitignore contents? Judging from your picture, it has at least 6 lines.
Yes, I have both formats, with mine commented out :)

Can it be a problem with my git version?
Mine: git version 2.34.0
Quite weird here.
Let me try upgrading to a newer version.
I guess it is a bug in Git.

I didn't see anything related to the gitignore algorithm in either 2.34.0 or 2.33.1.
Not yet. However, I've just sent an email to the Git mailing list describing the issue.
Here is the thread.
Sounds like the previous behavior was actually a bug, and it was fixed in some recent release?
Whether it was a bug or a bug fix, some commits were reverted and a test case was added :)
By the way, I think there is a separate issue with dvc.
In our test case, if we re-include the 'data1' directory with !data/*/, dvc ignores .dvc files inside data1; but if it is done with !data/**/, dvc behaves as expected.
In either case, .dvc files inside the data1 directory are not ignored by git, and the check-ignore output is as follows:
$ git check-ignore -v data/data1/file1.dvc
.gitignore:3:!/data/**/*.dvc data/data1/file1.dvc
I used another git version, 2.17.1.
Sorry for the late reply.
in our testcase, if we re-include 'data1' directory by !data/*/, dvc ignores .dvc files inside data1
I added some debug code to dvc and tried two examples. In dvc push:
$ dvc push
/Users/gao/Code/test/ignore/.dvc/config.local ignore status is True
/Users/gao/Code/test/ignore/.dvc/tmp ignore status is True
/Users/gao/Code/test/ignore/.dvc/cache ignore status is True
/Users/gao/Code/test/ignore/data/ ignore status is True
Everything is up to date.
While in dvc add data/data2/b:
$ dvc add data/data2/b
/Users/gao/Code/test/ignore/.dvc/config.local ignore status is True
/Users/gao/Code/test/ignore/.dvc/tmp ignore status is True
/Users/gao/Code/test/ignore/.dvc/cache ignore status is True
/Users/gao/Code/test/ignore/data/data2/b.dvc ignore status is False
/Users/gao/Code/test/ignore/data/ ignore status is True
Adding... /Users/gao/Code/test/ignore/data/data2/b ignore status is True
/Users/gao/Code/test/ignore/data/data2/b.dvc ignore status is False
100% Adding...|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:00, 31.69file/s]
To track the changes with git, run:
git add data/data2/b.dvc
To enable auto staging, run:
dvc config core.autostage true
And if we change !data/*/ to !data/**/:
$ dvc push
/Users/gao/Code/test/ignore/.dvc/config.local ignore status is True
/Users/gao/Code/test/ignore/.dvc/tmp ignore status is True
/Users/gao/Code/test/ignore/.dvc/cache ignore status is True
/Users/gao/Code/test/ignore/data/ ignore status is False
/Users/gao/Code/test/ignore/data/data1/ ignore status is False
/Users/gao/Code/test/ignore/data/data2/ ignore status is False
/Users/gao/Code/test/ignore/data/data1/c.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data1/c.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data1/c.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data1/a.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data1/a.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data1/a.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data2/b.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data2/b.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data2/b.dvc ignore status is False
3 files pushed
So I guess there are two problems:
- Our backend, Dulwich, gives different results for the two patterns:
# with `!data/*/`
$ dulwich check-ignore data/
data/
# with `!data/**/`
$ dulwich check-ignore data/
While for Git:
# with `!data/*/`
$ git check-ignore data/
$
# with `!data/**/`
$ git check-ignore data/
$
They give the same result.
- DVC uses different logic in different commands (add works properly while push and commit do not)
And for the logic of gitignore, the following from the thread is quite clear, I think:
- Git opens and reads the working tree directory. For each file or directory that is actually present here, Git checks it against the ignore rules. Some rules match only directories and others match both directories and files. Some rules say "do ignore" and some say "do not ignore".
- The last applicable rule wins.
- If this is a file and the file is ignored, it's ignored. Unless, that is, it's in the index already, because then it's tracked and can't be ignored.
- If this is a directory and the directory is ignored, it's not even opened and read. It's not in the index because directories are never in the index (at least nominally). If it is opened and read, the entire set of rules here apply recursively.
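The rules quoted above can be sketched as a small Python model. This is a simplified illustration under stated assumptions, not Git's or Dulwich's implementation: it handles only *, ** and literal characters, reads a single .gitignore at the repository root, and does not model the index exception for tracked files. The names compile_rule, last_match and is_ignored are hypothetical.

```python
import re


def compile_rule(line):
    """Turn one .gitignore line into (regex, negated, dir_only)."""
    negated = line.startswith("!")
    if negated:
        line = line[1:]
    dir_only = line.endswith("/")
    if dir_only:
        line = line[:-1]
    # A slash at the start or in the middle anchors the pattern to the root.
    anchored = "/" in line
    line = line.lstrip("/")
    regex, i = "", 0
    while i < len(line):
        if line.startswith("**/", i):
            regex += "(?:.*/)?"      # zero or more leading directories
            i += 3
        elif line.startswith("**", i):
            regex += ".*"            # anything, slashes included
            i += 2
        elif line[i] == "*":
            regex += "[^/]*"         # anything within one path component
            i += 1
        else:
            regex += re.escape(line[i])
            i += 1
    if not anchored:
        regex = "(?:.*/)?" + regex   # slash-free patterns match at any depth
    return re.compile(regex), negated, dir_only


def last_match(rules, path, is_dir):
    """Return True (ignore), False (re-include), or None (no rule matched)."""
    verdict = None
    for regex, negated, dir_only in rules:
        if dir_only and not is_dir:
            continue
        if regex.fullmatch(path):
            verdict = not negated    # a later match overrides earlier ones
    return verdict


def is_ignored(rules, path, is_dir=False):
    """Walk down from the root; an ignored directory hides everything below it."""
    parts = path.strip("/").split("/")
    for depth in range(1, len(parts) + 1):
        prefix = "/".join(parts[:depth])
        prefix_is_dir = is_dir or depth < len(parts)
        if last_match(rules, prefix, prefix_is_dir):
            return True
    return False


original = [compile_rule(p) for p in ["data/**", "!data/*/", "!*.dvc"]]
suggested = [compile_rule(p) for p in ["/data/**", "!/data/**/", "!/data/**/*.dvc"]]

for name, rules in [("original", original), ("suggested", suggested)]:
    for path in ["data/data1/file1", "data/data1/file1.dvc"]:
        print(name, path, is_ignored(rules, path))
```

In this model both .gitignore variants ignore data/data1/file1 and keep data/data1/file1.dvc. They differ only for deeper nesting, where !data/*/ re-includes one directory level and !data/**/ re-includes every level.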
I hit the same issue, and it was painful to find out why git was fine with my patterns and dvc was not. I think it is very common to have data in a folder that is entirely gitignored, so correctly handling negation in subfolders, to be able to track the .dvc files, should be a priority.
@pmrowla Do we need to open a dulwich issue for this?
I think I am encountering the same bug. Here is what I have:
I have a global .gitignore file in my home directory with:
*.lock
In my dvc managed repo I have:
!dvc.lock
Git behaves as expected (it does not ignore the file), but dvc treats it as ignored...
I will post additional info on: https://github.com/jelmer/dulwich/issues/1203
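The behavior expected from Git in this case follows from its documented precedence: patterns in the repository's .gitignore take priority over the global excludes file. A minimal Python sketch of that ordering, assuming only slash-free patterns matched against the file name (decide is a hypothetical name, and this is not how Git or Dulwich is implemented):

```python
import fnmatch
import posixpath


def decide(sources, path):
    """Return True if `path` is ignored.

    `sources` is a list of pattern lists ordered from highest to lowest
    precedence (for example repo .gitignore first, global excludes last).
    Only slash-free patterns are supported in this sketch.
    """
    name = posixpath.basename(path)
    for patterns in sources:
        verdict = None
        for pattern in patterns:
            negated = pattern.startswith("!")
            glob = pattern[1:] if negated else pattern
            if fnmatch.fnmatchcase(name, glob):
                verdict = not negated  # last match within a source wins
        if verdict is not None:
            return verdict             # first source with a match decides
    return False


repo_rules = ["!dvc.lock"]   # from the repository's .gitignore
global_rules = ["*.lock"]    # from the global excludes file

print(decide([repo_rules, global_rules], "dvc.lock"))
print(decide([repo_rules, global_rules], "other.lock"))
```

Here dvc.lock is re-included by the repository rule before the global *.lock pattern is consulted, while any other .lock file still falls through to the global rule.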
https://github.com/jelmer/dulwich/issues/1203 has been fixed in the latest release (https://github.com/jelmer/dulwich/releases/tag/dulwich-0.23.0).