dvc and git do not behave the same with "!" and "**"

Open Danial-Alh opened this issue 4 years ago • 19 comments

Consider the following project structure

  • data
    • data1
      • file1
      • file1.dvc
    • data2
      • file2
      • file2.dvc
  • .gitignore

.gitignore is as follows:

data/**
!data/*/
!*.dvc

git status gives: image

while dvc push gives: image

I expect git and dvc to behave the same with .gitignore.

  • dvc: 2.8.3
  • python: 3.7

Danial-Alh avatar Nov 18 '21 09:11 Danial-Alh

Your .dvc files are actually still ignored by those .gitignore patterns; the difference here is that you have explicitly staged them in git (presumably with git add -f). You can verify this yourself by running

git check-ignore --no-index data/data1/file1.dvc

DVC only checks the actual .gitignore patterns, and does not currently check if ignored files have been explicitly (force) staged in git (see https://github.com/iterative/dvc/issues/6291)

To make DVC work properly here, your .gitignore should contain:

/data/**
!/data/**/
!/data/**/*.dvc

pmrowla avatar Nov 18 '21 09:11 pmrowla

Thanks for your quick response.

That .gitignore worked. Thanks a lot.

However, I didn't add the files to git with --force option. I checked the git check-ignore --no-index data/data1/file1.dvc. It gives no output.

Danial-Alh avatar Nov 18 '21 09:11 Danial-Alh

$ cat .gitignore
data/**
!data/*/
!*.dvc
$ git check-ignore -v data/data1/a.dvc
.gitignore:3:!*.dvc	data/data1/a.dvc

And for dulwich

$ dulwich check-ignore data/data1
data/data1
$ dulwich check-ignore data/data1/file1.dvc
$

I think they both work correctly for this .gitignore file

karajan1001 avatar Nov 18 '21 10:11 karajan1001

If I have understood correctly, .dvc files are ignored with neither my format nor dulwich's; but I don't get why dvc does not see .dvc files with my format.

Danial-Alh avatar Nov 18 '21 10:11 Danial-Alh

for dulwich format I get this:

image

the folder is not ignored.

Danial-Alh avatar Nov 18 '21 10:11 Danial-Alh

There is a problem with the dulwich format: file1 is not ignored by git.

image

I want all files in the data folder to be ignored by git, except .dvc files.

Danial-Alh avatar Nov 18 '21 10:11 Danial-Alh

Quoting Danial-Alh: "There is a problem with the dulwich format: file1 is not ignored by git. I want all files in the data folder to be ignored by git, except .dvc files."

It is ignored on my computer:

$ git check-ignore -v data/data1/a.dvc
.gitignore:3:!/data/**/*.dvc	data/data1/a.dvc
$ git check-ignore -v data/data1/a
.gitignore:1:/data/**	data/data1/a

Could you please share your .gitignore contents? Judging from your picture, it has at least 6 lines.

karajan1001 avatar Nov 18 '21 11:11 karajan1001

Yes, I have both formats in the file, with mine commented out :)

image

Danial-Alh avatar Nov 18 '21 11:11 Danial-Alh

Could it be a problem with my git version? Mine is git version 2.34.0.

Danial-Alh avatar Nov 18 '21 11:11 Danial-Alh

Quite weird here. image Let me try upgrading to a newer version.

karajan1001 avatar Nov 18 '21 11:11 karajan1001

I guess it is a bug in Git.

image

I didn't see anything related to the gitignore algorithm in either 2.34.0 or 2.33.1.

karajan1001 avatar Nov 18 '21 12:11 karajan1001

Not yet. However, I've just sent an email to git mailing list, describing the issue.

Danial-Alh avatar Nov 18 '21 13:11 Danial-Alh

Here is the thread.

Danial-Alh avatar Nov 18 '21 17:11 Danial-Alh

Sounds like the previous behavior was actually a bug, and has been fixed in some recent release?

karajan1001 avatar Nov 19 '21 07:11 karajan1001

Whether it was a bug or a bug fix, some commits were reverted and a test case was added :)

By the way, I think there is a separate issue with dvc.

In our test case, if we re-include the 'data1' directory with !data/*/, dvc ignores the .dvc files inside data1; but if we re-include it with !data/**/, dvc behaves as expected.

In either case, the .dvc files inside the data1 directory are not ignored by git, and the check-ignore output is as follows:

$ git check-ignore -v data/data1/file1.dvc
.gitignore:3:!/data/**/*.dvc   data/data1/file1.dvc

I used another git version, 2.17.1.

Danial-Alh avatar Nov 19 '21 17:11 Danial-Alh

Sorry for the late reply.

Quoting Danial-Alh: "in our testcase, if we re-include 'data1' directory by !data/*/, dvc ignores .dvc files inside data1"

I added some debug code to dvc and tried two examples. In dvc push:

$ dvc push
/Users/gao/Code/test/ignore/.dvc/config.local ignore status is True
/Users/gao/Code/test/ignore/.dvc/tmp ignore status is True
/Users/gao/Code/test/ignore/.dvc/cache ignore status is True
/Users/gao/Code/test/ignore/data/ ignore status is True
Everything is up to date.

While in dvc add data/data2/b

$ dvc add data/data2/b
/Users/gao/Code/test/ignore/.dvc/config.local ignore status is True
/Users/gao/Code/test/ignore/.dvc/tmp ignore status is True
/Users/gao/Code/test/ignore/.dvc/cache ignore status is True
/Users/gao/Code/test/ignore/data/data2/b.dvc ignore status is False
/Users/gao/Code/test/ignore/data/ ignore status is True
Adding...
/Users/gao/Code/test/ignore/data/data2/b ignore status is True
/Users/gao/Code/test/ignore/data/data2/b.dvc ignore status is False
100% Adding...|████████████████████|1/1 [00:00, 31.69file/s]

To track the changes with git, run:

	git add data/data2/b.dvc

To enable auto staging, run:

	dvc config core.autostage true

And if we change !data/*/ to !data/**/

$ dvc push
/Users/gao/Code/test/ignore/.dvc/config.local ignore status is True
/Users/gao/Code/test/ignore/.dvc/tmp ignore status is True
/Users/gao/Code/test/ignore/.dvc/cache ignore status is True
/Users/gao/Code/test/ignore/data/ ignore status is False
/Users/gao/Code/test/ignore/data/data1/ ignore status is False
/Users/gao/Code/test/ignore/data/data2/ ignore status is False
/Users/gao/Code/test/ignore/data/data1/c.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data1/c.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data1/c.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data1/a.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data1/a.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data1/a.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data2/b.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data2/b.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data2/b.dvc ignore status is False
3 files pushed

So I guess there are two problems:

  1. Our backend, Dulwich, gives different results for the two patterns:
# with `!data/*/`
$ dulwich check-ignore data/
data/
# with `!data/**/`
$ dulwich check-ignore data/

While for Git:

# with `!data/*/`
$ git check-ignore data/
$
# with `!data/**/`
$ git check-ignore data/
$

They give the same result.

  2. DVC uses different logic in different commands (add works properly, while push and commit do not)

As for the logic of gitignore, I think the following explanation from the mailing list thread is quite clear:

  • Git opens and reads the working tree directory. For each file or directory that is actually present here, Git checks it against the ignore rules. Some rules match only directories and others match both directories and files. Some rules say "do ignore" and some say "do not ignore".

  • The last applicable rule wins.

  • If this is a file and the file is ignored, it's ignored. Unless, that is, it's in the index already, because then it's tracked and can't be ignored.

  • If this is a directory and the directory is ignored, it's not even opened and read. It's not in the index because directories are never in the index (at least nominally). If it is opened and read, the entire set of rules here apply recursively.
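The four rules above can be sketched as a small Python model. This is only an illustration under simplifying assumptions, not Git's, dulwich's, or DVC's actual implementation: the matcher handles just `*`, `?`, `**`, a leading `/`, a trailing `/`, and `!`, it uses a single top-level .gitignore, and it ignores the index entirely. The names `Rule`, `parse_rule`, `is_ignored`, `walk`, and `visible` are hypothetical.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    regex: re.Pattern
    negated: bool   # pattern started with "!"
    dir_only: bool  # pattern ended with "/"


def parse_rule(line: str) -> Rule:
    """Translate one simplified gitignore pattern into a regex-based Rule."""
    negated = line.startswith("!")
    body = line[1:] if negated else line
    dir_only = body.endswith("/")
    body = body.rstrip("/")
    # A slash at the start or in the middle anchors the pattern to the root.
    anchored = "/" in body
    parts = body.lstrip("/").split("/")
    regex = "" if anchored else "(?:.*/)?"
    for i, seg in enumerate(parts):
        last = i == len(parts) - 1
        if seg == "**":
            # Trailing "**" matches everything inside; otherwise zero or more dirs.
            regex += ".+" if last else "(?:[^/]+/)*"
        else:
            seg_re = "".join(
                "[^/]*" if ch == "*" else "[^/]" if ch == "?" else re.escape(ch)
                for ch in seg
            )
            regex += seg_re + ("" if last else "/")
    return Rule(re.compile(regex), negated, dir_only)


def is_ignored(rules, path: str, is_dir: bool) -> bool:
    """The last applicable rule wins; directory-only rules skip files."""
    verdict = False
    for rule in rules:
        if rule.dir_only and not is_dir:
            continue
        if rule.regex.fullmatch(path):
            verdict = not rule.negated
    return verdict


def walk(tree: dict, rules, prefix: str = "") -> list:
    """Return non-ignored files; an ignored directory is never opened."""
    found = []
    for name, child in sorted(tree.items()):
        path = prefix + name
        is_dir = isinstance(child, dict)
        if is_ignored(rules, path, is_dir):
            continue
        if is_dir:
            found += walk(child, rules, path + "/")
        else:
            found.append(path)
    return found


# In-memory stand-in for the project layout from the issue description.
TREE = {
    ".gitignore": None,
    "data": {
        "data1": {"file1": None, "file1.dvc": None},
        "data2": {"file2": None, "file2.dvc": None},
    },
}


def visible(patterns):
    return walk(TREE, [parse_rule(p) for p in patterns])


print(visible(["data/**", "!data/*/", "!*.dvc"]))
print(visible(["/data/**", "!/data/**/", "!/data/**/*.dvc"]))
print(visible(["data/**", "!*.dvc"]))
```

In this model both pattern sets from the thread leave only .gitignore and the two .dvc files visible, while dropping the directory re-include line hides the .dvc files too, because data1 and data2 are pruned before their contents are ever checked.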

karajan1001 avatar Nov 25 '21 08:11 karajan1001

I hit the same issue, and it was painful to find out why git was fine with my patterns and dvc was not. I think it is very common to have data in a folder that is entirely gitignored, so correctly handling negation in subfolders, to be able to track the .dvc files, should be a priority.

mpizenberg avatar Aug 18 '23 13:08 mpizenberg

@pmrowla Do we need to open a dulwich issue for this?

dberenbaum avatar Aug 18 '23 14:08 dberenbaum

I think I am encountering the same bug. Here is what I have:

I have a global .gitignore file in my home directory with

*.lock

In my dvc managed repo I have:

!dvc.lock

Git behaves as expected (it does not ignore the file), but dvc treats it as ignored...

I will post add'l info on: https://github.com/jelmer/dulwich/issues/1203
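A minimal Python sketch of the behavior expected from Git here, assuming the documented precedence in which a repository .gitignore outranks the global excludes file and the last matching pattern decides. It matches bare file names only, with the standard library's `fnmatchcase` standing in for real gitignore matching; `is_name_ignored` and the two pattern lists are hypothetical names used for illustration.

```python
from fnmatch import fnmatchcase


def is_name_ignored(name: str, sources) -> bool:
    """Apply pattern lists ordered from lowest to highest precedence.

    Every later match overrides an earlier one, so the final verdict
    comes from the last matching pattern in the highest-ranked source.
    """
    ignored = False
    for patterns in sources:
        for pattern in patterns:
            negated = pattern.startswith("!")
            glob = pattern[1:] if negated else pattern
            if fnmatchcase(name, glob):
                ignored = not negated
    return ignored


GLOBAL_EXCLUDES = ["*.lock"]    # global .gitignore in the home directory
REPO_GITIGNORE = ["!dvc.lock"]  # .gitignore inside the dvc-managed repo

print(is_name_ignored("dvc.lock", [GLOBAL_EXCLUDES, REPO_GITIGNORE]))
print(is_name_ignored("poetry.lock", [GLOBAL_EXCLUDES, REPO_GITIGNORE]))
```

Under these assumptions dvc.lock is re-included by the repository pattern while other *.lock files stay ignored, which matches what Git is reported to do above.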

ptMcGit avatar May 29 '25 19:05 ptMcGit

https://github.com/jelmer/dulwich/issues/1203 has been fixed in the latest release (https://github.com/jelmer/dulwich/releases/tag/dulwich-0.23.0).

skshetry avatar Aug 02 '25 04:08 skshetry