Incorrectly (?) matches files inside directories that do match
Hello @cpburnz, I've looked existing issues but couldn't find this one. It was initially reported in yamllint issues: https://github.com/adrienverge/yamllint/issues/334 (yamllint uses pathspec).
Considering these files:
file.yaml
dir/
└── dir/file.sql
dir.yaml/
└── dir.yaml/file.sql ← problem here
pathspec.PathSpec.from_lines('gitwildmatch', ['*.yaml']) does match file.yaml, dir.yaml, but also dir.yaml/file.sql. The latter isn't what I would expect, because *.yaml feels like "files that end with .yaml", not "any file in any path that contains .yaml".
Is it the expected behavior? If yes, what is the correct pattern for "files that end with .yaml"?
Looking forward to your answer, and thanks for this very useful project :+1:
Hi @adrienverge,
Due to the current implementation it makes sense to me why dir.yaml/* is matching the pattern *.yaml. A dot in a folder name resembling a file extension is valid. There is not currently a way to tag a pattern to only match against files. This is the expected behavior. The most straight forward way to exclude directories with the .yaml extension is to add a negative pattern for it. E.g.,
p = pathspec.PathSpec.from_lines('gitwildmatch', [
'*.yaml',
'!*.yaml/'
])
p.match_file('file.yaml')
# True
p.match_file('dir.yaml/file.sql')
# False
Please let me know whether this works for you. I realize this may or may not work depending on your use case.
Thanks for your answer @cpburnz! To complete it, I notice that Git + .gitignore behaves the same: if it contains a line *.yaml, it will also ignore files like dir.yaml/file.sql.
The ['*.yaml', '!*.yaml/'] pattern looks the right way to go :+1:
However, while it works with Git and .gitignore, it doesn't seem to work with pathspec. More precisely, with the following tree, .gitignore successfully matches dir/file.yaml and dir.yaml/file.yaml, but pathspec doesn't match dir.yaml/file.yaml.
dir
├── index.txt ← already tracked by git
├── file.sql
└── file.yaml
dir.yaml
├── index.txt ← already tracked by git
├── file.sql
└── file.yaml
Said differently:
p = pathspec.PathSpec.from_lines('gitwildmatch', ['*.yaml', '!*.yaml/'])
p.match_file('dir/file.yaml') # True
p.match_file('dir.yaml/file.yaml') # False ← I would expect True (like with .gitignore)
@adrienverge Oops, I overlooked that. This can still be accomplished be adding an addition pattern to reinclude .yaml files under a .yaml directory:
p = pathspec.PathSpec.from_lines('gitwildmatch', ['*.yaml', '!*.yaml/', '*.yaml/**/*.yaml'])
p.match_file('dir/file.yaml') # True
p.match_file('dir.yaml/index.txt') # False
p.match_file('dir.yaml/file.yaml') # True
The *.yaml/**/*.yaml pattern is redundant for Git and does not affect its results. The cause for this difference is pathspec strictly implements Git's wildmatch patterns while Git does additional processing for .gitignore such as as tracking included/excluded directories.
Thanks again @cpburnz!
Unfortunately this recursive pattern including/excluding seems to bring new problems:
-
While the first two patterns match paths (e.g.
dir/dir/dir/file.yaml), the third pattern doesn't work. It seems that I need to add**/at the beginning, so it would be:'*.yaml', '!*.yaml/', '**/*.yaml/**/*.yaml', -
With these new patterns, now
dir.yaml/dir.yaml/index.txtwill be matched. We would then need to add two new patterns:... '!**/*.yaml/**/*.yaml/', '**/*.yaml/**/*.yaml/**/*.yaml',... but we can recurse for
dir.yaml/dir.yaml/dir.yaml/index.txt, so we would need an infinite list of patterns. -
The problems is actually even worse, because we need to match different extensions/patterns. By default it's
*.yaml,*.ymland.yamllintbut users can set any pattern using configuration.Which means we would need to add (for the simple use case with just 3 extensions):
'**/*.yaml/**/*.yaml', '!**/*.yaml/**/*.yaml/', '**/*.yml/**/*.yaml', '!**/*.yml/**/*.yaml/', '**/.yamllint/**/*.yaml', '!**/.yamllint/**/*.yaml/', '**/*.yaml/**/*.yml', '!**/*.yaml/**/*.yml/', '**/*.yml/**/*.yml', '!**/*.yml/**/*.yml/', '**/.yamllint/**/*.yml', '!**/.yamllint/**/*.yml/', '**/*.yaml/**/.yamllint', '!**/*.yaml/**/.yamllint/', '**/*.yml/**/.yamllint', '!**/*.yml/**/.yamllint/', '**/.yamllint/**/.yamllint', '!**/.yamllint/**/.yamllint/', ... + infinite recursive patterns from bullet 2.
The cause for this difference is pathspec strictly implements Git's wildmatch patterns while Git does additional processing for
.gitignoresuch as as tracking included/excluded directories.
So I understand that pathspec implements the standard, while Git adds some user-friendly, non-standard processing.
Given the bullets above, and the fact that the need "match all paths whose filename end with *.yaml or *.yml" cannot be achieved without an infinite pattern list, do you think it would be doable to add an option to pathspec? Would it be easy? Can we help in any way?
If the answer is no, are you aware of another library that could achieve this goal?
(PS: I don't know how others use your useful library pathspec, but my guess is that 99% are looking for something that behaves exactly like .gitignore. If I'm wrong I would be curious to know the other use-cases!)
@adrienverge That is indeed problematic without a good solution for current feature set.
So I understand that pathspec implements the standard, while Git adds some user-friendly, non-standard processing.
Precisely
Given the bullets above, and the fact that the need "match all paths whose filename end with *.yaml or *.yml" cannot be achieved without an infinite pattern list, do you think it would be doable to add an option to pathspec? Would it be easy? Can we help in any way?
It's certainly doable. I think the ideal implementation would be a class implementing the .gitignore specific behavior, possibly a subclass of PathSpec. I welcome a pull request implementing it.
If the answer is no, are you aware of another library that could achieve this goal?
The answer isn't no, but I am aware of another library that tries to specifically implement .gitignore behavior: igittigitt. I have no experience with it myself.
(PS: I don't know how others use your useful library pathspec, but my guess is that 99% are looking for something that behaves exactly like .gitignore. If I'm wrong I would be curious to know the other use-cases!)
That seems to be the case. My original use case was I wanted a library with glob-like file matching similar to git or rsync.
I welcome a pull request implementing it.
Thanks again for following this up @cpburnz :+1:
@liopic @dafyddj @jsok you are the ones affected by this (via https://github.com/adrienverge/yamllint/issues/279 and https://github.com/adrienverge/yamllint/issues/334), would you like to contribute to python-pathspec and implement a .gitignore-like subclass of PathSpec?
If not, I can try to find time myself but I don't know precisely when it could be. I already went forward and found this relevant code for do_match_pathspec() in Git source: https://github.com/git/git/blob/d4a3924/dir.c#L432-L501
@cpburnz if you have any pointers or documentation on this "extra processing" that Git does, it would be more than welcome :)
@adrienverge Interesting to read, but I am not a developer so I don't think I can help with this - but good luck!
Guys this is not a bug. Git works exactly as the spec says:
If there is a separator at the end of the pattern then the pattern will only match directories, otherwise the pattern can match both files and directories.
https://git-scm.com/docs/gitignore
@excitoon I think it's an edge case not adequately described in the gitignore specification. Git behaves exactly as described in this issue, and at the moment pathspec behaves differently. I'm working fixing the discrepancy.
@cpburnz You can check how I solved it in my module.
First, making a group for extra path parts: https://github.com/excitoon/gitignorefile/commit/6763d4772bb06965d774aaab2dde5d82f33855a3#diff-21e52cb6c7e0acafac258d280cac806bcf75715e4207b85d451e0c9e88d9f2b5R305-R306
Second, analyzing if anything matched. If it did not, we shall query filesystem to check if it is a directory: https://github.com/excitoon/gitignorefile/commit/6763d4772bb06965d774aaab2dde5d82f33855a3#diff-21e52cb6c7e0acafac258d280cac806bcf75715e4207b85d451e0c9e88d9f2b5R234
One more example:
core
!core/
There will be a way to handle this scenario in the v0.10.0 release. The new class, GitIgnoreSpec, will need to be used. It will prioritize patterns that directly match the file over an ancestor directory. It can be used as:
# The "gitwildmatch" pattern factory is assumed by GitIgnoreSpec.
p = pathspec.GitIgnoreSpec.from_lines(['*.yaml', '!*.yaml/'])
p.match_file('dir/file.yaml') # True
p.match_file('dir.yaml/file.yaml') # True
Thanks @cpburnz!