reuse-tool
reuse-tool copied to clipboard
Fix symlink handling
This PR implements the heuristics for handling symlinks I've described in more detail in https://github.com/fsfe/reuse-tool/issues/627#issuecomment-1418976389 and https://github.com/fsfe/reuse-tool/issues/627#issuecomment-1422594763.
Basically, this is based on an interpretation of the REUSE specification that:
- a symlink pointing to a covered file is considered to be the same file as the covered file and can therefore be ignored.
- a symlink pointing to a file that is not a covered file is itself considered to be a covered file and should not be ignored, unless the symlink is ignored by other means.
This fixes #627 while not breaking #202 for the described situation, since it would fall into category 1 above.
There is another bug in here when dealing with symlink targets in .git if the target file exists. I erroneously assumed Project._is_path_ignored considers a file inside .git as ignored, which is not the case (it only considers .git/.hg itself as ignored, but no sub-paths). I'll provide additional tests and a fix later.
edit: fixed
@wtraylor Feel free to try this out on a real-world project if you have one available. This was less straight-forward to implement than I thought it would be and there might still be some edge-cases I didn't consider. I tried to implement the symlink handling in a way I thought to be intuitive, but that might also differ from your expectations. So some testing would be great :)
Hi @matrss
This PR looks qualitatively really good. We'll need some time to review it, though.
It will also need a change to the specification. Although this version of the spec is not yet live, the current development version says that symlinks are ignored.
Many thanks for implementing this feature, @matrss !
For the main use case with git-annex it works great: reuse lint correctly complains about annexed files (i.e., symlinks pointing to content in .git) not having license information.
git init repo && cd repo
git annex init
cp ~/bigfile.dat ./
git annex add bigfile.dat # creates a symlink in place of bigfile.dat
reuse lint # fails correctly
reuse annotate --force-dot-license -l CC0-1.0 -c Author bigfile.dat
reuse download --all
reuse lint # passes correctly
The implementation appears to assume that all symlinks require dot-license files. For example:
git init repo && cd repo
git annex init
echo Hello World > README.md
reuse annotate -l CC0-1.0 -c Me README.md
reuse download --all
reuse lint # success
rm -r LICENSES
git annex add README.md
git commit -m "Add annexed README"
reuse lint # fails: no licensing info for README.md
reuse annotate -l CC0-1.0 -c Me README.md # fails: can’t write to README.md
I guess it makes sense to require dot-license files for symlinks because the symlink targets may not be available (i.e., symlink is broken, for example if annexed content is not in the working copy). However, I suggest to make this more transparent to the user—for example with these messages:
reuse annotateon a symlink file: “blablabla is a symbolic link. Please use --force-dot-license.”reuse lintwith a symlink file that doesn’t have a dot-license: “The following symbolic links have no copyright and licensing information in separate dot-license files:”reuse lintwith a symlinked dot-license file: “WARNING: Can’t parse blablabla.license because it’s a symbolic link. Dot-license files should be regular files.”
Point 3 stems from my experience that it can easily happen to annex a file by accident. Then you might end up with an annexed .license file that REUSE doesn’t parse.
Finally, there is the question of how to deal with symlink folders. This does not fall into the use case of git-annex (which only symlinks files). Here’s an example:
mkdir folder
echo Hello World > folder/foo
git init repo && cd repo
ln -s ../folder symlinked_folder
reuse lint # no complaints
I think this behavior is good. It doesn’t make sense for reuse to check files in symlinked folders with content outside of the repository. However, I wonder if it would make sense to make this transparent to the user. For example, reuse lint good give a hint that it ignored symlinked_folder because it’s a symbolic link.