reuse-tool icon indicating copy to clipboard operation
reuse-tool copied to clipboard

Fix symlink handling

Open matrss opened this issue 2 years ago • 5 comments

This PR implements the heuristics for handling symlinks I've described in more detail in https://github.com/fsfe/reuse-tool/issues/627#issuecomment-1418976389 and https://github.com/fsfe/reuse-tool/issues/627#issuecomment-1422594763.

Basically, this is based on an interpretation of the REUSE specification that:

  1. a symlink pointing to a covered file is considered to be the same file as the covered file and can therefore be ignored.
  2. a symlink pointing to a file that is not a covered file is itself considered to be a covered file and should not be ignored, unless the symlink is ignored by other means.

This fixes #627 while not breaking #202 for the described situation, since it would fall into category 1 above.

matrss avatar Jun 19 '23 10:06 matrss

There is another bug in here when dealing with symlink targets in .git if the target file exists. I erroneously assumed Project._is_path_ignored considers a file inside .git as ignored, which is not the case (it only considers .git/.hg itself as ignored, but no sub-paths). I'll provide additional tests and a fix later.

edit: fixed

matrss avatar Jun 20 '23 07:06 matrss

@wtraylor Feel free to try this out on a real-world project if you have one available. This was less straight-forward to implement than I thought it would be and there might still be some edge-cases I didn't consider. I tried to implement the symlink handling in a way I thought to be intuitive, but that might also differ from your expectations. So some testing would be great :)

matrss avatar Jun 22 '23 09:06 matrss

Hi @matrss

This PR looks qualitatively really good. We'll need some time to review it, though.

It will also need a change to the specification. Although this version of the spec is not yet live, the current development version says that symlinks are ignored.

carmenbianca avatar Jun 22 '23 10:06 carmenbianca

Many thanks for implementing this feature, @matrss ! For the main use case with git-annex it works great: reuse lint correctly complains about annexed files (i.e., symlinks pointing to content in .git) not having license information.

git init repo && cd repo
git annex init
cp ~/bigfile.dat ./
git annex add bigfile.dat # creates a symlink in place of bigfile.dat
reuse lint  # fails correctly
reuse annotate --force-dot-license -l CC0-1.0 -c Author bigfile.dat
reuse download --all
reuse lint # passes correctly

The implementation appears to assume that all symlinks require dot-license files. For example:

git init repo && cd repo
git annex init
echo Hello World > README.md
reuse annotate -l CC0-1.0 -c Me README.md
reuse download --all
reuse lint # success
rm -r LICENSES
git annex add README.md
git commit -m "Add annexed README"
reuse lint # fails: no licensing info for README.md
reuse annotate -l CC0-1.0 -c Me README.md # fails: can’t write to README.md

I guess it makes sense to require dot-license files for symlinks because the symlink targets may not be available (i.e., symlink is broken, for example if annexed content is not in the working copy). However, I suggest to make this more transparent to the user—for example with these messages:

  1. reuse annotate on a symlink file: “blablabla is a symbolic link. Please use --force-dot-license.”
  2. reuse lint with a symlink file that doesn’t have a dot-license: “The following symbolic links have no copyright and licensing information in separate dot-license files:”
  3. reuse lint with a symlinked dot-license file: “WARNING: Can’t parse blablabla.license because it’s a symbolic link. Dot-license files should be regular files.”

Point 3 stems from my experience that it can easily happen to annex a file by accident. Then you might end up with an annexed .license file that REUSE doesn’t parse.

Finally, there is the question of how to deal with symlink folders. This does not fall into the use case of git-annex (which only symlinks files). Here’s an example:

mkdir folder
echo Hello World > folder/foo
git init repo && cd repo
ln -s ../folder symlinked_folder
reuse lint # no complaints

I think this behavior is good. It doesn’t make sense for reuse to check files in symlinked folders with content outside of the repository. However, I wonder if it would make sense to make this transparent to the user. For example, reuse lint good give a hint that it ignored symlinked_folder because it’s a symbolic link.

wtraylor avatar Jul 24 '23 14:07 wtraylor