lint is not properly checking the content of the licenses

Open Pablohn26 opened this issue 1 year ago • 3 comments

I leave here the commands to reproduce a bug when using reuse lint. First we create the files, add headers and download the license files:

mkdir reuse_bug
cd reuse_bug
echo "This is a test" > test1.c
echo "This is a test" > test2.c
reuse addheader --license MIT test1.c --copyright test --year 2022
reuse addheader --license GPL-3.0-only test2.c --copyright test --year 2022
reuse download MIT
reuse download GPL-3.0-only
reuse lint

With these steps, reuse lint works properly and reports no problems. However, if we copy the content of one license file over another by running cp LICENSES/GPL-3.0-only.txt LICENSES/MIT.txt, reuse lint still does not report any error, although it should say that the license files are not correct (the content of MIT.txt is not valid because it contains the GPL-3.0-only text).

That is, reuse lint should check that the content of the license files is valid.

Pablohn26, Jul 09 '22 13:07

As it is a missing feature rather than an error in existing code, I labeled it as an 'enhancement'. I agree a check would be better as it would provide stronger guarantees on the outcome of lint.

The main decision for the implementation would be how to check the contents. Currently reuse-tool doesn't come with license texts included, so this is already a limiting factor. I can think of some solutions:

  • Download the licenses of the filenames (MIT for MIT.txt) and check the similarity.
  • Store checksum signatures of all licenses in the project. This would enable offline usage without bloating the source code.
  • Use existing license-checking libraries like ScanCode Toolkit that already have the ability to check license texts. This would seriously bloat the tool but could have other applications for linting like checking the source code for license occurrences.
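To make the checksum option more concrete, here is a minimal sketch. It is not part of reuse-tool: the function names and the `known_digests` table are hypothetical, and the table would have to be generated from the reference license texts and shipped with the tool.

```python
import hashlib
from typing import Mapping


def license_digest(text: str) -> str:
    """Return the SHA-256 hex digest of a license text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def matches_known_digest(
    spdx_id: str, text: str, known_digests: Mapping[str, str]
) -> bool:
    """Check a license text against the digest stored for its SPDX identifier.

    `known_digests` is a hypothetical mapping from SPDX identifier
    (the file name without ".txt") to the expected digest.
    """
    expected = known_digests.get(spdx_id)
    if expected is None:
        # No reference digest available, so the text cannot be verified.
        return False
    return license_digest(text) == expected
```

A plain digest only accepts byte-identical texts, so any edit to the license file (including a filled-in copyright line) would be reported as a mismatch.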

It matters whether the texts must be 1:1 identical to the reference texts or whether some modifications should be allowed, such as filling in the copyright holder instead of leaving a template line like Copyright [yyyy] [name of copyright owner] in the Apache 2.0 license text. Also, on Windows many edited files end up with different line endings, so some sanitizing would be needed in any case.
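The sanitizing step could look roughly like the following sketch. The function name is made up for illustration, and the exact rules (which whitespace to ignore, whether to collapse blank lines) are assumptions that would need to be agreed on.

```python
def normalize_license_text(text: str) -> str:
    """Normalize line endings and trailing whitespace before comparing texts."""
    # Convert Windows (CRLF) and old Mac (CR) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop trailing whitespace on every line.
    lines = [line.rstrip() for line in text.split("\n")]
    # Remove leading/trailing blank lines and end with a single newline.
    return "\n".join(lines).strip() + "\n"
```

Running both the reference text and the file under LICENSES/ through the same function before comparing would at least rule out false alarms caused only by line endings.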

Looking at the current License Files section in the REUSE Specification, there seems to be flexibility to modify the license text, albeit undesirable. Still, reuse-tool could at least give a warning instead of erroring out if modifications are allowed.

nicorikken, Jul 11 '22 05:07

The fundamental problem is that license texts unfortunately do change quite often. E.g. the GPL family is famous for smaller amendments (often undocumented), and of course people wrapping lines and copying from other projects don't make it better. And we're not even speaking about license templates such as MIT and BSD.

At REUSE, we would like to avoid heavy heuristics and the computing power this would need.

There are online tools to do this, e.g. https://tools.spdx.org/app/check_license. So if they had a public and stable API, I could imagine a separate REUSE command that uploads one or all licenses of the projects to these checkers and finds out whether the file name (SPDX tag) matches the closest outcomes.

mxmehl, Jul 26 '22 13:07

A public API would be interesting. I tried the checker myself and the response took 10 seconds. I'm not sure whether that implies heavy load on the server or that rate limiting is going on. I think we'd have to coordinate with the SPDX project if we start adopting their service for this purpose. I would prefer a library based on fuzzy matching or regular expressions, if one exists.
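For illustration, fuzzy matching can be sketched with nothing but Python's standard library (difflib). This is only a rough idea and not an existing reuse-tool feature: `closest_license` and the `references` mapping of SPDX identifier to reference text are hypothetical, and difflib may be too slow for long texts such as the GPL.

```python
import difflib
import re
from typing import List, Mapping, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into words, ignoring punctuation and wrapping."""
    return re.findall(r"\w+", text.lower())


def closest_license(text: str, references: Mapping[str, str]) -> Tuple[str, float]:
    """Return the SPDX identifier of the most similar reference text and its ratio.

    `references` is a hypothetical mapping from SPDX identifier to reference text.
    """
    words = tokenize(text)
    best_id, best_ratio = "", 0.0
    for spdx_id, reference in references.items():
        matcher = difflib.SequenceMatcher(
            None, words, tokenize(reference), autojunk=False
        )
        ratio = matcher.ratio()
        if ratio > best_ratio:
            best_id, best_ratio = spdx_id, ratio
    return best_id, best_ratio
```

For the case reported above, a lint check could warn when the closest identifier for LICENSES/MIT.txt is not MIT, or when the best ratio falls below some threshold that would still need to be chosen.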

nicorikken, Aug 31 '22 07:08