scancode.io
meta issue: Improve Debian package reported license
See these issues for details:
- Improve quality and tracing of license detection in Debian copyright files https://github.com/nexB/scancode-toolkit/issues/2390
- Determine the primary license from a copyright file https://github.com/nexB/debut/issues/8
- Recover parsing from almost machine-readable copyright files https://github.com/nexB/debut/issues/6
- Improve tracing of license detection in package manifests https://github.com/nexB/scancode-toolkit/issues/2389
See also #128
We have many levels of problems:
1. finding the copyright file of a package.
There are cases where the copyright file of a package is a symlink to the copyright file of another package, and we fail to get the copyright file in this case.
For instance in debian-unstable-slim, the directory /usr/share/doc/libstdc++6
is a symlink to the directory /usr/share/doc/gcc-10-base
therefore the copyright file is /usr/share/doc/gcc-10-base/copyright
Short of following this link we cannot access the copyright file: the source package gcc-10
does not have a copyright file and is not installed, so we cannot fall back on the heuristic of using the source package copyright when we cannot find one for the binary.
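A minimal sketch of following such a link when looking up a copyright file in an extracted root filesystem. The function name `find_copyright_file` and its arguments are hypothetical, and the sketch assumes the doc directory symlinks are relative (as in the libstdc++6 example); it is not how scancode.io does it today.

```python
import os


def find_copyright_file(root_dir, package_name):
    """
    Return the resolved path of the copyright file for `package_name`
    under the extracted rootfs `root_dir`, or None if there is none.
    Symlinked doc directories (or copyright files) are followed.
    """
    candidate = os.path.join(
        root_dir, "usr", "share", "doc", package_name, "copyright"
    )
    # realpath resolves every symlink on the way to the file
    resolved = os.path.realpath(candidate)
    real_root = os.path.realpath(root_dir)
    # Assumption: refuse links that escape the extracted rootfs
    if os.path.commonpath([real_root, resolved]) != real_root:
        return None
    return resolved if os.path.isfile(resolved) else None
```

Absolute symlinks would resolve against the host filesystem rather than the rootfs, so they would need to be re-rooted before resolution; the sketch simply rejects them.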
2. dealing with copyright formats
2.1 not machine-readable
We do not partition files that are not machine-readable, and this may impact license detection accuracy. There are several opportunities to improve this, for instance with a heuristic that would split text regions into paragraph-like chunks based on the presence of some typical statements or even license rules such as:
On Debian systems, the complete text of the GNU General Public
License version 2 can be found in /usr/share/common-licenses/GPL-2
.
Also in some almost structured files, we could split on lines starting with "License:" or "Copyright:" or "Copyright notice:" such as: https://metadata.ftp-master.debian.org/changelogs//main/u/unzip/unzip_6.0-23+deb10u2_copyright or https://metadata.ftp-master.debian.org/changelogs//main/e/e2fsprogs/
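The line-based split suggested above could be sketched as follows. The name `split_chunks` and the exact marker list are assumptions for illustration only, not an existing scancode-toolkit function.

```python
import re

# Lines that plausibly start a new paragraph-like chunk in an
# "almost structured" copyright file (assumed marker list).
SECTION_START = re.compile(
    r"^\s*(Copyright notice|Copyright|License)\s*:", re.IGNORECASE
)


def split_chunks(text):
    """Split `text` into chunks, starting a new chunk at each marker line."""
    chunks, current = [], []
    for line in text.splitlines():
        if SECTION_START.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that ended up empty after stripping
    return [chunk for chunk in chunks if chunk]
```

Each chunk could then be sent to license detection on its own instead of running detection on the whole file at once.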
2.2 structured copyright files
When we detect licenses in structured copyright files, we do not correctly handle whether a license is a known common license or not.
Only known common license symbols, as used in the first line of a license declaration, have a meaning. Other symbols (even when they look like an SPDX license id such as BSD-2-Clause) should be interpreted first based on the detection of the license text in the license paragraph they point to. This is not done yet and impacts the quality of detection on the declared licenses.
3. incorrect license simplification
We have incorrect license simplification that is applied on the detected license expressions. We should not apply simplification for now and rather fix it in the license_expression library. See https://github.com/nexB/license-expression/issues/49
4. Inaccurate license detection proper
We have incorrect license detection on multiple levels:
4.1 Incorrect mapping of common debian licenses
We do not have a correct mapping for the known license symbols of common licenses when we are trying to detect a license as an expression. The set of these is limited to the ones found in `/usr/share/common-licenses/`. For instance:
Apache-2.0
Artistic
BSD
GFDL -> GFDL-1.3
GFDL-1.2
GFDL-1.3
GPL -> GPL-3
GPL-1
GPL-2
GPL-3
LGPL -> LGPL-3
LGPL-2
LGPL-2.1
LGPL-3
And also the symbols with a trailing + (NB: Artistic would need to be detected to find what we map it to)
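As a rough illustration, the common-license symbols listed above, including the unversioned aliases and the trailing `+`, could be normalized like this. The function name and the returned `(name, or_later)` tuple are hypothetical; mapping the result to an actual scancode license key is deliberately left out.

```python
# Names found in /usr/share/common-licenses/, as listed above
COMMON_LICENSES = {
    "Apache-2.0", "Artistic", "BSD",
    "GFDL-1.2", "GFDL-1.3",
    "GPL-1", "GPL-2", "GPL-3",
    "LGPL-2", "LGPL-2.1", "LGPL-3",
}

# Unversioned names and the versioned file they point to
ALIASES = {"GFDL": "GFDL-1.3", "GPL": "GPL-3", "LGPL": "LGPL-3"}


def resolve_common_license(symbol):
    """
    Return (common_license_name, or_later) for a known common license
    symbol, or None when the symbol is not a common license and must be
    resolved from its license paragraph instead.
    """
    name = symbol.strip()
    or_later = name.endswith("+")
    if or_later:
        name = name[:-1]
    name = ALIASES.get(name, name)
    if name not in COMMON_LICENSES:
        return None
    return name, or_later
```

For example, `resolve_common_license("GPL-2+")` gives `("GPL-2", True)`, while a symbol such as `BSD-2-Clause` gives `None` and falls through to text-based detection.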
4.2 we do not detect correctly some license expression syntax from the declared license
For instance, this weird "academic free license >= 2.1, modified bsd license", where using a mapping in debian_licenses.txt may be the only way out.
Though we may be able to apply heuristics where we could replace a comma by " AND " before parsing a license declaration line as an expression.
Because of 4.2 and 4.3 we return way too many unknown licenses
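The comma heuristic from 4.2 could be sketched as below. The helper name `commas_to_expression` is hypothetical. It treats a bare comma as AND, keeps an explicit "and"/"or" that follows a comma, and parenthesizes each part so the grouping implied by the comma is preserved. The output is only a cleaned-up string; it would still need to be parsed by an expression parser afterwards.

```python
import re

# A comma, optionally followed by an explicit "and"/"or" keyword
COMMA = re.compile(r"\s*,\s*(?:(and|or)\s+)?", re.IGNORECASE)


def commas_to_expression(declared):
    """Rewrite commas in a declared license line as explicit operators."""
    tokens = COMMA.split(declared.strip())
    # re.split with one group yields: part, operator, part, operator, ...
    parts, operators = tokens[0::2], tokens[1::2]
    if len(parts) == 1:
        return parts[0]
    expression = f"({parts[0]})"
    for operator, part in zip(operators, parts[1:]):
        if not part:
            continue  # ignore a trailing comma
        expression += f" {(operator or 'and').upper()} ({part})"
    return expression
```

With this sketch, "academic free license >= 2.1, modified bsd license" becomes "(academic free license >= 2.1) AND (modified bsd license)".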
4.3 we are missing license detection rules to detect accurately the licenses
This is a matter of adding new license rules.
5. Diagnosing detection errors is hard
We cannot easily diagnose and fix license detection issues because the details of the detection are not returned. For instance we cannot easily use scancode-analyzer to help spot and fix issues.
See also https://github.com/nexB/aboutcode/wiki/Project-Ideas-Improve-Debian-package-license-detection
From https://github.com/nexB/scancode-toolkit/pull/2518, detailing the improvements made in each level of the problem.
On the specific issue reported in https://github.com/nexB/scancode.io/issues/128, we have the unstructured copyright file of the gcc-10-base debian package.
The updated debian copyright system has a complete overhaul of the license detection and fixing of certain bugs which made possible the improvement here:
Before Changes vs After Changes
There are still minor inaccuracies here, which are being fixed.
On the progress made in the specific levels of issues discussed in this comment above:
In 2. Dealing with copyright formats:
2.1 Not machine-readable copyright files:
Status: Some critical bugs were fixed, and the file is now sent directly into scancode license detection as a whole, which gives much better results. WIP: Break the file into text parts using the common paragraph separators seen in debian copyright files, for better detection.
2.2 structured Copyright Files
Status: Mostly done, now working on handling rare cases by running tests on a dataset of collected copyright samples from Debian (320K files from 2019-11) and Ubuntu (200K files from 2020-06).
In debian copyright files, there are license paragraphs with a license name after License: followed by the license text. Sometimes there are license texts in the file paragraphs too, and there are also the common debian licenses.
These licenses are then referenced in Files and Header paragraphs in license-expression-like strings that refer to the license texts by name. We now fully parse these names and resolve the references to the license texts (instead of relying on a hand-crafted mapping), and even resolve unparsable expressions when these are also present as names of license texts.
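To make the name resolution concrete, here is a simplified sketch, assuming paragraphs are separated by blank lines and that a standalone license paragraph starts with a License: field followed by the license text. The function name `index_license_paragraphs` is hypothetical; the actual parsing lives in debian-inspector and scancode-toolkit and is more involved.

```python
import re


def index_license_paragraphs(copyright_text):
    """
    Return a mapping of license name -> license text built from the
    standalone license paragraphs of a structured copyright file.
    """
    index = {}
    paragraphs = re.split(r"\n\s*\n", copyright_text.strip())
    for paragraph in paragraphs:
        first_line, _, rest = paragraph.partition("\n")
        # Only paragraphs that *start* with License: define a license text;
        # Files paragraphs merely reference a name.
        if first_line.lower().startswith("license:") and rest.strip():
            name = first_line.split(":", 1)[1].strip()
            index[name] = rest
    return index
```

A name found in a Files paragraph can then be looked up in this index and the license detected from the referenced text, rather than guessed from the name alone.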
Filters were also added when reporting license detections, to summarize detections based on the Primary License Paragraph and the Debian paragraphs, and to return only unique license detections. The option to simplify will be added after https://github.com/nexB/license-expression/pull/53/ is merged and released.
This significantly improves license detection in structured copyright files.
In 3. incorrect license simplification
Status: This is fixed at license-expression, in the process of being merged.
In 4. Inaccurate License Detection proper
4.1 Common Licenses present in /usr/share/common-licenses/
Status: These are now handled correctly.
4.2 we do not detect correctly some license expression syntax from the declared license
Status: We now can parse the debian license expressions correctly, with cleaning and some specialized parsing of commas, according to the debian guidelines.
Previously, debian_licenses.txt held a mapping from every debian license expression seen after License: to the corresponding license expression.
Now, instead of using a mapping, these are handled by cleaning up the symbols that aren't supported by nexB/license-expression and then parsing the result as a proper license expression.
4.3 we are missing license detection rules to detect accurately the licenses
Status: WIP, this has been made possible by making the license detection diagnosable [in 5. Diagnosing License Detection Problems ]
New rules are added for common license detections, more rules are being added based on the added debian test files.
Then even more rules can be added by running nexB/scancode-analyzer on more debian copyright files.
In 5. Diagnosing License Detection Problems:
Status: License detections are now fully diagnosable.
Previously, license detection in a debian copyright file output only a license-expression string carrying all the detections, so it was hard to diagnose license detection problems. Now the license and copyright detection function returns a DebianDetector object with a list of LicenseDetection objects, which hold the original LicenseMatch objects created by scancode license detection. This makes it possible to diagnose the root cause of license detection issues, and to plug the results of license detection in debian copyright files directly into https://github.com/nexB/scancode-analyzer for unique issue detection.
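The shape of such a result could look roughly like the following. This is an illustrative sketch only: the class names come from the description above, but every field and the `low_score_matches` helper are hypothetical and do not reflect the actual scancode-toolkit classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LicenseMatch:
    # Hypothetical fields: which rule matched, what it detected, how well
    rule_identifier: str
    license_expression: str
    score: float
    matched_text: str


@dataclass
class LicenseDetection:
    # Hypothetical: one detection per copyright paragraph
    paragraph: str
    license_expression: str
    matches: List[LicenseMatch] = field(default_factory=list)


@dataclass
class DebianDetector:
    detections: List[LicenseDetection] = field(default_factory=list)

    def low_score_matches(self, threshold=90.0):
        """Return matches scoring below `threshold`, for diagnosis."""
        return [
            match
            for detection in self.detections
            for match in detection.matches
            if match.score < threshold
        ]
```

Keeping the individual matches around, rather than a single expression string, is what lets a tool trace a wrong result back to the rule and text that produced it.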
In 1. Bug in symlinks
Status: This is yet to be fixed.
@AyanSinhaMahapatra @pombredanne anything else coming on this one or are we ready to close?
@AyanSinhaMahapatra @pombredanne gentle ping, what's the latest status on this one?
The PRs were
- https://github.com/nexB/scancode-toolkit/pull/2518
- https://github.com/nexB/scancode-toolkit/pull/2558
- https://github.com/nexB/debian-inspector/pull/22
- Some commits in https://github.com/nexB/scancode-toolkit/pull/2667
There are two sub-issues remaining:
- correctly handling symlinks, as debian copyright files are often symlinked, as elaborated in 1. above. This could be tracked separately; I can open an issue for that.
- simplifying license expressions with AND, which is tracked separately here: https://github.com/nexB/license-expression/issues/67
And @pombredanne opened some more relatively minor ones, I have these on my to-do list:
- https://github.com/nexB/scancode-toolkit/issues/2646
- https://github.com/nexB/scancode-toolkit/issues/2645
- https://github.com/nexB/scancode-toolkit/issues/2644
- https://github.com/nexB/scancode-toolkit/issues/2643
- https://github.com/nexB/scancode-toolkit/issues/2642
As these are tracked separately, and the major Debian issues tracked here were resolved and detection improved significantly, this meta issue could be closed.