scancode-toolkit
Use libxml2 as a test for the new "license detection"
https://gitlab.gnome.org/GNOME/libxml2/-/archive/v2.7.8/libxml2-v2.7.8.tar.gz is a good test: it is a common library, and it has many files with the notice "See Copyright for the status of this software.", which is matched to the rule https://github.com/nexB/scancode-toolkit/blob/6007301cca4eff424202abf197467ee2f004e139/src/licensedcode/data/rules/unknown-license-reference_30.yml
license_expression: unknown-license-reference
is_license_reference: yes
relevance: 100
referenced_filenames:
- Copyright
This is a file reference to https://gitlab.gnome.org/GNOME/libxml2/-/blob/master/Copyright, whose text is the x11-xconsortium-veillard license: https://scancode-licensedb.aboutcode.org/x11-xconsortium-veillard.html
This is also an oddball: it declares itself as the MIT license, but the text is closely related to, and not the same as, the standard MIT license.
I tested this, and we successfully resolve these unknown-license-reference matches to x11-xconsortium-veillard in all cases. The reference is resolved by checking the root of the scan for the referenced filename.
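The lookup described above can be sketched roughly as follows. This is only an illustration, not the actual scancode-toolkit code: `resolve_reference`, `files_by_path` and `scan_root` are hypothetical names, and the file entries are assumed to be plain dicts shaped like the JSON results.

```python
from pathlib import PurePosixPath


def resolve_reference(file_entry, files_by_path, scan_root):
    """
    Return the license expression of the first referenced file found at the
    scan root, or None if no reference can be resolved.

    Hypothetical helper: `files_by_path` maps a scan path to its file entry.
    """
    for detection in file_entry.get("license_detections", []):
        for match in detection.get("matches", []):
            for name in match.get("referenced_filenames", []):
                # Only the scan root is checked, as in the approach above.
                candidate = str(PurePosixPath(scan_root) / name)
                target = files_by_path.get(candidate)
                if target and target.get("detected_license_expression"):
                    return target["detected_license_expression"]
    return None
```

With a `Copyright` entry at the scan root whose expression is `x11-xconsortium-veillard`, a file that only carries the "See Copyright ..." notice would get that expression back.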
See results for this one file:
{
"path": "libxml2-v2.7.8/c14n.c",
"type": "file",
"detected_license_expression": "x11-xconsortium-veillard",
"detected_license_expression_spdx": "LicenseRef-scancode-x11-xconsortium-veillard",
"license_detections": [
{
"license_expression": "x11-xconsortium-veillard",
"detection_rules": [
"unknown-reference-to-local-file"
],
"matches": [
{
"score": 100.0,
"start_line": 8,
"end_line": 8,
"matched_length": 8,
"match_coverage": 100.0,
"matcher": "2-aho",
"license_expression": "unknown-license-reference",
"rule_identifier": "unknown-license-reference_30.RULE",
"referenced_filenames": [
"Copyright"
],
"is_license_text": false,
"is_license_notice": false,
"is_license_reference": true,
"is_license_tag": false,
"is_license_intro": false,
"rule_length": 8,
"rule_relevance": 100,
"matched_text": " * See Copyright for the status of this software.",
"licenses": [
{
"key": "unknown-license-reference",
"name": "Unknown License file reference",
"short_name": "Unknown License reference",
"category": "Unstated License",
"is_exception": false,
"is_unknown": true,
"owner": "Unspecified",
"homepage_url": null,
"text_url": "",
"reference_url": "https://scancode-licensedb.aboutcode.org/unknown-license-reference",
"scancode_text_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.LICENSE",
"scancode_data_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.yml",
"spdx_license_key": "LicenseRef-scancode-unknown-license-reference",
"spdx_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.LICENSE"
}
]
},
{
"score": 100.0,
"start_line": 7,
"end_line": 26,
"matched_length": 199,
"match_coverage": 100.0,
"matcher": "2-aho",
"license_expression": "x11-xconsortium-veillard",
"rule_identifier": "x11-xconsortium-veillard.LICENSE",
"referenced_filenames": [],
"is_license_text": true,
"is_license_notice": false,
"is_license_reference": false,
"is_license_tag": false,
"is_license_intro": false,
"rule_length": 199,
"rule_relevance": 100,
"matched_text": "Permission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is fur-\nnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in\nall copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FIT-\nNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nDANIEL VEILLARD BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER\nIN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CON-\nNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n\nExcept as contained in this notice, the name of Daniel Veillard shall not\nbe used in advertising or otherwise to promote the sale, use or other deal-\nings in this Software without prior written authorization from him.",
"licenses": [
{
"key": "x11-xconsortium-veillard",
"name": "X11-Style (X Consortium Veillard)",
"short_name": "X11-Style (X Consortium Veillard)",
"category": "Permissive",
"is_exception": false,
"is_unknown": false,
"owner": "Daniel Veillard",
"homepage_url": null,
"text_url": "",
"reference_url": "https://scancode-licensedb.aboutcode.org/x11-xconsortium-veillard",
"scancode_text_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/x11-xconsortium-veillard.LICENSE",
"scancode_data_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/x11-xconsortium-veillard.yml",
"spdx_license_key": "LicenseRef-scancode-x11-xconsortium-veillard",
"spdx_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/x11-xconsortium-veillard.LICENSE"
}
]
}
]
}
],
"license_clues": [],
"percentage_of_license_text": 0.1,
"package_data": [],
"for_packages": [],
"scan_errors": []
},
Also attaching the full scan results for reference: libxml-v2.7.8-add-license-detection.json.txt
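To check results like the ones above across a whole scan, a small filter along these lines may help. It is a sketch that assumes file entries are dicts shaped like the excerpt above; `dereferenced_files` is a hypothetical name, and the rule name is the one shown in `detection_rules` in that excerpt.

```python
def dereferenced_files(files, rule="unknown-reference-to-local-file"):
    """
    Return (path, license_expression) pairs for every detection that was
    produced by the given detection rule.
    """
    results = []
    for entry in files:
        for detection in entry.get("license_detections", []):
            if rule in detection.get("detection_rules", []):
                results.append((entry["path"], detection["license_expression"]))
    return results
```

If the per-file entries sit under a top-level "files" key in the JSON output (an assumption here), the input would be `json.load(f)["files"]`.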
I will also add this example with minimal files in our test suite shortly, as this is a good example.
Note that this won't be detected properly if we scan the parent or other directories of libxml2-v2.7.8, since the Copyright file is then no longer at the scan root. For such cases I was planning to look in package roots too, in addition to the scan root, when we can't find anything at the scan root.
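One possible shape for that fallback, as a sketch only: try the scan root first, then any package root that contains the file, nearest first. The names `find_referenced_file`, `package_roots` and `known_paths` are hypothetical, and the nearest-first ordering is an assumption rather than something decided here.

```python
from pathlib import PurePosixPath


def find_referenced_file(name, file_path, scan_root, package_roots, known_paths):
    """
    Return the path of the referenced file `name`, looking at the scan root
    first and then at package roots enclosing `file_path`, or None.
    """
    parents = PurePosixPath(file_path).parents
    # Keep only package roots that are ancestors of the file, deepest first.
    enclosing = sorted(
        (root for root in package_roots if PurePosixPath(root) in parents),
        key=lambda root: len(PurePosixPath(root).parts),
        reverse=True,
    )
    for directory in [scan_root, *enclosing]:
        candidate = str(PurePosixPath(directory) / name)
        if candidate in known_paths:
            return candidate
    return None
```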
There's also this case of "see license in package": https://github.com/nexB/scancode-toolkit/issues/2965, which I was planning to tackle like this:
- Add rules with referenced-filename package (this is hackish, but does the job).
- In our de-referencing logic, if we find package as the referenced-filename, we go to the package using the for_packages field.
- If the package is found and has a declared_license_expression, we assign that (we need the --package option enabled, obviously).
But there is a problem with this: when the process_codebase step of the license plugin runs, do we already have the results from the main scanning of the package plugin if both are enabled?
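A rough sketch of that package de-referencing, under stated assumptions: entries are plain dicts, the file lists its packages under `for_packages`, and `packages_by_uid` is a hypothetical mapping from those identifiers to package data that carries `declared_license_expression`. None of this is the actual plugin code.

```python
def resolve_package_reference(file_entry, packages_by_uid):
    """
    If any match references the pseudo-filename "package", return the
    declared license expression of the first package of this file that has
    one. Return None otherwise.
    """
    references_package = any(
        "package" in match.get("referenced_filenames", [])
        for detection in file_entry.get("license_detections", [])
        for match in detection.get("matches", [])
    )
    if not references_package:
        return None

    for uid in file_entry.get("for_packages", []):
        package = packages_by_uid.get(uid)
        if package and package.get("declared_license_expression"):
            return package["declared_license_expression"]
    return None
```

This only works when package data is already available at the time the license post-processing runs.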
Even if we don't have the above, we can just use the "datafile_paths" to get the file license detections, but this is also not perfect.
Do you think there could be other heuristics like this?