Add new license validator tool
Merge Checklist
All boxes should be checked before merging the PR (just tick any boxes which don't apply to this PR)
- [ ] The toolchain has been rebuilt successfully (or no changes were made to it)
- [ ] The toolchain/worker package manifests are up-to-date
- [ ] Any updated packages successfully build (or no packages were changed)
- [ ] Packages depending on static components modified in this PR (Golang,
*-staticsubpackages, etc.) have had theirReleasetag incremented. - [ ] Package tests (%check section) have been verified with RUN_CHECK=y for existing SPEC files, or added to new SPEC files
- [ ] All package sources are available
- [ ] cgmanifest files are up-to-date and sorted (
./cgmanifest.json,./toolkit/scripts/toolchain/cgmanifest.json,.github/workflows/cgmanifest.json) - [ ] LICENSE-MAP files are up-to-date (
./SPECS/LICENSES-AND-NOTICES/data/licenses.json,./SPECS/LICENSES-AND-NOTICES/LICENSES-MAP.md,./SPECS/LICENSES-AND-NOTICES/LICENSE-EXCEPTIONS.PHOTON) - [ ] All source files have up-to-date hashes in the
*.signatures.jsonfiles - [ ]
sudo make go-tidy-allandsudo make go-test-coveragepass - [ ] Documentation has been updated to match any changes to the build system
- [ ] If you are adding/removing a .spec file that has multiple-versions supported, please add @microsoft/cbl-mariner-multi-package-reviewers team as reviewer (Eg. golang has 2 versions 1.18, 1.21+)
- [ ] Ready to merge
Summary
Some of our packages have poorly configured license files (e.g. COPYING, license.txt, etc.) that should accompany the packages when installed. Of most critical concern are packages which incorrectly mark these files as %doc, either explicitly or automatically based on file location. We care about this since we would like to be able to disable documentation when making size-constrained images. If the license files are incorrectly tagged as %doc they will not be installed in the images. This is important if we release these images. (The most critical issues affecting the core images are fixed here).
A tool was required to find these mis-configured packages to fix them, and ideally, we want to ensure that future packages do not regress. This is the first version of that regression checker. It currently is a stand-alone tool which can be run for all locally build packages, all packages used in an image, or an arbitrary directory containing *.rpm files.
In the future this tool will be integrated with the scheduler to validate rpms when they are built, and the imager tool to detect issues (especially if documentation is disabled).
Invoking the tool
This PR adds new make targets:
sudo make license-check SRPM_PACK_LIST="pkg1 pkg2 pkg3"will buildpgk,pkg2,pgk3and then validate their licensessudo make license-check-img CONFIG_FILE="./imageconfigs/core-efi.json"will gather all the packages needed forcore-efi, then validate their licensessudo make license-check LICENSE_CHECK_DIRS="./path/to/rpms ./other/path/to/more/rpms"will scan for any*.rpmfiles in those directories, recursively.
Implementation
The changes are split into two parts: the licensecheck command line, and the common licensecheck package. The common package is designed to also be used as part of the image and package build tooling.
There are two parts to the package, a wrapper tool based on the simpletoolchroot package. It manages a worker chroot and provides New(), CleanUp(), CheckLicenses(), GetAllResults()/FormatResults(). These take a directory, mounts it into a chroot worker, and scans it, and then offers the results either as lists or a formatted message. These functions in turn call the second set of functions. The second set are used to scan individual RPMs and are suitable for use in any mariner-like environment. CheckRpmLicenses() will scan a specific RPM for license issues. IsALicenseFile() and IsASkippedLicenseFile() are used to interact with the matching heuristic (see below).
The tool outputs several types of results:
BadDocs: A critical issue where a%docfile is actually a license file. These are must-fix.BadFiles: Less critical, a general file that the tool thinks is a license file. Still considered an error.DuplicatedDocs: A warning, a license file appears to be duplicated in the docs.
Exceptions
The tool allows exceptions for bad detections. Exceptions can be added to toolkit/resources/manifests/package/license_file_exceptions.json and it will be ignored by the tools.
Heuristic
The matching is one based on filename only currently. In toolkit/tools/pkg/licensecheck/licensenames.go there are three lists, licenseNamesFuzzy, licenseNamesVerbatim, licenseNamesSkip. The fuzzy names are case-insensitive regexes that are matched against the file paths. They will match any substrings inside the file path, ignoring some well-known prefixes (e.g. /usr/share/licenses/pkg/licenses/somefile.txt will be converted to licenses/somefile.txt before matching, it would match against (?i).*license.*). The verbatim matches target only the basename of files exactly (e.g. /usr/share/licenses/pkg/licenses/MIT will be converted to MIT, it would match ^MIT$). Finally the skip lists behave the same as the first set of matches, but are files to ignore.
Test Data
A pair of tools can be used to validate the heuristic. generate_test_data.py will scan the currently active repos of its host machine (use azl3 container to get azl3 data for example) and generate lists of files: all_docs_<date>.txt, all_licenses_<date>.txt and all_other_files_<date>.txt (The all other files list is git ignored since it is very large). These can be used to calculate false positive/ false negative data with the go tool licensetestchecker.go (See README_license_check.md for details on regenerating and testing the lists).
As of 2024-05-07 the results are:
- 66 false negative results against real %license files (1.91%)
- 383 false positive against real %doc files (0.25%)
- 1535 false positive against all other (not doc or license) files (0.47%)
NOTE: Some of these "false positives" are actually correct detections of legitimately bad license files
Limitations
The tool does not currently examine the contents of the files, only their paths. It therefor can only detect well-known styles of licenses.
It also is not able to determine if a package is MISSING licenses, only if the ones already included are packaged correctly. This has two parts: A package may completely omit required licenses, but it may also include them in the wrong sub-package. e.g. pkg-common which is required by both pkg, pkg-libs. The license files may be included in pkg-common and since it is required by the others it will bring the required license files. However, if the licenses were instead included in pkg, pkg-libs could be installed without pulling in the licenses (assuming there is no dependency from pkg-libs -> pkg).
Future Work
This tool must be manually run currently. Future features:
- Document via contributing guide
- Add a test to our build pipelines to flag these
- Integrate into the package build tool (first as a warning, then an error)
- Consider if we should handle the limitations listed above
Example output
sudo make license-check-img CONFIG_FILE=./imageconfigs/core-efi.json
Gives...
...
/home/damcilva/repos/CBL-Mariner/toolkit/out/tools/licensecheck \
--rpm-dirs="/home/damcilva/repos/CBL-Mariner/build/imagegen/core-efi/package_repo" \
--exception-file="/home/damcilva/repos/CBL-Mariner/toolkit/resources/manifests/package/license_file_exceptions.json" \
--worker-tar="/home/damcilva/repos/CBL-Mariner/build/worker/worker_chroot.tar.gz" \
--build-dir="/home/damcilva/repos/CBL-Mariner/build/license_check_tool" \
--dist-tag=.azl3 \
\
--results-file="/home/damcilva/repos/CBL-Mariner/out/license_check/license_check_results_image_core-efi.json"
INFO[0000][licensecheck] Preparing license check environment for /home/damcilva/repos/CBL-Mariner/build/imagegen/core-efi/package_repo...
INFO[0005][licensecheck] Scanning /home/damcilva/repos/CBL-Mariner/build/imagegen/core-efi/package_repo for license issues...
INFO[0005][licensecheck] Queuing 217 rpms for search
INFO[0005][licensecheck] Searching rpms...
INFO[0005][licensecheck] Checked 22/217 rpms (10%)
...
INFO[0006][licensecheck] Checked 217/217 rpms (100%)
INFO[0006][licensecheck] Search results for /home/damcilva/repos/CBL-Mariner/build/imagegen/core-efi/package_repo:
WARN: 'libaio-0.3.113-1.azl3.x86_64.rpm' has license warnings:
duplicated license files:
/usr/share/doc/libaio/COPYING
WARN: 'slang-2.3.3-1.azl3.x86_64.rpm' has license warnings:
duplicated license files:
/usr/share/doc/slang-2.3.3/COPYING
WARN: 'sudo-1.9.15p5-1.azl3.x86_64.rpm' has license warnings:
duplicated license files:
/usr/share/doc/sudo-1.9.15p5/LICENSE.md
ERROR: 'libdb-5.3.28-7.azl3.x86_64.rpm' has license errors:
bad general file:
/usr/share/licenses/LICENSE
/usr/share/licenses/README
ERROR: 'lmdb-libs-0.9.31-1.azl3.x86_64.rpm' has license errors:
bad %doc files:
/usr/share/doc/lmdb/COPYRIGHT
bad general file:
/usr/share/licenses/lmdb/LICENSE
ERROR: 'userspace-rcu-0.14.0-1.azl3.x86_64.rpm' has license errors:
bad %doc files:
/usr/share/doc/userspace-rcu/LICENSE
INFO[0006][licensecheck] Writing results to file: /home/damcilva/repos/CBL-Mariner/out/license_check/license_check_results_image_core-efi.json
INFO[0006][licensecheck]
Errors/warnings fall into three buckets:
1. 'bad %doc files': A %doc documentation file that the tool believes to be a license file.
2. 'bad general file': A file that is placed into '/usr/share/licenses/' that is not flagged as
a license file. These files should use %license instead of %doc. Ideally whey should also
not be placed in a directory manually. (e.g. prefer '%license COPYING' over
'%license %{_docdir}/%{name}/COPYING')
3. 'duplicated license files': A license file that is both a %license and a %doc file, pick one.")
This is a warning, unless the tool is run in pedantic mode, in which case it is an error.
How to fix:
- 'False positives': In all cases, a detection may be suppressed by using the exception file:
/home/damcilva/repos/CBL-Mariner/toolkit/resources/manifests/package/license_file_exceptions.json.
This file contains per-package and global exceptions in the form of regexes.
- 'bad %%doc files': Mark it using %license, ideally without using a buildroot path (e.g. use '%license COPYING').
- 'bad general file': Mark it using %license, ideally without using a buildroot path (e.g. use '%license COPYING').
- 'duplicated license files': If they are actually equivalent, remove the copy in the documentation.
- Query package contents with 'rpm -ql <package>.rpm' to see all files, 'rpm -qL <package>.rpm' to
see only the license files, and 'rpm -qd <package>.rpm' to see only the documentation files.
ERRO[0006][licensecheck] Found (3) packages with license errors.
WARN[0006][licensecheck] Found (3) packages with non-fatal license issues.
make: *** [/home/damcilva/repos/CBL-Mariner/toolkit/scripts/analysis.mk:128: license-check-img] Error 1
Change Log
- Add
licensechecktool - Add
license-checkandlicense-check-imgbuild targets - Add
licensecheckpackage - Add testing infrastructure to validate the tool
Does this affect the toolchain?
NO
Associated issues
- https://microsoft.visualstudio.com/DefaultCollection/OS/_workitems/edit/50220148
- https://microsoft.visualstudio.com/DefaultCollection/OS/_workitems/edit/50276253
Test Methodology
- Local testing
I think we could get rid of this
sudo make license-check SRPM_PACK_LIST="pkg1 pkg2 pkg3" will build pgk, pkg2, pgk3 and then validate their licensesas it mixes two things, building packages and validating licenses. If we get rid of the building packages part, we will only be validating licenses for rpms already built. Can we reuse the same chroot for all rpms that need to be validated? Might save us time and resources in creating/tearing down chroots
The idea is to be able to scan a list of packages (via SRPM_PACK_LIST since that controls that behavior in the tools). The scanner tool only ever uses a single chroot.
Phase two of this tool is to integrate it into the package build (and image build). That's the reason I exposed some of the basic functions as public. The package build chroot can call the functions on each *.rpm it has built without needing to spin up a dedicated chroot.