azurelinux icon indicating copy to clipboard operation
azurelinux copied to clipboard

Add new license validator tool

Open dmcilvaney opened this issue 1 year ago • 1 comments

Merge Checklist

All boxes should be checked before merging the PR (just tick any boxes which don't apply to this PR)

  • [ ] The toolchain has been rebuilt successfully (or no changes were made to it)
  • [ ] The toolchain/worker package manifests are up-to-date
  • [ ] Any updated packages successfully build (or no packages were changed)
  • [ ] Packages depending on static components modified in this PR (Golang, *-static subpackages, etc.) have had their Release tag incremented.
  • [ ] Package tests (%check section) have been verified with RUN_CHECK=y for existing SPEC files, or added to new SPEC files
  • [ ] All package sources are available
  • [ ] cgmanifest files are up-to-date and sorted (./cgmanifest.json, ./toolkit/scripts/toolchain/cgmanifest.json, .github/workflows/cgmanifest.json)
  • [ ] LICENSE-MAP files are up-to-date (./SPECS/LICENSES-AND-NOTICES/data/licenses.json, ./SPECS/LICENSES-AND-NOTICES/LICENSES-MAP.md, ./SPECS/LICENSES-AND-NOTICES/LICENSE-EXCEPTIONS.PHOTON)
  • [ ] All source files have up-to-date hashes in the *.signatures.json files
  • [ ] sudo make go-tidy-all and sudo make go-test-coverage pass
  • [ ] Documentation has been updated to match any changes to the build system
  • [ ] If you are adding/removing a .spec file that has multiple-versions supported, please add @microsoft/cbl-mariner-multi-package-reviewers team as reviewer (Eg. golang has 2 versions 1.18, 1.21+)
  • [ ] Ready to merge

Summary

Some of our packages have poorly configured license files (e.g. COPYING, license.txt, etc.) that should accompany the packages when installed. Of most critical concern are packages which incorrectly mark these files as %doc, either explicitly or automatically based on file location. We care about this since we would like to be able to disable documentation when making size-constrained images. If the license files are incorrectly tagged as %doc they will not be installed in the images. This is important if we release these images. (The most critical issues affecting the core images are fixed here).

A tool was required to find these mis-configured packages to fix them, and ideally, we want to ensure that future packages do not regress. This is the first version of that regression checker. It currently is a stand-alone tool which can be run for all locally build packages, all packages used in an image, or an arbitrary directory containing *.rpm files.

In the future this tool will be integrated with the scheduler to validate rpms when they are built, and the imager tool to detect issues (especially if documentation is disabled).

Invoking the tool

This PR adds new make targets:

  • sudo make license-check SRPM_PACK_LIST="pkg1 pkg2 pkg3" will build pgk, pkg2, pgk3 and then validate their licenses
  • sudo make license-check-img CONFIG_FILE="./imageconfigs/core-efi.json" will gather all the packages needed for core-efi, then validate their licenses
  • sudo make license-check LICENSE_CHECK_DIRS="./path/to/rpms ./other/path/to/more/rpms" will scan for any *.rpm files in those directories, recursively.
Implementation

The changes are split into two parts: the licensecheck command line, and the common licensecheck package. The common package is designed to also be used as part of the image and package build tooling.

There are two parts to the package, a wrapper tool based on the simpletoolchroot package. It manages a worker chroot and provides New(), CleanUp(), CheckLicenses(), GetAllResults()/FormatResults(). These take a directory, mounts it into a chroot worker, and scans it, and then offers the results either as lists or a formatted message. These functions in turn call the second set of functions. The second set are used to scan individual RPMs and are suitable for use in any mariner-like environment. CheckRpmLicenses() will scan a specific RPM for license issues. IsALicenseFile() and IsASkippedLicenseFile() are used to interact with the matching heuristic (see below).

The tool outputs several types of results:

  • BadDocs: A critical issue where a %doc file is actually a license file. These are must-fix.
  • BadFiles: Less critical, a general file that the tool thinks is a license file. Still considered an error.
  • DuplicatedDocs: A warning, a license file appears to be duplicated in the docs.
Exceptions

The tool allows exceptions for bad detections. Exceptions can be added to toolkit/resources/manifests/package/license_file_exceptions.json and it will be ignored by the tools.

Heuristic

The matching is one based on filename only currently. In toolkit/tools/pkg/licensecheck/licensenames.go there are three lists, licenseNamesFuzzy, licenseNamesVerbatim, licenseNamesSkip. The fuzzy names are case-insensitive regexes that are matched against the file paths. They will match any substrings inside the file path, ignoring some well-known prefixes (e.g. /usr/share/licenses/pkg/licenses/somefile.txt will be converted to licenses/somefile.txt before matching, it would match against (?i).*license.*). The verbatim matches target only the basename of files exactly (e.g. /usr/share/licenses/pkg/licenses/MIT will be converted to MIT, it would match ^MIT$). Finally the skip lists behave the same as the first set of matches, but are files to ignore.

Test Data

A pair of tools can be used to validate the heuristic. generate_test_data.py will scan the currently active repos of its host machine (use azl3 container to get azl3 data for example) and generate lists of files: all_docs_<date>.txt, all_licenses_<date>.txt and all_other_files_<date>.txt (The all other files list is git ignored since it is very large). These can be used to calculate false positive/ false negative data with the go tool licensetestchecker.go (See README_license_check.md for details on regenerating and testing the lists).

As of 2024-05-07 the results are:

  • 66 false negative results against real %license files (1.91%)
  • 383 false positive against real %doc files (0.25%)
  • 1535 false positive against all other (not doc or license) files (0.47%)

NOTE: Some of these "false positives" are actually correct detections of legitimately bad license files

Limitations

The tool does not currently examine the contents of the files, only their paths. It therefor can only detect well-known styles of licenses.

It also is not able to determine if a package is MISSING licenses, only if the ones already included are packaged correctly. This has two parts: A package may completely omit required licenses, but it may also include them in the wrong sub-package. e.g. pkg-common which is required by both pkg, pkg-libs. The license files may be included in pkg-common and since it is required by the others it will bring the required license files. However, if the licenses were instead included in pkg, pkg-libs could be installed without pulling in the licenses (assuming there is no dependency from pkg-libs -> pkg).

Future Work

This tool must be manually run currently. Future features:

  • Document via contributing guide
  • Add a test to our build pipelines to flag these
  • Integrate into the package build tool (first as a warning, then an error)
  • Consider if we should handle the limitations listed above
Example output
sudo make license-check-img CONFIG_FILE=./imageconfigs/core-efi.json

Gives...

...

/home/damcilva/repos/CBL-Mariner/toolkit/out/tools/licensecheck \
        --rpm-dirs="/home/damcilva/repos/CBL-Mariner/build/imagegen/core-efi/package_repo" \
        --exception-file="/home/damcilva/repos/CBL-Mariner/toolkit/resources/manifests/package/license_file_exceptions.json" \
        --worker-tar="/home/damcilva/repos/CBL-Mariner/build/worker/worker_chroot.tar.gz" \
        --build-dir="/home/damcilva/repos/CBL-Mariner/build/license_check_tool" \
        --dist-tag=.azl3 \
         \
        --results-file="/home/damcilva/repos/CBL-Mariner/out/license_check/license_check_results_image_core-efi.json"
INFO[0000][licensecheck] Preparing license check environment for /home/damcilva/repos/CBL-Mariner/build/imagegen/core-efi/package_repo... 
INFO[0005][licensecheck] Scanning /home/damcilva/repos/CBL-Mariner/build/imagegen/core-efi/package_repo for license issues... 
INFO[0005][licensecheck] Queuing 217 rpms for search                  
INFO[0005][licensecheck] Searching rpms...                            
INFO[0005][licensecheck] Checked 22/217 rpms (10%)          
          
...

INFO[0006][licensecheck] Checked 217/217 rpms (100%)                  
INFO[0006][licensecheck] Search results for /home/damcilva/repos/CBL-Mariner/build/imagegen/core-efi/package_repo:
WARN: 'libaio-0.3.113-1.azl3.x86_64.rpm' has license warnings:
        duplicated license files:
                /usr/share/doc/libaio/COPYING
WARN: 'slang-2.3.3-1.azl3.x86_64.rpm' has license warnings:
        duplicated license files:
                /usr/share/doc/slang-2.3.3/COPYING
WARN: 'sudo-1.9.15p5-1.azl3.x86_64.rpm' has license warnings:
        duplicated license files:
                /usr/share/doc/sudo-1.9.15p5/LICENSE.md
ERROR: 'libdb-5.3.28-7.azl3.x86_64.rpm' has license errors:
        bad general file:
                /usr/share/licenses/LICENSE
                /usr/share/licenses/README
ERROR: 'lmdb-libs-0.9.31-1.azl3.x86_64.rpm' has license errors:
        bad %doc files:
                /usr/share/doc/lmdb/COPYRIGHT
        bad general file:
                /usr/share/licenses/lmdb/LICENSE
ERROR: 'userspace-rcu-0.14.0-1.azl3.x86_64.rpm' has license errors:
        bad %doc files:
                /usr/share/doc/userspace-rcu/LICENSE 
INFO[0006][licensecheck] Writing results to file: /home/damcilva/repos/CBL-Mariner/out/license_check/license_check_results_image_core-efi.json 
INFO[0006][licensecheck] 
Errors/warnings fall into three buckets:
        1. 'bad %doc files': A %doc documentation file that the tool believes to be a license file.
        2. 'bad general file': A file that is placed into '/usr/share/licenses/' that is not flagged as
                a license file. These files should use %license instead of %doc. Ideally whey should also
                not be placed in a directory manually. (e.g. prefer '%license COPYING' over
                '%license %{_docdir}/%{name}/COPYING')
        3. 'duplicated license files': A license file that is both a %license and a %doc file, pick one.")
                This is a warning, unless the tool is run in pedantic mode, in which case it is an error.
How to fix:
        - 'False positives': In all cases, a detection may be suppressed by using the exception file:
                /home/damcilva/repos/CBL-Mariner/toolkit/resources/manifests/package/license_file_exceptions.json.
                This file contains per-package and global exceptions in the form of regexes.
        - 'bad %%doc files': Mark it using %license, ideally without using a buildroot path (e.g. use '%license COPYING').
        - 'bad general file': Mark it using %license, ideally without using a buildroot path (e.g. use '%license COPYING').
        - 'duplicated license files': If they are actually equivalent, remove the copy in the documentation.
        - Query package contents with 'rpm -ql <package>.rpm' to see all files, 'rpm -qL <package>.rpm' to
                see only the license files, and 'rpm -qd <package>.rpm' to see only the documentation files. 
ERRO[0006][licensecheck] Found (3) packages with license errors.      
WARN[0006][licensecheck] Found (3) packages with non-fatal license issues. 
make: *** [/home/damcilva/repos/CBL-Mariner/toolkit/scripts/analysis.mk:128: license-check-img] Error 1
Change Log
  • Add licensecheck tool
  • Add license-check and license-check-img build targets
  • Add licensecheck package
  • Add testing infrastructure to validate the tool
Does this affect the toolchain?

NO

Associated issues
  • https://microsoft.visualstudio.com/DefaultCollection/OS/_workitems/edit/50220148
  • https://microsoft.visualstudio.com/DefaultCollection/OS/_workitems/edit/50276253
Test Methodology
  • Local testing

dmcilvaney avatar May 08 '24 23:05 dmcilvaney

I think we could get rid of this sudo make license-check SRPM_PACK_LIST="pkg1 pkg2 pkg3" will build pgk, pkg2, pgk3 and then validate their licenses as it mixes two things, building packages and validating licenses. If we get rid of the building packages part, we will only be validating licenses for rpms already built. Can we reuse the same chroot for all rpms that need to be validated? Might save us time and resources in creating/tearing down chroots

The idea is to be able to scan a list of packages (via SRPM_PACK_LIST since that controls that behavior in the tools). The scanner tool only ever uses a single chroot.

Phase two of this tool is to integrate it into the package build (and image build). That's the reason I exposed some of the basic functions as public. The package build chroot can call the functions on each *.rpm it has built without needing to spin up a dedicated chroot.

dmcilvaney avatar Jun 20 '24 21:06 dmcilvaney