syft
Licenses missing in most report formats
What happened:
Scanning the same image leads to different results depending on the output format.
Type | Components | cpe | purl | Versions | Licenses | Notes |
---|---|---|---|---|---|---|
cyclonedx | 154 | 154 | 153 | 154 | 0 | WARN unable to convert relationship from CycloneDX 1.3 JSON, dropping... |
cyclonedx-json | 154 | 154 | 153 | 154 | 0 | WARN unable to convert relationship from CycloneDX 1.3 JSON, dropping... |
json | 153 | 806 | 153 | 153 | 153 | with 2 UNKNOWN License for python artifacts |
spdx-json | 153 | 806 | 153 | 153 | 0 | |
spdx-tag-value | 153 | 806 | 153 | 153 | 0 | |
text | 153 | 0 | 0 | 153 | 0 | |
table | 153 | 0 | 0 | 153 | 0 | |
Scanning the same image using tern
Type | Components | cpe | purl | Versions | Licenses | Notes |
---|---|---|---|---|---|---|
cyclonedxjson | 149 | 0 | 149 | 149 | 149 | |
html | 149 | 0 | 0 | 149 | 149 | |
json | 149 | 0 | 0 | 149 | 149 | |
spdxtagvalue | 149 | 0 | 0 | 149 | 149 | |
spdxjson | 149 | 0 | 0 | 149 | 149 | |
yaml | 149 | 0 | 0 | 149 | 149 | |
So the presence or absence of licenses is not a limitation of the formats themselves: for the common SPDX and CycloneDX formats, tern fills this field correctly. Since Syft has all the information in its json output, the loss probably occurs in the converter (which I think is reflected by the WARN logs).
What you expected to happen:
The expectation is that the content is independent of the format (except, of course, for table and text), and that everything a format can hold appears in the output.
How to reproduce it (as minimally and precisely as possible):
docker run \
--rm \
-it \
-v /var/run/docker.sock:/var/run/docker.sock \
-v $PWD:/tmp/workdir \
anchore/syft:latest \
-v \
packages \
-s Squashed \
-o <format> \
--file /tmp/workdir/bom.format \
docker:almalinux:latest
Anything else we need to know?:
Environment:
- Output of syft version: 0.42.4
- OS (e.g: cat /etc/os-release or similar):
The licenses are most likely missing because their names are not listed in internal/spdxlicense/license_list.go, and syft uses that list to validate licenses when converting to formats like CycloneDX JSON. This behaviour seems to make sense because, for example, Dependency-Track also does not process invalid license names like "GPL" (missing the version) or "BSD-3-clause-with-weird-numbering" (honestly WTF?) in SBOM files.
The warnings mentioned in this issue's description do not affect the license processing.
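To illustrate the kind of strict validation described above, here is a minimal Python sketch. It is not Syft's actual code (which is Go); KNOWN_SPDX_IDS is a tiny hypothetical subset of the SPDX list and keep_valid is a made-up helper name.

```python
# Hypothetical subset of SPDX identifiers; the real list is far longer.
KNOWN_SPDX_IDS = {"Apache-2.0", "GPL-2.0", "GPL-2.0+", "ISC"}


def keep_valid(licenses):
    """Keep only names that are exact SPDX identifiers; drop everything else."""
    return [name for name in licenses if name in KNOWN_SPDX_IDS]


# RPM-style names never match exactly, so a package loses all its licenses.
print(keep_valid(["GPLv2+", "ASL 2.0", "BSD"]))
print(keep_valid(["Apache-2.0", "GPL"]))
```

With a filter like this, any package whose metadata uses distribution-specific names ends up with an empty license list in the converted output.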
Tagging @cpendery
@WhyJee I'm having trouble replicating your license counts. When trying to recreate your values using the almalinux image tag closest to your post and Syft 0.42.4, I'm only finding 4 licenses. While there is a difference between that and the other formats, it's entirely based on the filtering @mj mentioned.
docker run \
--rm \
-it \
-v /var/run/docker.sock:/var/run/docker.sock \
-v $PWD:/tmp/workdir \
anchore/syft:v0.42.4 \
-v \
packages \
-s Squashed \
-o json \
--file /tmp/workdir/bom.json \
docker:almalinux:8.5-20220306
I'm able to replicate this filtering out of licenses in 0.52.0, with licenses like (Apache-2.0 OR MPL-1.1), Proprietary, and BSD being filtered out.
Based on your comments @mj / @spiffcs, I have rerun the analysis with the latest Syft and the latest Almalinux image.
There are 153 license entries in the json output, which break down as follows:
License | Count |
---|---|
BSD | 10 |
BSD and GPLv2 | 1 |
BSD and GPLv2+ | 3 |
BSD and LGPLv2+ | 1 |
BSD and LGPLv2 and Sleepycat | 2 |
BSD or GPLv2 | 1 |
BSD or GPLv2+ | 1 |
BSD with advertising | 1 |
GPLv2 | 4 |
GPLv2+ | 14 |
GPLv2+ and BSD | 1 |
GPLv2 and GPLv2+ and LGPLv2+ and BSD with advertising and Public Domain | 1 |
GPLv2+ and LGPLv2+ | 2 |
GPLv2+ and LGPLv2+ with exceptions | 2 |
GPLv2+ and Public Domain | 1 |
(GPLv2+ or AFL) and GPLv2+ | 5 |
GPLv2+ or LGPLv3+ | 4 |
(GPLv2+ or LGPLv3+) and GPLv3+ | 1 |
GPLv3+ | 12 |
GPLv3+ and GFDL | 1 |
GPLv3+ and GPLv2+ and LGPLv2+ and BSD | 1 |
GPLv3+ and GPLv3+ with exceptions and GPLv2+ with exceptions and LGPLv2+ and BSD | 2 |
GPLv3+ and LGPLv2+ | 2 |
GPLv3+ or BSD | 1 |
LGPL2.1+ (the library), GPL2+ (tests and examples) | 1 |
LGPLv2 | 2 |
LGPLv2+ | 24 |
LGPLv2+ and BSD and Public Domain | 1 |
LGPLv2+ and GPLv3+ | 3 |
LGPLv2+ and LGPLv2+ with exceptions and GPLv2+ and GPLv2+ with exceptions and BSD and Inner-Net and ISC and Public Domain and GFDL | 3 |
LGPLv2+ and MIT | 1 |
LGPLv2+ and MIT and GPLv2+ | 2 |
LGPLv3+ and GPLv3+ and GFDL | 1 |
LGPLv3+ or GPLv2+ | 2 |
(LGPLv3+ or GPLv2+) and GPLv3+ | 1 |
MIT | 18 |
MIT and Python and ASL 2.0 and BSD and ISC and LGPLv2 and MPLv2.0 and (ASL 2.0 or BSD) | 1 |
OpenLDAP | 1 |
OpenSSL and ASL 2.0 | 1 |
pubkey | 1 |
Public Domain | 9 |
Python | 2 |
SISSL and BSD | 1 |
UNKNOWN | 2 |
Vim and MIT | 1 |
zlib and Boost | 1 |
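For anyone who wants to reproduce a tally like the one above, here is a rough Python sketch. It assumes the Syft json output has a top-level "artifacts" list and that each artifact carries a "licenses" list of plain strings; that layout is an assumption and may differ between Syft versions. tally_licenses is a made-up helper name.

```python
from collections import Counter


def tally_licenses(sbom):
    """Count how often each license string appears across all artifacts."""
    counts = Counter()
    for artifact in sbom.get("artifacts", []):
        # "licenses" may be missing or null for some artifacts.
        for name in artifact.get("licenses") or []:
            counts[name] += 1
    return counts


# In-memory stand-in for json.load(open("bom.json")).
sample = {
    "artifacts": [
        {"name": "a", "licenses": ["MIT"]},
        {"name": "b", "licenses": ["MIT"]},
        {"name": "c", "licenses": ["GPLv2+"]},
        {"name": "d", "licenses": None},
    ]
}
for name, count in sorted(tally_licenses(sample).items()):
    print(f"{name} | {count} |")
```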
From this we can split the problem into several categories and eventually solve some.
Multiple licenses
This is the tricky case, as the scanner would need a robust splitting algorithm (see the table above). You would then still have the name-matching issue below to solve, of course.
Note: a commercial tool our company is also investigating transforms:
MIT and Python and ASL 2.0 and BSD and ISC and LGPLv2 and MPLv2.0 and (ASL 2.0 or BSD)
into:
Python Software Foundation License 2.0 AND GNU Library General Public License v2 or later AND MIT License AND ISC License AND Apache License 2.0 AND BSD 3-clause "New" or "Revised" License AND Mozilla Public License 2.0
This is not a split, but it seems to be parsed fairly correctly.
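As a starting point for the split, here is a naive Python sketch that extracts the distinct license names from an RPM-style expression. It only looks for "and" / "or" as whole words and ignores grouping, so it is not a real expression parser; leaf_licenses is a made-up helper name. Free-text entries such as "LGPL2.1+ (the library), GPL2+ (tests and examples)" would still be mangled by it.

```python
import re

# "and" / "or" surrounded by whitespace, in any letter case.
_OPERATOR = re.compile(r"\s+(?:and|or)\s+", re.IGNORECASE)


def leaf_licenses(expression):
    """Return the distinct license names in an expression, in first-seen order."""
    # Parentheses only group terms, so drop them before splitting.
    flat = expression.replace("(", " ").replace(")", " ")
    names = []
    for part in _OPERATOR.split(flat):
        name = " ".join(part.split())  # collapse runs of whitespace
        if name and name not in names:
            names.append(name)
    return names


print(leaf_licenses("(GPLv2+ or AFL) and GPLv2+"))
print(leaf_licenses("BSD with advertising"))
```

Note that this throws away the AND/OR structure, so it only helps with recognizing names, not with rebuilding a valid SPDX expression.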
Single license
Name mismatch
This is the most common issue for single license names.
- ASL 2.0 not matching one of "apache-2" "apache-2.0" "apache-2.0.0", not leading to license Apache-2.0
- GPLv2 not matching one of "gpl-2" "gpl-2.0" "gpl-2.0.0", not leading to license GPL-2.0
- GPL2+ or GPLv2+ not matching one of "gpl-2+" "gpl-2.0+" "gpl-2.0.0+", not leading to license GPL-2.0+
- GPLv3+ not matching one of "gpl-3+" "gpl-3.0+" "gpl-3.0.0+", not leading to license GPL-3.0+
- LGPLv2 not matching one of "lgpl-2" "lgpl-2.0" "lgpl-2.0.0", not leading to license LGPL-2.0
- LGPL2.1+ not matching one of "lgpl-2+" "lgpl-2.0+" "lgpl-2.0.0+", not leading to license LGPL-2.0+
- LGPLv2+ not matching one of "lgpl-2+" "lgpl-2.0+" "lgpl-2.0.0+", not leading to license LGPL-2.0+
- LGPLv3+ not matching one of "lgpl-3+" "lgpl-3.0+" "lgpl-3.0.0+", not leading to license LGPL-3.0+
- MPLv2.0 not matching one of "mpl-2" "mpl-2.0" "mpl-2.0.0", not leading to license MPL-2.0
I don't know what the other packagers (Debian, ...) put in the license field, but it seems the solution could be to update license_list.go so that these names match. I am not sure we can ask Red Hat to rewrite all its RPMs to comply with Syft.
Name not recognized
This one occurs only if the single license is "MIT" (18 occurrences) or "BSD" (10 occurrences). It seems, though I have not checked, that the lookup is an exact match; we might have expected it to be case-insensitive.
Solving these two issues would be a first step.
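To make the idea concrete, here is a rough Python sketch of a more forgiving lookup covering both cases. It is only an illustration, not how license_list.go works: ALIASES, KNOWN_IDS and normalize are hypothetical names, and the tables are tiny hand-picked subsets. The output uses the same style of identifiers as the list above (for example GPL-2.0+).

```python
import re

# Hypothetical alias table for names that follow no regular pattern.
ALIASES = {"asl 2.0": "Apache-2.0"}
# Hypothetical subset of identifiers used for the case-insensitive lookup.
KNOWN_IDS = {"MIT", "ISC", "OpenSSL", "Zlib"}

# Matches GPLv2, GPL2+, LGPL2.1+, MPLv2.0 and similar spellings.
_VERSIONED = re.compile(r"^(LGPL|GPL|MPL)v?(\d+)(?:\.(\d+))?(\+?)$", re.IGNORECASE)


def normalize(name):
    """Map a packager-style license name to an SPDX-style id, or None."""
    key = name.strip()
    if key.lower() in ALIASES:
        return ALIASES[key.lower()]
    match = _VERSIONED.match(key)
    if match:
        family, major, minor, plus = match.groups()
        return f"{family.upper()}-{major}.{minor or '0'}{plus}"
    by_lower = {known.lower(): known for known in KNOWN_IDS}
    return by_lower.get(key.lower())


for raw in ["ASL 2.0", "GPLv2+", "LGPL2.1+", "MPLv2.0", "mit", "BSD"]:
    print(raw, "->", normalize(raw))
```

A bare "BSD" deliberately stays unmapped here: SPDX has no plain BSD identifier, only specific variants such as BSD-2-Clause and BSD-3-Clause, so guessing one would be wrong.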
For purposes of CycloneDX note that the format allows
"license": {"id": "SPDX ref"}
or alternatively if it cannot be matched:
"license": {"name": "any text you want"}
It would be great if at least in CycloneDX case the available information could be returned. A free text License "name" is better than nothing.
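In code, the fallback I have in mind would look roughly like this Python sketch. cyclonedx_license and the known_ids argument are hypothetical names used for illustration only; the two output shapes are the ones quoted above.

```python
def cyclonedx_license(raw, known_ids):
    """Build a CycloneDX license entry, falling back to free text."""
    if raw in known_ids:
        # Recognized SPDX identifier: use the "id" form.
        return {"license": {"id": raw}}
    # Anything else is kept verbatim under "name" instead of being dropped.
    return {"license": {"name": raw}}


known = {"MIT", "Apache-2.0"}
print(cyclonedx_license("MIT", known))
print(cyclonedx_license("GPLv2+ and LGPLv2+ with exceptions", known))
```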
Hi @WhyJee,
thanks for bringing the different categories up!
One could argue that in a component that is licensed as "MIT AND LGPL-2.1-only" there actually are "sub components" that are licensed differently. So from the perspective of CycloneDX, this should somehow be two components (that are smooshed together) and not one component with License "MIT AND LGPL-2.1-only".
But I'm pretty sure that this is just a rough estimation.
I've seen people release software under "GPL AND MIT" to tell you that you can choose one and sometimes people release Software under "GPL OR MIT" to give you this very choice.
On the other hand, SPDX and Fedora seem to agree that only "OR" should be used for this, and that "AND" should be used if different parts of the component have different licenses.
Fedora has a guideline for this: https://docs.fedoraproject.org/en-US/legal/license-field/#_license_expressions
SPDX has something similar: https://spdx.github.io/spdx-spec/v2.3/SPDX-license-expressions/
But while we are waiting for the "real" solution would it not be better to report unknown (unmapped) licenses as "whatever" than not reporting them at all?
Hi there,
I did some cross-checking and found that other CycloneDX tools seem to struggle with license expressions such as "(LGPLv3+ or GPLv2+) and GPLv3+" as well:
https://github.com/CycloneDX/cyclonedx-python/discussions/377
https://github.com/DependencyTrack/dependency-track/issues/170
The cyclonedx-python people went one step further than just struggling here:
https://github.com/CycloneDX/cyclonedx-python-lib/issues/304
CycloneDX seems to have a precise way of doing this by embracing SPDX-License-Expressions:
https://github.com/CycloneDX/specification/issues/1
One more thing: Dependency Track plans to support these SPDX-License-Expressions as stated here: https://github.com/DependencyTrack/dependency-track/issues/170#issuecomment-1169067549
Hi there, do we have any progress on this?
Hi @dawez -- I think this will be fixed with #1540