Some package entries have wrong package metadata
We use sbom4python to construct an SBOM for our product. Our environment was built around Python 3.9 and the SBOM generation seemed correct. After upgrading our environment to Python 3.12, we noticed an issue with some packages, especially numpy and scipy.
The SBOM entries for these packages are replaced by other packages when running sbom4python in Python 3.12:
numpy becomes GCC runtime library
scipy becomes libquadmath
In fact, these are sub-dependencies of the related packages. The output of "pip show numpy" displays the metadata attributes of the package (Name:, Version:, etc.). But for these big packages, pip show also outputs some information about the sub-dependencies.
In Python 3.9, the sub-dependency metadata was indented in the output. In Python 3.12, it is no longer indented, so it is mingled at the same level as the package's own metadata. For example, let's compare the output of this command under the two versions of Python:
(python 3.9)
> pip show scipy | findstr Name:
Name: scipy
Name: OpenBLAS
Name: LAPACK
Name: GCC runtime library
Name: libquadmath
(python 3.12)
> pip show scipy | findstr Name:
Name: scipy
Name: OpenBLAS
Name: LAPACK
Name: GCC runtime library
Name: libquadmath
sbom4python uses pip show to retrieve package metadata and builds a metadata dictionary by parsing the output of pip show.
https://github.com/anthonyharrison/sbom4python/blob/f377631a68fffa2be3e1451f2f8d231816b965ca/sbom4python/scanner.py#L245
https://github.com/anthonyharrison/sbom4python/blob/f377631a68fffa2be3e1451f2f8d231816b965ca/sbom4python/scanner.py#L252
Since the entries for the displayed package and its sub-packages are at the same level in Python 3.12, every sub-package entry whose key is already in the dictionary overwrites the current package's entry. The last sub-package described by pip show takes precedence and replaces the actual package description. Furthermore, some metadata attributes from different packages end up mixed together in the final metadata.
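The overwrite can be reproduced with a minimal sketch. This is not sbom4python's actual code: `parse_flat` is a hypothetical flat `Key: value` parser, and `SAMPLE` is a shortened, made-up stand-in for un-indented pip show output (the version number is a placeholder).

```python
# Shortened, made-up stand-in for un-indented `pip show scipy` output.
# The version value is a placeholder, not real output.
SAMPLE = """\
Name: scipy
Version: 0.0.0
Name: OpenBLAS
Name: LAPACK
Name: GCC runtime library
Name: libquadmath
"""


def parse_flat(text):
    """Naive parser (hypothetical): store every 'Key: value' line in one dict."""
    metadata = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            # A repeated key silently replaces the value stored earlier.
            metadata[key.strip()] = value.strip()
    return metadata


result = parse_flat(SAMPLE)
print(result["Name"])  # the last "Name:" line wins
```

With this input the dictionary ends up with the name of the last sub-package instead of scipy, which matches the symptom described above.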
For these reasons, I would urge you to replace the metadata retrieval with a more structured and robust approach (https://github.com/anthonyharrison/sbom4python/issues/17).
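One possible structured approach, sketched here as an assumption and not necessarily what is planned for #17, is to read the metadata through the standard library's `importlib.metadata` instead of parsing text output. The function name `read_metadata` and the selection of fields are illustrative only.

```python
from importlib import metadata


def read_metadata(package_name):
    """Return selected metadata fields of an installed distribution, or None.

    Illustrative helper: the field selection is an assumption, not
    sbom4python's actual behaviour.
    """
    try:
        meta = metadata.metadata(package_name)
    except metadata.PackageNotFoundError:
        return None
    return {
        "name": meta.get("Name"),
        "version": meta.get("Version"),
        "license": meta.get("License"),
        # get_all returns None when the field is absent.
        "classifiers": meta.get_all("Classifier") or [],
    }


if __name__ == "__main__":
    print(read_metadata("scipy"))  # None if scipy is not installed
```

Because each field is looked up by name on a parsed metadata object, free text inside one field (such as a long license description) cannot overwrite another field.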
I am currently working on addressing #17.
@dlebelcimmi Any improvement with the latest version of sbom4python?
Thank you for the update. I am not currently working on the project where I used sbom4python. I informed my team of the December releases. I don't know if it's going to be a priority to update sbom4python. If I have updates, I will post them.
Thank you again!
Hi @anthonyharrison. Yes, it appears to have fixed the issue. I had to fix the output by patching SBOMScanner.process_module() to prevent package metadata overwriting. Now both scipy and numpy seem to be included correctly.
While this seems to be a good improvement, we expected the licenses to be provided by the metadata. But it appears that the license metadata is not strictly enforced.
It seems that Python classifiers might be a more standard source of truth (https://pypi.org/classifiers/), even though they are not guaranteed to be complete.
classifiers = metadata.get_all('Classifier')
We used pip-licenses to collect the licenses of our dependencies. This package seems to collect the licenses from both the classifiers and the metadata: https://github.com/raimon49/pip-licenses?tab=readme-ov-file#option-from You might find that useful for collecting all sorts of data from a package.
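As a rough sketch of combining both sources (this is not how pip-licenses implements it; `license_candidates` is a hypothetical helper), the License field and the license classifiers could be merged like this:

```python
LICENSE_PREFIX = "License :: "


def license_candidates(license_field, classifiers):
    """Collect licence strings from the License field and from classifiers.

    Hypothetical helper for illustration only.
    """
    candidates = []
    # Assumption: some build tools write "UNKNOWN" for a missing field.
    if license_field and license_field.strip().upper() != "UNKNOWN":
        candidates.append(license_field.strip())
    for classifier in classifiers or []:
        if classifier.startswith(LICENSE_PREFIX):
            # Keep the last segment, e.g. "Apache Software License".
            candidates.append(classifier.split(" :: ")[-1])
    return candidates


print(license_candidates(
    "UNKNOWN",
    ["License :: OSI Approved :: Apache Software License",
     "Programming Language :: Python :: 3"],
))
```

The inputs could come from `metadata.get('License')` and `metadata.get_all('Classifier')`.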
Thanks @dlebelcimmi.
The licence identifiers in an SBOM need to be SPDX licence identifiers. However, the licenses in the classifiers are not valid SPDX licence identifiers. sbom4python does use the license classifier and will attempt to translate it into a valid SPDX licence identifier. For example, License :: OSI Approved :: Apache Software License should be translated into the SPDX licence identifier Apache-2.0.
I will look at creating a mapping from the license classifier to valid SPDX identifiers as this might be a useful enhancement (pip-licenses does not appear to be doing this).
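Such a mapping might look like the sketch below. The table is illustrative and deliberately tiny; it is not the mapping shipped in lib4sbom or sbom4python, and `classifier_to_spdx` is a hypothetical name. The fallback uses the SPDX value NOASSERTION for classifiers without a known translation.

```python
# Illustrative table only; a real mapping would cover many more classifiers.
CLASSIFIER_TO_SPDX = {
    "License :: OSI Approved :: Apache Software License": "Apache-2.0",
    "License :: OSI Approved :: MIT License": "MIT",
}


def classifier_to_spdx(classifier):
    """Translate a trove license classifier into an SPDX identifier.

    Returns "NOASSERTION" when the classifier is not in the table.
    """
    return CLASSIFIER_TO_SPDX.get(classifier.strip(), "NOASSERTION")


print(classifier_to_spdx("License :: OSI Approved :: Apache Software License"))
```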
That's good news! Thanks!
Update to the latest version of lib4sbom for updated licence mappings