Some package entries have wrong package metadata

Open dlebelcimmi opened this issue 1 year ago • 6 comments

We use sbom4python to construct an SBOM for our product. Our environment was built around Python 3.9, and the SBOM generation seemed correct. After upgrading our environment to Python 3.12, we noticed an issue with some packages, especially numpy and scipy.

The SBOM entries for these packages are replaced by other packages when running sbom4python under Python 3.12:

numpy becomes GCC runtime library
scipy becomes libquadmath

In fact, these are sub-dependencies of the related packages. The output of "pip show numpy" displays the metadata attributes of the package (Name:, Version:, etc.). But for these big packages, pip show also outputs some information about sub-dependencies.

In Python 3.9, the sub-dependency metadata was indented in the output. In Python 3.12, it is no longer indented, so it ends up mingled at the same level as the package's own metadata. For example, let's compare the output of this command under the two Python versions:

(python 3.9)

> pip show scipy | findstr Name:

Name: scipy
        Name: OpenBLAS
        Name: LAPACK
        Name: GCC runtime library
        Name: libquadmath

(python 3.12)

> pip show scipy | findstr Name:

Name: scipy
Name: OpenBLAS
Name: LAPACK
Name: GCC runtime library
Name: libquadmath

sbom4python uses pip show to recover package metadata and builds a metadata dictionary by parsing its output. https://github.com/anthonyharrison/sbom4python/blob/f377631a68fffa2be3e1451f2f8d231816b965ca/sbom4python/scanner.py#L245

https://github.com/anthonyharrison/sbom4python/blob/f377631a68fffa2be3e1451f2f8d231816b965ca/sbom4python/scanner.py#L252

Since the entries for the displayed package and its sub-packages are at the same level in Python 3.12, every sub-package entry whose key is already in the dictionary overwrites the current package's entry. The last sub-package described by pip show takes precedence and replaces the actual package description. Furthermore, some metadata attributes are mingled in the final metadata.
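To illustrate the overwrite, here is a minimal sketch. It is not the sbom4python code: SAMPLE_OUTPUT is a hypothetical excerpt that reuses only the Name: lines shown above, and both parser functions are made-up names for illustration. The second function shows one possible stopgap (keep the first value seen for each key), assuming the package's own fields always come before the sub-package ones.

```python
# Hypothetical excerpt mimicking the unindented `pip show scipy` output
# reported above for Python 3.12 (only the Name: lines are reproduced).
SAMPLE_OUTPUT = """\
Name: scipy
Name: OpenBLAS
Name: LAPACK
Name: GCC runtime library
Name: libquadmath
"""


def naive_parse(text):
    """Build a dict from 'Key: value' lines; a repeated key overwrites the earlier value."""
    metadata = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata


def first_wins_parse(text):
    """Same parsing, but keep the first value seen for each key.

    Indented lines are skipped, since in the older output they belonged
    to sub-packages rather than to the package itself.
    """
    metadata = {}
    for line in text.splitlines():
        if line[:1].isspace():
            continue
        key, sep, value = line.partition(":")
        if sep:
            metadata.setdefault(key.strip(), value.strip())
    return metadata


print(naive_parse(SAMPLE_OUTPUT)["Name"])       # libquadmath
print(first_wins_parse(SAMPLE_OUTPUT)["Name"])  # scipy
```

Keeping the first value is fragile, because it still depends on the order and layout of the text output.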

For these reasons, I would urge you to replace the recovery of metadata with a more structured and robust approach (https://github.com/anthonyharrison/sbom4python/issues/17).
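As a rough sketch of what a structured approach could look like (an assumption on my side, not necessarily what #17 will implement), the standard library's importlib.metadata returns the installed package's metadata as a parsed message object, so only real header fields are read. The helper names summarize and summarize_installed, and the SAMPLE_METADATA text with its placeholder version, are hypothetical.

```python
from email import message_from_string
from importlib import metadata

# Hypothetical METADATA text: header fields first, then free text after
# the blank line that happens to contain more "Name:" lines.
SAMPLE_METADATA = """\
Metadata-Version: 2.1
Name: scipy
Version: 0.0.0
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 3

Name: OpenBLAS
Name: libquadmath
"""


def summarize(msg):
    """Pick a few fields out of an already parsed metadata message."""
    return {
        "name": msg.get("Name"),
        "version": msg.get("Version"),
        "license": msg.get("License"),  # None when the field is absent
        "classifiers": msg.get_all("Classifier") or [],
    }


def summarize_installed(dist_name):
    """Same summary for an installed distribution.

    Raises importlib.metadata.PackageNotFoundError if it is not installed.
    """
    return summarize(metadata.metadata(dist_name))


info = summarize(message_from_string(SAMPLE_METADATA))
print(info["name"])              # scipy
print(len(info["classifiers"]))  # 2
```

For an installed package, summarize_installed("scipy") would return the same kind of dictionary without going through the pip show text output at all.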

dlebelcimmi avatar Sep 04 '24 19:09 dlebelcimmi

I am currently working on addressing #17.

anthonyharrison avatar Sep 07 '24 12:09 anthonyharrison

@dlebelcimmi Any improvement with the latest version of sbom4python?

anthonyharrison avatar Jan 26 '25 15:01 anthonyharrison

Thank you for the update. I am not currently working on the project where I used sbom4python. I informed my team of the December releases. I don't know if it's going to be a priority to update sbom4python. If I have updates, I will post them.

Thank you again!

dlebelcimmi avatar Jan 28 '25 14:01 dlebelcimmi

Hi @anthonyharrison. Yes, it appears to have fixed the issue. I had to fix the output by patching SBOMScanner.process_module() to prevent package metadata overwriting. Now both scipy and numpy seem to be included correctly.

While this seems to be a good improvement, we expected the licenses to be provided by the metadata. But it appears that the license metadata is not strictly enforced.

It seems that Python classifiers might be a more standard source of truth (https://pypi.org/classifiers/) even though they are not guaranteed to be complete.

classifiers = metadata.get_all('Classifier')

We used pip-licenses to collect the licenses of our dependencies. This package seems to collect the licenses from both the classifiers and the metadata: https://github.com/raimon49/pip-licenses?tab=readme-ov-file#option-from You might find it useful for collecting all sorts of data from a package.

dlebelcimmi avatar Feb 11 '25 21:02 dlebelcimmi

Thanks @dlebelcimmi.

The licence identifiers in an SBOM need to be SPDX licence identifiers. However, the licenses in the classifiers are not valid SPDX licence identifiers. sbom4python does use the license classifier and will attempt to translate it into a valid SPDX licence identifier. For example, License :: OSI Approved :: Apache Software License should be translated into the SPDX licence identifier Apache-2.0.

I will look at creating a mapping from the license classifier to valid SPDX identifiers as this might be a useful enhancement (pip-licenses does not appear to be doing this).
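A minimal sketch of what such a mapping could look like is below. It is only an illustration, not the lib4sbom or sbom4python implementation: CLASSIFIER_TO_SPDX and spdx_from_classifiers are hypothetical names, and the table covers just a handful of classifiers.

```python
# Hypothetical, deliberately incomplete mapping from trove license
# classifiers to SPDX licence identifiers.
CLASSIFIER_TO_SPDX = {
    "License :: OSI Approved :: Apache Software License": "Apache-2.0",
    "License :: OSI Approved :: MIT License": "MIT",
    "License :: OSI Approved :: ISC License (ISCL)": "ISC",
    "License :: OSI Approved :: Mozilla Public License 2.0 (MPL 2.0)": "MPL-2.0",
    # "License :: OSI Approved :: BSD License" is left out on purpose:
    # the classifier does not say which BSD variant applies.
}


def spdx_from_classifiers(classifiers):
    """Return the SPDX identifiers found for the given classifiers.

    Non-license and unmapped classifiers are ignored; duplicates are
    dropped while preserving order.
    """
    identifiers = []
    for classifier in classifiers:
        spdx_id = CLASSIFIER_TO_SPDX.get(classifier.strip())
        if spdx_id and spdx_id not in identifiers:
            identifiers.append(spdx_id)
    return identifiers


print(spdx_from_classifiers([
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: Apache Software License",
]))  # ['Apache-2.0']
```

An empty result means no classifier could be translated, in which case another source such as the License metadata field would be needed.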

anthonyharrison avatar Feb 12 '25 07:02 anthonyharrison

That's good news! Thanks!

dlebelcimmi avatar Feb 13 '25 14:02 dlebelcimmi

Update to the latest version of lib4sbom for updated licence mappings.

anthonyharrison avatar Jun 21 '25 18:06 anthonyharrison