scancode.io icon indicating copy to clipboard operation
scancode.io copied to clipboard

Enhance Alpine package scan results

Open aalexanderr opened this issue 3 years ago • 17 comments

Alpine packages lack some important info like copyrights or where the source code is located. This info can't be gathered from the packages themselves as its just not there. To get this info we need to: download aports repo & for each pkg check it out on commit specific to alpine package (via fetchcode) parse APKBUILD nexB/scancode-toolkit#2541 download package sources (fetchcode) & amend new info to package's scan results

Discussed a bit with @pombredanne Most likely @quepop will PR it

The question is- should it be standard behavior when alpine based docker is being scanned or should it be a separate pipeline?

aalexanderr avatar Jun 08 '21 14:06 aalexanderr

Related: https://github.com/nexB/scancode-toolkit/issues/2541#issuecomment-856904754

quepop avatar Jun 08 '21 16:06 quepop

@pombredanne should I PR it?

quepop avatar Jun 10 '21 06:06 quepop

@quepop please do! that's much easier for review Of note:

  • I am not super comfy (that's an understatement) with sourcing arbitrary shell scripts at https://github.com/quepop/scancode.io/commit/2f59440b97cfc89e4bc8291d7839188d25e0fb34#diff-a6e238494a11c38a54d401da6211e8e3461898f10e22ff6673a7896ffedbbd3eR43
  • I am not sure we want the alpine package getter to always fetch packages. If anything I know several users run in an airgap'ed environment and they surely will not want (and will not be able anyway) to fetch things from the outside world

pombredanne avatar Jun 10 '21 08:06 pombredanne

@pombredanne I don't think there is any other way. In order to extract source variable value (or any other value) we need to run the script (APKBUILD file), because in some of the alpine packages, source variable is built in a for loop (not to mention that sometimes bash string formatting is used).

From pkg:alpine/[email protected] APKBUILD file:

_arch_keys="
        aarch64:[email protected]
        armhf:[email protected]
 
        x86:[email protected]
        x86,x86_64:[email protected]
        x86_64:[email protected]
 
        ppc64le:[email protected]
 
        s390x:[email protected]
 
        mips64:[email protected]
        "
 
for _i in $_arch_keys; do
        source="$source ${_i#*:}"
done

quepop avatar Jun 10 '21 22:06 quepop

@quepop good point, but do we need these (e.g. these _arch_keys) ?

pombredanne avatar Jun 10 '21 22:06 pombredanne

@pombredanne We need the source variable to be complete (to be able to download sources and scan them for copyrights). I tried to emphasize (by giving an example) that the only way we get a complete source variable every time is by running the script.

quepop avatar Jun 10 '21 23:06 quepop

@pombredanne Regarding https://github.com/nexB/scancode-toolkit/issues/2541#issuecomment-858408580, how should I name these new files (pipe, pipeline)? My idea is /pipes/alpine_helper.py and /pipelines/alpine_complement.py

quepop avatar Jun 24 '21 20:06 quepop

@quepop There's already a pipes/alpine.py pipes module in which you can add the new functions. For the pipeline, what about "alpine_packages.py", unless it's too specific?

tdruez avatar Jun 25 '21 05:06 tdruez

@tdruez In the case of multi-image projects we cannot match DiscoveredPackage from the database to the image (and alpine version) they are from - db entry for DiscoveredPackage only include project_id (no image_id or image_index field). Alpine version is only present in project.extra_data.images array. It's a problem because in the aports repository every alpine version is a certain branch and we need to pull them and make checkouts (branch and then commit-id) to be able to extract correct APKBUILDs.

My solution is to add a new database field to the DiscoveredPackage class named image_index which would store an index (project.extra_data.images array) of the image it is from from. Is it acceptable?

quepop avatar Jul 01 '21 02:07 quepop

@quepop what about adding the extra_data field on the DiscoveredPackage model, for consistency with Project and CodebaseResource models? Let me know if that would work for you and I'll make that change upstream.

tdruez avatar Jul 01 '21 05:07 tdruez

Repasting from https://github.com/quepop/scancode.io/commit/2f59440b97cfc89e4bc8291d7839188d25e0fb34#r52902214 for reference : @quepop

I have been thinking more about this for nexB#191 and sourcing arbitrary bash script this way is too much of a security concern.

An alternative could include:

  • using a container, jail or some sorts of chroot to try to minimize the risk a little
  • implement a simple bash/shell script parser and perform parameter expansion

I think 2. is best and much less involved than having a container depdendency. Furthermore, there could be other variables of interest in an APKBUILD and we need eventually to parse other shell-based manifests to extract metadata such as PKGBUILD (Arch), ebuild (Gentoo) m4 (Autotools) and a few more.

Therefore I am implementing this that can then be reused here:

  • using a simplified Pygments shell lexer derived from https://github.com/pygments/pygments/blob/master/pygments/lexers/shell.py
  • adding lightweight parsing to recognize variables declaration possibly based on https://github.com/nexB/pygmars/ ... or just hand crafted as this is rather simple once the harder lexing is done above
  • doing parameter expansion with @kojiromike https://github.com/kojiromike/parameter-expansion/ and this PR I pushed yesterday kojiromike/parameter-expansion#9 to address some bash'ism used and needed for APKGBUILD handling

pombredanne avatar Jul 01 '21 08:07 pombredanne

@quepop what about adding the extra_data field on the DiscoveredPackage model, for consistency with Project and CodebaseResource models? Let me know if that would work for you and I'll make that change upstream.

:+1: ... this is being added too to SCTK FWIW

pombredanne avatar Jul 01 '21 08:07 pombredanne

@pombredanne sounds great but what about APKBUILDs like these: https://github.com/nexB/scancode.io/issues/191#issuecomment-859129456? Will it work? @tdruez Yes, it would be great.

quepop avatar Jul 01 '21 14:07 quepop

@quepop extra_data added on the DiscoveredPackage model https://github.com/nexB/scancode.io/pull/222 Make sure to use the DiscoveredPackage.update_extra_data() API, see https://github.com/nexB/scancode.io/blob/main/scanpipe/models.py#L197

tdruez avatar Jul 01 '21 16:07 tdruez

@tdruez @pombredanne

There is another issue that i found looking at my database. Some alpine images like docker://alpine:3.13 have subpackages installed from a parent package that is not installed (libcrypto, libssl from openssl). Handling cases like these isn't clear for me (because running the scan on the parent still leaves a problem of deciding what subset of the scan's output is applicable to the parent's children) I've decided that in cases where a parent package is present, my code will update only its missing information (not its chilrdren).

quepop avatar Jul 05 '21 03:07 quepop

@quepop not sure what to do wrt. subpackages... can you provide a concrete example with links and may sample APKBUILD snippets?

pombredanne avatar Jul 16 '21 12:07 pombredanne

re:

@quepop not sure what to do wrt. subpackages... can you provide a concrete example with links and may sample APKBUILD snippets?

Actually the new approach in https://github.com/nexB/scancode-toolkit/blob/b7d070fc788f4c233df1f31d735e1ac5b3aa2d29/src/packagedcode/alpine.py#L144 could be enhanced to also collect these.

  1. collect the variable and resolve that lists all the subpackages such as https://gitlab.alpinelinux.org/alpine/aports/-/blob/9238f96475ab45281ce3d0cdfbf37efcbdeb8888/main/openssl/APKBUILD#L14
  2. for each of these, parse the corresponding bash function with the same name and extract the variables defined inside a back function

Then merge subpackages variables with the base variables.

This way declared licensed, description and other sub-package-specific metadata could be collected.

Note that this would work whereas the approach to just eval the APKBUILD with bash would not be able to collect variables defined inside a function, as this would require to run the function, which is impractical since the function names are not known (unless parsed ...) .

pombredanne avatar Aug 10 '21 12:08 pombredanne