scancode-toolkit
Implement Alpine APKBUILD parser in packagedcode
Short Description
Add Alpine's APKBUILD (apk package recipe) parser that would live in src/packagedcode/alpine_build.py and return a Package object.
Possible Labels
copyright scan, email and url scan, license scan
Select Category
- [ x ] Enhancement
- [ x ] Add License/Copyright
- [ ] Scan Feature
- [ ] Packaging
- [ ] Documentation
- [ x ] Expand Support
- [ ] Other
Describe the Update
Alpine packages lack some of the information needed to generate a compliance report (e.g. copyright, full license text, source code and patches). That information is available only in the aports repository, specifically in the APKBUILD files (each package references a commit SHA in the aports repo). This code would later be used in scancode.io to create a pipeline that would get the recipes for packages -> parse them and get source code, patches, etc. -> scan them and add the missing information gathered from the package recipe and its code.
How This Feature will help you/your organization
At ONAP we're trying to switch our images to Alpine, as it is a GPLv3-free base image (the ONAP Technical Steering Committee decided to avoid GPLv3 as much as possible). This will be a brick towards having complete information about Alpine packages in scancode.io, so that compliance documentation can be generated.
Possible Solution/Implementation Details
One issue found so far is that the recipes use Bash parameter substitution, which needs to be handled.
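To illustrate what handling parameter substitution without running a shell could look like, here is a minimal Python sketch. It is not the parser proposed in this issue: the names `expand` and `_strip` are hypothetical, only a handful of operators are covered (`%`, `%%`, `#`, `##`, `/`, `//`), glob matching is approximated with `fnmatch`, and the search string in `/` and `//` is treated as a literal rather than a pattern.

```python
import fnmatch
import re

# Matches $name, ${name} and ${name<op><arg>} for a few common operators.
_PARAM = re.compile(
    r"\$(?:(?P<bare>[A-Za-z_][A-Za-z0-9_]*)"
    r"|\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)"
    r"(?P<op>%%|%|##|#|//|/)?(?P<arg>[^}]*)\})"
)


def _strip(value, op, pattern):
    """Remove the shortest/longest suffix (%, %%) or prefix (#, ##) matching pattern."""
    n = len(value)
    if op in ("%", "%%"):
        # A candidate suffix starts at index i; the shortest one has the largest i.
        indexes = range(n, -1, -1) if op == "%" else range(0, n + 1)
        for i in indexes:
            if fnmatch.fnmatchcase(value[i:], pattern):
                return value[:i]
    else:
        # A candidate prefix ends at index i; the shortest one has the smallest i.
        indexes = range(0, n + 1) if op == "#" else range(n, -1, -1)
        for i in indexes:
            if fnmatch.fnmatchcase(value[:i], pattern):
                return value[i:]
    return value


def expand(text, variables):
    """Expand variable references in text using the given dict of known variables."""

    def replace(match):
        if match.group("bare"):
            # Assumption: unknown variables expand to "", as in a shell.
            return variables.get(match.group("bare"), "")
        value = variables.get(match.group("name"), "")
        op, arg = match.group("op"), match.group("arg")
        if op is None:
            # Plain ${name}; unsupported forms such as ${name:-x} are left untouched.
            return value if not arg else match.group(0)
        if op in ("/", "//"):
            old, _, new = arg.partition("/")
            if not old:
                return value
            return value.replace(old, new) if op == "//" else value.replace(old, new, 1)
        return _strip(value, op, arg)

    return _PARAM.sub(replace, text)


variables = {"pkgname": "foo", "pkgver": "1.2.3"}
print(expand("https://example.org/$pkgname/${pkgver%.*}/$pkgname-$pkgver.tar.gz", variables))
# prints https://example.org/foo/1.2/foo-1.2.3.tar.gz
```

Leaving unsupported forms untouched, rather than guessing, makes it easy to spot recipes that need an operator the sketch does not cover.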
Example/Links if Any
https://wiki.alpinelinux.org/wiki/APKBUILD_Reference https://wiki.alpinelinux.org/wiki/APKBUILD_examples:Multiple_Subpackages
a bit related to #2061
Can you help with this Feature
@quepop
As @aalexanderr probably mentioned, I'm working on the feature right now. I've already implemented fetching and parsing, but I'm not sure how to split my code so it fits properly. I think it should look something like this:
- When scancode.io analyzes a new Alpine Docker image, it requests a package object list from packagedcode/alpine.
- packagedcode/alpine extracts the installed packages and their info from the Alpine db that lives inside that Docker image.
- build_package() or get_installed_packages() runs some function(s) from packagedcode/alpine_build to extract the missing data before returning the package object(s).
- packagedcode/alpine_build downloads the needed resources (the aports repo) using fetchcode and parses the package-specific APKBUILD to extract source code download URLs and possibly more missing data.
Should packagedcode/alpine_build only provide source URLs for a package (so the rest would be handled in scancode.io), or should it also handle copyright extraction from the source code?
The latter would be consistent with how, for example, packagedcode/debian handles copyrights: scancode.io receives a package object list that already has copyright info.
@quepop
I understand it as follows:
The APKBUILD parser should currently live in scancode-toolkit, in packagedcode/alpine_build.py, as it is the most logical place to have it right now without creating a new package.
IMHO handling the aports repo (downloading it, checking it out at specific commits) should be done in scancode.io (using fetchcode), since from what I understand scancode-toolkit does not download any supporting material; it just analyzes what is given to it.
Later down the line, both alpine_build.py and the aports repo handling could be separated into an alpine-inspector package (a bit similar to https://github.com/nexB/debian-inspector).
I think we should use a cache dir to be able to reuse scan results (their id would be a combination of a package name and its version) so executing a pipeline on a new alpine docker image (project) could save some time (if ofc said alpine docker image has a package name - version combination that existed in previous projects)
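A rough sketch of such a cache, assuming results are stored as one JSON file per package name and version. The helper names `cache_path` and `get_or_scan`, the file layout, and the `scan` callback are all hypothetical and not part of scancode.io:

```python
import json
from pathlib import Path
from urllib.parse import quote


def cache_path(cache_dir, name, version):
    """Return the cache file for a package name and version combination."""
    # Quote both parts so characters such as "/" cannot escape the cache dir.
    key = f"{quote(name, safe='')}@{quote(version, safe='')}.json"
    return Path(cache_dir) / key


def get_or_scan(cache_dir, name, version, scan):
    """Return cached results if present, else call scan() and store its result."""
    path = cache_path(cache_dir, name, version)
    if path.exists():
        return json.loads(path.read_text())
    results = scan(name, version)  # hypothetical callback returning JSON-serializable data
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results))
    return results
```

A second project that contains the same package name and version would then read the stored file instead of scanning again.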
@quepop Thank you++ I think in terms of code organization, things that are specific to Alpine should be in an alpine module. Things that would be generic (such as downloading each detected package sources and scanning for licenses) may be best in scancode.io for now?
This needs a bit of thinking, though do not let that slow you down! Here is a quick idea as a base:
- Have a new Alpine-specific module in scancode.io /pipes/ that fetches and scans an Alpine package. The input would be a package URL and, if needed, an Alpine version. The output would be updated package information (e.g. more or less a packagedcode.models.Package data structure).
- For now, have a new pipeline with very few steps that would loop through the DB searching for installed Alpine packages and call the above for each, then save the results in the DB.
@tdruez ^ FYI.
As discussed in https://github.com/nexB/purldb/issues/307 I am not super comfy with running arbitrary shell scripts during a scan. I reckon that an APKBUILD may not be completely arbitrary and random, but once plugged in as a package manifest parser we could stumble on ill-formed or ill-intentioned and malicious APKBUILD files. Therefore static parsing and evaluation would be a much better approach: even though there could be a few kinks to handle left and right at scale, it feels much safer.
For this I started this PR https://github.com/nexB/scancode-toolkit/pull/2598 that can parse and evaluate top-level variables in an APKBUILD. It does not deal with subpackages defined in functions for now, but evaluating an APKBUILD in a shell would not either, and the build would need to be launched to get the full details anyway.
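For readers unfamiliar with the idea, static evaluation of top-level variables can be sketched roughly as below. This is an illustration only and not the code in the PR: the function name `parse_top_level_variables` is hypothetical, only assignments starting at column 0 are considered, escaped quotes are not handled, and only plain `$name` and `${name}` references are expanded.

```python
import re

# A top-level assignment: name=value at the start of a line, where the value is
# double-quoted (may span lines), single-quoted, or an unquoted word.
_ASSIGN = re.compile(
    r"^(?P<name>[A-Za-z_][A-Za-z0-9_]*)="
    r"(?:\"(?P<dq>[^\"]*)\"|'(?P<sq>[^']*)'|(?P<bare>\S*))",
    re.MULTILINE,
)
_REF = re.compile(r"\$(?:\{(\w+)\}|(\w+))")


def parse_top_level_variables(text):
    """Collect top-level variables of an APKBUILD-like text without executing it."""
    variables = {}
    for match in _ASSIGN.finditer(text):
        if match.group("sq") is not None:
            # Single quotes: the shell performs no expansion.
            value = match.group("sq")
        else:
            raw = match.group("dq") if match.group("dq") is not None else match.group("bare")
            # Expand references to variables defined earlier in the file.
            value = _REF.sub(lambda r: variables.get(r.group(1) or r.group(2), ""), raw)
        variables[match.group("name")] = value
    return variables


SAMPLE = '''\
pkgname=foo
pkgver=1.2.3
pkgrel=0
license="MIT"
source="https://example.org/$pkgname-${pkgver}.tar.gz
	fix-build.patch"

build() {
	make
}
'''

variables = parse_top_level_variables(SAMPLE)
print(variables["source"].split())
# prints ['https://example.org/foo-1.2.3.tar.gz', 'fix-build.patch']
```

Nothing in the file is executed, so a malicious `build()` body is never run; the trade-off is that anything computed inside functions stays invisible to the parser.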