Scan package files and extract for packages

Open AyanSinhaMahapatra opened this issue 9 months ago • 0 comments

In all the following pipelines:

  • rootfs
  • docker
  • docker-windows
  • scan_codebase

when we scan files for licenses, copyrights, and other data, we skip any codebase resource that already has a status set before this step. As a result, anything tagged as application-package or system-package is not scanned.
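To make the skip concrete, here is a minimal sketch of the behavior described above. It does not use the real scancode.io models: `Resource` and `select_for_scan` are hypothetical names for illustration, and an empty `status` is assumed to mean "not analyzed yet".

```python
from dataclasses import dataclass


# Hypothetical stand-in for a codebase resource; NOT the real scancode.io model.
@dataclass
class Resource:
    path: str
    status: str = ""  # assumption: empty string means no status has been set


def select_for_scan(resources):
    """Return only the resources without a status, mirroring the described skip."""
    return [resource for resource in resources if not resource.status]


resources = [
    Resource("usr/lib/libfoo.so", status="system-package"),
    Resource("app/package.json", status="application-package"),
    Resource("app/src/main.py"),
]

to_scan = select_for_scan(resources)
skipped = [resource.path for resource in resources if resource.status]

print("scanned:", [resource.path for resource in to_scan])
print("skipped:", skipped)
```

In this sketch both package-tagged files end up in `skipped`, which is the gap this issue is about.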

In the match_not_analyzed_to_system_packages pipe of the rootfs pipeline, we match all codebase resources that are part of a package to the discovered package object and also update their status to system-package. (It seems that we previously did the same for application packages with the match_not_analyzed_to_application_packages function, but that function is no longer used anywhere.)

Similarly, in the docker pipelines, the create_system_package function of the collect_and_create_system_packages step updates the status of package files to system-package.

We can either:

  1. stop tagging the status of files which are part of a system-package
  2. or re-scan all package files tagged as system/application package

In this PR I've tried the second approach, since this is also what we do in SCTK. Here, however, we have to add a new argument, update_status, and pass it to the function that saves scan data to resources after the scan. This keeps the system-package or application-package status of a codebase resource from being overwritten with scanned, which was a side effect of the file scans.
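A rough sketch of how such a flag could work, under stated assumptions: only the update_status argument and the status values come from this issue. `Resource`, `save_scan_results`, `rescan_package_files`, and `fake_scan` are hypothetical names, not the actual scancode.io functions.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for a codebase resource; NOT the real scancode.io model.
@dataclass
class Resource:
    path: str
    status: str = ""
    scan_data: dict = field(default_factory=dict)


PACKAGE_STATUSES = {"system-package", "application-package"}


def save_scan_results(resource, scan_data, update_status=True):
    """Store scan data; only set the status to "scanned" when update_status is True."""
    resource.scan_data = scan_data
    if update_status:
        resource.status = "scanned"


def rescan_package_files(resources, scan_func):
    """Scan package-tagged resources while leaving their package status intact."""
    for resource in resources:
        if resource.status in PACKAGE_STATUSES:
            save_scan_results(resource, scan_func(resource.path), update_status=False)


def fake_scan(path):
    # Placeholder scanner for the sketch; a real scan would return detections.
    return {"path": path, "licenses": []}


package_file = Resource("usr/lib/libfoo.so", status="system-package")
plain_file = Resource("app/src/main.py")

rescan_package_files([package_file, plain_file], fake_scan)
save_scan_results(plain_file, fake_scan(plain_file.path))  # default behavior

print(package_file.status, plain_file.status)
```

Defaulting update_status to True keeps the existing behavior for regular file scans, so only the re-scan of package files needs to opt out.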

Since all these pipelines already scanned application package files (those that were not metadata files or lockfiles), I'm assuming we also want to scan the metadata files that were not being scanned. Otherwise #762 does not make any sense. Note that a license scan done as part of a package scan (parsing the manifest and then running license detection only on the extracted part) can give different results for some complex files than a plain license scan of the file, and we might need to improve how SCTK handles this to avoid confusion. See https://github.com/nexB/scancode-toolkit/issues/3024 for details.

Reference: https://github.com/nexB/scancode.io/issues/762
Reference: https://github.com/nexB/scancode.io/issues/1194
Reference: https://github.com/nexB/scancode.io/issues/83

AyanSinhaMahapatra, May 06 '24 14:05