scancode.io Design ways to share ScanCode.io scans between DejaCode and PurlDB

Open pombredanne opened this issue 10 months ago • 0 comments

- DejaCode has a captive ScanCode.io worker and pushes scan requests to it based on UI requests.
- PurlDB now has one or more ScanCode.io workers that are tethered to PurlDB and consume its queue of pending scans. (Until now it had a captive ScanCode.io worker, like DejaCode.)

When I deploy a full AboutCode stack, I would like to avoid scanning the same code twice: once in the DejaCode (DJCD) ScanCode.io (SCIO) instance(s) and once in the PurlDB SCIO instance(s).

Why? Because duplicated scans waste compute resources and slow down access to scans.

Here are a few solution elements to avoid re-running scans:

- We could avoid running two scan queues, one in DJCD and one in PurlDB. DejaCode would always request a scan through PurlDB and therefore would not need its own captive ScanCode.io.
  - A caveat is that a dedicated private ScanCode.io may have been configured for privileged access to private resources, and we would have to support this somehow.
- We could design ways to query a pool of SCIO instances for a scan project and get the scan if it is available.
  - The difficult issue here is how to identify which scan project matches the code/scan combination we need. This identity includes the project name, but also the project inputs, the set of pipelines that ran in a certain sequence, the state of these pipeline runs, and the actual version of SCIO used for a run.
  - In this latter approach, we could expose a lookup endpoint on SCIO that would return projects that match the input checksums, the pipelines and their state, and the SCIO version. The version could possibly be abstracted to a simple number, since a newer version of SCIO may return exactly the same results for a given set of pipelines: not all SCIO version changes invalidate all the scan runs.
  - Or, using the same concepts, we could shortcut and speed up the scans in an SCIO when a project is identified as the same as a previous project by the criteria above.
- Another approach would be to rely on the emerging FederatedCode approach, where scans are serialized and stored for later access and reuse.
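
To make the "scan project identity" idea from the lookup approach more concrete, here is a minimal Python sketch of how a lookup key could be derived from the input checksums, the ordered pipeline runs with their state, and an abstracted results version number. Everything in it is an assumption for illustration only: `PipelineRun`, `scan_fingerprint`, `find_reusable_project`, the `results_version` number, the status labels, and the in-memory `index` dict are hypothetical and are not existing ScanCode.io, DejaCode, or PurlDB APIs. It also assumes that the order of inputs does not matter while the order of pipelines does.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineRun:
    """Hypothetical record of one pipeline run on a project."""
    name: str    # pipeline name (illustrative value, e.g. "scan_single_package")
    status: str  # hypothetical state label, e.g. "success"


def scan_fingerprint(input_sha256s, pipeline_runs, results_version):
    """Return a stable hex key describing a scan project identity.

    results_version is the hypothetical "simple number" that only changes
    when a SCIO release would change scan results.
    """
    payload = {
        # Assumption: input order is irrelevant, so sort the checksums.
        "inputs": sorted(input_sha256s),
        # Pipelines run in sequence, so their order is preserved.
        "pipelines": [[run.name, run.status] for run in pipeline_runs],
        "results_version": results_version,
    }
    # Canonical JSON so that equal identities always hash to the same key.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_reusable_project(index, input_sha256s, pipeline_runs, results_version):
    """Return the project stored under the matching key, or None.

    index is a plain dict standing in for whatever store a lookup
    endpoint would query.
    """
    key = scan_fingerprint(input_sha256s, pipeline_runs, results_version)
    return index.get(key)
```

With a key like this, a requester could compute the fingerprint before submitting a scan and only queue a new project when the lookup returns nothing. Whether the project name should also be part of the key is left open here.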

Mar 27 '24 08:03 pombredanne