Design ways to share ScanCode.io scans between DejaCode and PurlDB
Context
- DejaCode has a captive ScanCode.io worker and pushes scan requests there based on UI requests
- PurlDB now has one or more ScanCode.io workers that are tethered to it and consume a PurlDB queue of pending scans (until now it had a captive ScanCode.io worker, like DejaCode)
Problem
When I deploy a full AboutCode stack, I would like to avoid scanning the same code twice: once in the DJCD SCIO(s) and once in the PurlDB SCIO(s).
Why? Because duplicated scans waste compute resources and slow down access to scans.
Solution
Here are a few solution elements to avoid re-running scans:
- We could avoid running two scan queues: one in DJCD and one in PurlDB. Here DejaCode could always request a scan through PurlDB and therefore would not have its own captive ScanCode.io.
  - A caveat is that a dedicated private ScanCode.io may have been configured for privileged access to private resources and we would have to support this somehow.
- We could design ways to query for a scan project from a pool of SCIO instances to get the scan if available.
  - Here the difficult issue is how to identify which scan project matches the code/scan combo we need. This identity includes the project name, but also the project inputs, the set of pipelines that ran in a certain sequence, the state of these pipeline runs, and the actual version of the SCIO used for a run.
  - In this latter approach, we could expose a lookup endpoint on SCIO that would return projects that match the input checksums, the pipelines and their state, and the SCIO version. The version could possibly be abstracted to a simple number, as there are cases where a newer version of SCIO returns exactly the same results for a given set of pipelines: not all SCIO version changes invalidate all the scan runs.
  - Or, using the same concepts, we could shortcut and speed up the scans in an SCIO when a project is identified as the same as previous projects by the same criteria as above.
- Another approach would be to rely on the emerging approach of FederatedCode where scans are serialized and stored for later access and reuse.
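
To make the scan identity idea above more concrete, here is a minimal sketch of how a lookup key could be derived from the input checksums, the ordered pipeline runs with their state, and a simple results-compatibility number standing in for the SCIO version. Everything here is an assumption for discussion and not existing ScanCode.io, PurlDB or DejaCode API: the names `PipelineRun`, `scan_fingerprint`, `find_existing_scan` and `results_compat_version` are hypothetical, and the pool of SCIO instances is modeled as plain in-memory dictionaries rather than a real endpoint.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineRun:
    """One pipeline run of a project (hypothetical model, not the SCIO one)."""

    name: str    # pipeline name
    status: str  # illustrative status label, e.g. "success" or "failure"


def scan_fingerprint(input_sha256s, pipeline_runs, results_compat_version):
    """Return a stable hex digest identifying a code/scan combination."""
    payload = {
        # The order of the inputs does not matter, so sort their checksums.
        "inputs": sorted(input_sha256s),
        # The order of the pipelines matters, so keep the sequence as given.
        "pipelines": [[run.name, run.status] for run in pipeline_runs],
        # Assumed simple number, bumped only when results would change.
        "compat": results_compat_version,
    }
    # Canonical JSON so that equal payloads always hash to the same digest.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_existing_scan(fingerprint, pool_indexes):
    """
    Return (instance_name, project_id) for the first SCIO instance that
    already has a project with this fingerprint, or None if there is none.
    `pool_indexes` maps an instance name to a {fingerprint: project_id} dict.
    """
    for instance_name, index in pool_indexes.items():
        project_id = index.get(fingerprint)
        if project_id is not None:
            return instance_name, project_id
    return None
```

With such a key, DejaCode or PurlDB could first call something like `find_existing_scan()` against the known SCIO instances and only queue a new scan when it returns `None`. The same key could also be used inside a single SCIO to shortcut a run when a matching project already exists.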