
Design ways to share ScanCode.io scans between DejaCode and PurlDB

Open pombredanne opened this issue 10 months ago • 0 comments

Context

  • DejaCode has a captive ScanCode.io worker and pushes scan requests there based on UI requests
  • PurlDB now has one or more ScanCode.io workers that tether to PurlDB and consume its queue of pending scans (until now it had a captive ScanCode.io worker, like DejaCode)

Problem

When I deploy a full AboutCode stack, I would like to avoid scanning the same code twice: once in the DejaCode (DJCD) ScanCode.io (SCIO) instance(s) and once in the PurlDB SCIO instance(s).

Why? Because duplicated scans waste compute resources and slow down access to scans.

Solution

Here are a few solution elements to avoid re-running scans:

  • We could avoid running two scan queues: one in DJCD and one in PurlDB. Here DejaCode could always request a scan through PurlDB and therefore would not have its own captive ScanCode.io.

    • A caveat is that a dedicated private ScanCode.io may have been configured for privileged access to private resources and we would have to support this somehow.
  • We could design ways to query for a scan project from a pool of SCIO instances to get the scan if available.

    • Here the difficult issue is how to identify which scan project matches the code/scan combo we need. This identity includes naming the project but also the project inputs, the set of pipelines that ran in a certain sequence, the state of these pipeline runs, and the actual version of the SCIO used for a run.

    • In this latter approach, we could expose a lookup endpoint on SCIO that would return projects that match the input checksums, the pipelines and their state, and the SCIO version. The version could possibly be abstracted to a simple number, as there are cases where a newer version of SCIO returns exactly the same results for a given set of pipelines: not all SCIO version changes invalidate all the scan runs.

    • Or, using the same concepts, we could shortcut and speed up the scans in an SCIO when a project is identified as the same as a previous project by the same criteria as above.

  • Another approach would be to rely on the emerging approach of FederatedCode, where scans are serialized and stored for later access and reuse.
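
To make the scan identity idea more concrete, here is a minimal Python sketch of how the matching criteria (input checksums, the ordered pipelines with their run state, and a simple number abstracting the SCIO version) could be reduced to a single fingerprint, and how a pool of SCIO instances could be checked for a match. Everything here is an assumption for illustration: `ScanIdentity`, `results_version`, `find_existing_scan`, and the in-memory `pool` index are hypothetical names, not existing ScanCode.io, DejaCode, or PurlDB APIs. A real lookup would go through an endpoint that does not exist yet.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanIdentity:
    """Hypothetical identity of a scan project (not an existing SCIO model)."""

    input_checksums: tuple  # e.g. SHA-256 hex digests of the project inputs
    pipelines: tuple  # ordered (pipeline_name, run_state) pairs
    results_version: int  # assumed simple number abstracting the SCIO version

    def fingerprint(self) -> str:
        payload = {
            # Assumption: the order of inputs does not matter, so sort them.
            "inputs": sorted(self.input_checksums),
            # Pipelines ran in a certain sequence, so keep their order.
            "pipelines": [list(pair) for pair in self.pipelines],
            "results_version": self.results_version,
        }
        # Canonical JSON so that equal identities always hash the same way.
        encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def find_existing_scan(identity, pool):
    """Return (instance_name, project_name) for the first match, else None.

    `pool` stands in for a set of SCIO instances: it maps an instance name
    to a {fingerprint: project_name} index. In practice each instance would
    be queried over a lookup endpoint instead.
    """
    wanted = identity.fingerprint()
    for instance_name, index in pool.items():
        project_name = index.get(wanted)
        if project_name is not None:
            return instance_name, project_name
    return None
```

For example, DejaCode could build a `ScanIdentity` for a package download, call `find_existing_scan` against the known SCIO instances, and only queue a new scan when the result is `None`. Including the run state in the fingerprint means a failed or unfinished run would never match a request for a successful one.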

pombredanne, Mar 27 '24 08:03