ScanCode.io: Support multiple scan workers systems
This includes updating the scanning architecture of PurlDB to accommodate multiple ScanCode.io worker systems (whole machines) and exposing a queue API from which ScanCode.io instances can pick a scanning job to run symbol collection.
The original design was to have multiple scan queues: one for the scan proper, another for symbols, another for strings, and so on. We decided on a different design with only one scan queue, as designed in #290. Everything is collected from that single queue, through a single pipeline that supports multiple tools/indexers at once.
See https://github.com/nexB/purldb/pull/290
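The single-queue idea can be sketched in a few lines of Python. This is only an in-memory illustration, not the PurlDB implementation: the names `ScanJob`, `SingleScanQueue`, `add` and `get_next` are hypothetical, and the real queue is exposed over an HTTP API rather than shared memory.

```python
import threading
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanJob:
    # Hypothetical job record; field names are illustrative only.
    package_url: str
    claimed_by: Optional[str] = None


class SingleScanQueue:
    """One shared FIFO queue that several worker machines ask for work."""

    def __init__(self):
        self._pending = deque()
        self._lock = threading.Lock()

    def add(self, job):
        # Append a new job at the back of the queue.
        with self._lock:
            self._pending.append(job)

    def get_next(self, worker_id):
        # Hand the oldest pending job to the calling worker, or None if empty.
        # The lock ensures two workers never receive the same job.
        with self._lock:
            if not self._pending:
                return None
            job = self._pending.popleft()
            job.claimed_by = worker_id
            return job


queue = SingleScanQueue()
queue.add(ScanJob("pkg:pypi/example@1.0"))
queue.add(ScanJob("pkg:npm/example@2.0"))
first = queue.get_next("worker-a")
second = queue.get_next("worker-b")
```

Each worker receives a different job, and a worker that asks when the queue is empty gets `None` back and can retry later.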
This pull request adds a new endpoint that exposes the pending requests in the scanning queue.
- The design is described in https://github.com/nexB/purldb/issues/236
- The client side is in https://github.com/nexB/scancode.io/pull/1078
- The original issue that started this work is at https://github.com/nexB/purldb/issues/49
We eventually added the ability to support multiple pipelines for data collection.
The previous design, a single list of pipelines run in all cases for every package, was not scalable: it would have required running too much on too many packages all the time.
The solution is to allow requesting one or more pipelines when collecting the package data, and to support running multiple pipelines in each of the ScanCode.io workers that run the actual scan.
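As a rough sketch of per-request pipeline selection, a worker could look up each requested pipeline by name and run only those. The pipeline names, the `PIPELINES` registry and `run_requested_pipelines` below are hypothetical placeholders, not the actual ScanCode.io or PurlDB API.

```python
def scan_package(purl):
    # Placeholder for a real scan; returns a marker result only.
    return {"pipeline": "scan_package", "purl": purl}


def collect_symbols(purl):
    # Placeholder for symbol collection.
    return {"pipeline": "collect_symbols", "purl": purl}


def collect_strings(purl):
    # Placeholder for string collection.
    return {"pipeline": "collect_strings", "purl": purl}


# Hypothetical registry of the pipelines a worker knows how to run.
PIPELINES = {
    "scan_package": scan_package,
    "collect_symbols": collect_symbols,
    "collect_strings": collect_strings,
}


def run_requested_pipelines(purl, requested):
    # Reject the whole request if any pipeline name is unknown,
    # then run the requested pipelines in the order given.
    unknown = [name for name in requested if name not in PIPELINES]
    if unknown:
        raise ValueError(f"Unknown pipelines: {unknown}")
    return [PIPELINES[name](purl) for name in requested]


results = run_requested_pipelines(
    "pkg:pypi/example@1.0", ["scan_package", "collect_symbols"]
)
```

Only the two requested pipelines run here, and `collect_strings` is skipped. This is the point of the change: a package is charged only for the data collection that was actually asked for.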
The latest work on multi-pipeline support was merged in this PR:
- https://github.com/nexB/purldb/pull/393
Instructions for testing this are in this comment:
- https://github.com/nexB/purldb/pull/393#issue-2249213184