fast-scan: Plan and create benchmarks for performance improvements

Open pombredanne opened this issue 10 months ago • 6 comments

ScanCode is accurate, but it could be made much faster. Feedback from community users often includes complaints about the speed of scans. This issue addresses those concerns with a focused initiative to improve performance for both ScanCode Toolkit and ScanCode.io.

To scan faster, we need to start by measuring. This means creating benchmarks to establish a baseline for each scanner and tool. We should publish that baseline benchmark and profile scan performance hotspots as part of the effort to improve the overall performance of ScanCode Toolkit and ScanCode.io.
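
As a rough sketch of what such a baseline harness could look like (the scan targets, repeat count, and output path below are placeholders, not a final design), something along these lines would shell out to an installed scancode and record median wall-clock times per scanner:

```python
# Hypothetical baseline harness, not a final design: time repeated scancode
# runs per scanner and report the median wall-clock time.
import statistics
import subprocess
import time

TARGETS = ["samples/"]  # placeholder: codebases to benchmark
REPEATS = 3             # repeat runs to smooth out run-to-run variance

SCANS = {"license": ["-l"], "copyright": ["-c"], "package": ["-p"]}

def time_scan(target, options):
    start = time.perf_counter()
    subprocess.run(
        ["scancode", *options, "--json-pp", "/dev/null", target],
        check=True,
    )
    return time.perf_counter() - start

for target in TARGETS:
    for name, options in SCANS.items():
        timings = [time_scan(target, options) for _ in range(REPEATS)]
        print(f"{target} {name}: {statistics.median(timings):.1f}s "
              f"(median of {REPEATS} runs)")
```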

  • [ ] https://github.com/aboutcode-org/scancode-toolkit/issues/4059
  • [ ] https://github.com/aboutcode-org/scancode-toolkit/issues/4060
  • [ ] https://github.com/aboutcode-org/scancode-toolkit/issues/4061
  • [ ] https://github.com/aboutcode-org/scancode-toolkit/issues/4062
  • [ ] https://github.com/aboutcode-org/scancode.io/issues/1499
  • [ ] https://github.com/aboutcode-org/scancode.io/issues/1500

pombredanne avatar Jan 05 '25 19:01 pombredanne

From the initial measurement of separate License, Copyright and Package scans for:

  • Android: https://github.com/aosp-mirror/platform_frameworks_base/archive/refs/tags/android-15.0.0_r23.tar.gz
  • Kernel: https://github.com/torvalds/linux/archive/refs/tags/v6.13.tar.gz
  • SCTK: https://github.com/aboutcode-org/scancode-toolkit/releases/download/v32.3.3/scancode-toolkit-v32.3.3_py3.10-linux.tar.gz

My initial observations are:

  • The results are extremely variable (more than I expected).
  • The scans with the --timing option are all slower (a Heisenberg effect?).
  • The License scan of SCTK was stuck for more than 24 hours, likely because of the very high volume of license-related data in /licensedcode/data/. This is a predictable Heisenberg effect, so we should pick a different AboutCode project for the benchmarking (at least until /licensedcode/data/ is packaged separately).

Two next steps:

  • Replace SCTK with another codebase, AboutCode or otherwise
  • Run scans with -clp for comparison (see the sketch below)
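
For the --timing slowdown specifically, one quick way to quantify the overhead would be to time the same -clp scan both ways; a hedged sketch (the target path is a placeholder):

```python
# Rough check of the observed --timing overhead: run the same -clp scan
# with and without --timing and compare wall-clock times.
import subprocess
import time

TARGET = "linux-6.13/"  # placeholder: an extracted benchmark codebase

def wall_time(extra_options):
    start = time.perf_counter()
    subprocess.run(
        ["scancode", "-clp", *extra_options, "--json-pp", "/dev/null", TARGET],
        check=True,
    )
    return time.perf_counter() - start

plain = wall_time([])
timed = wall_time(["--timing"])
print(f"without --timing: {plain:.1f}s")
print(f"with --timing:    {timed:.1f}s ({100 * (timed - plain) / plain:+.1f}%)")
```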

mjherzog avatar Apr 25 '25 22:04 mjherzog

The benchmark samples are uploaded: https://drive.google.com/drive/folders/1qOfE7kqfsyT5LTPdGIGMOYyB26dW0Ltu

chinyeungli avatar Apr 30 '25 23:04 chinyeungli

@chinyeungli we need to run the benchmarks... is that what you did above? Could we put the data and results in a git repo instead?

pombredanne avatar Jun 12 '25 15:06 pombredanne

@pombredanne Chin Yeung created sample data for review with a small set. We need to discuss what a benchmark looks like in much more detail, including:

  • How many different types of scans? E.g., wouldn't a -clp scan be a primary use case?
  • What is the list of target projects to scan? SCTK was not a good choice because of the Heisenberg problem with its license data. Do we have a list of projects somewhere that we otherwise use for SCTK testing?

I can work with Chin Yeung to rerun a better sample set with your input on these questions.

mjherzog avatar Jun 12 '25 15:06 mjherzog

> Could we put the data and results in a git repo instead?

Where do we want to put it?

As Michael said, it's probably a better idea to re-run with your input for a better sample set.

chinyeungli avatar Jun 12 '25 23:06 chinyeungli

@chinyeungli Please take a look at: https://github.com/aboutcode-org/popular-package-purls/blob/main/popular-purls.json which is a very large list of popular PURLs in JSON format. Please convert the data to a simple XLSX format so that we can slice and dice it to get a representative set of packages for the benchmark. This will be a very long list, but it seems to be a useful starting point. The data appears to be PURLs and the count of dependents; it does not tell us anything about how interesting these packages are from a license/copyright scan perspective. Philippe also suggested that we should include a kernel codebase and some Docker images in the benchmark, which means SCIO in addition to SCTK.
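
For the conversion itself, a minimal sketch using pandas could look like the following; the "purl" and "dependents" field names are assumptions about the JSON structure, so adjust them to the actual file:

```python
# Hypothetical conversion sketch: the "purl" and "dependents" field names
# are assumptions about popular-purls.json -- adjust to the actual structure.
import json

import pandas as pd
from packageurl import PackageURL  # packageurl-python

with open("popular-purls.json") as f:
    records = json.load(f)

df = pd.DataFrame(records)
# Derive the package type (cargo, npm, pypi, ...) from each purl for slicing.
df["type"] = df["purl"].map(lambda p: PackageURL.from_string(p).type)
df.to_excel("popular-purls.xlsx", index=False)  # needs openpyxl installed
```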

mjherzog avatar Jun 13 '25 01:06 mjherzog

Converted the JSON to XLSX and uploaded it to the aboutcode.org Google Drive (https://docs.google.com/spreadsheets/d/1dooZIvOw-IkZy62WDWWuv64EakUCDUgx/edit?usp=drive_link&ouid=111292578521869099532&rtpof=true&sd=true).

It has 249,999 entries:

  • cargo: 49,999
  • golang: 50,000
  • maven: 50,000
  • npm: 50,000
  • pypi: 50,000

@mjherzog let me know what the next step is.

chinyeungli avatar Jun 16 '25 08:06 chinyeungli

I updated the popular-purls spreadsheet to v0.10 and added a sheet that reduces the list to packages with 1,000 or more dependents, which cuts the count to 9,839. This is somewhat arbitrary, but we generally want to use more popular / well-known packages for the benchmark (and that will be easiest to explain). I cannot think of any specific way to determine the size or complexity of the packages, so let's start from the top and work our way down. As a start, please scan the source for the top 5 of each PURL type on deja08 and see what we get.
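
One way to pick that set, assuming the same "purl"/"dependents"/"type" columns as in the conversion sketch above, is a quick pandas filter-and-group:

```python
# Hedged selection sketch: keep packages with >= 1000 dependents, then take
# the top 5 per purl type. Column names match the conversion sketch above.
import pandas as pd

df = pd.read_excel("popular-purls.xlsx")
popular = df[df["dependents"] >= 1000]  # 9,839 rows per the v0.10 sheet
top5 = (
    popular.sort_values("dependents", ascending=False)
    .groupby("type")
    .head(5)  # top 5 of each purl type
)
print(top5[["type", "purl", "dependents"]].to_string(index=False))
```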

mjherzog avatar Jun 17 '25 01:06 mjherzog

If you have time, please also run scans for the corresponding distribution packages.
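
For the pypi entries at least, the distribution archive can be located from the public PyPI JSON API given a purl; a hedged sketch (the example purl is a placeholder):

```python
# Hedged sketch: resolve the sdist download URL for a pypi purl via the
# public PyPI JSON API. The example purl is a placeholder.
import requests
from packageurl import PackageURL

purl = PackageURL.from_string("pkg:pypi/requests@2.32.3")  # placeholder
data = requests.get(
    f"https://pypi.org/pypi/{purl.name}/{purl.version}/json", timeout=30
).json()
sdist = next(u for u in data["urls"] if u["packagetype"] == "sdist")
print(sdist["url"])  # download and scan this archive as the "distribution"
```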

mjherzog avatar Jun 17 '25 02:06 mjherzog

So our reset items are to:

  • Update the existing Android, Kernel and popular-purls files to correct errors (do not worry about adding more data to popular-purls)
  • Rerun the Android and Kernel scans with -clip, with and without the --timing option
  • Report an issue to either correct the Scan Speed label in the summary to resources/sec, or change the calculation to files/sec instead of resources/sec
  • Investigate the discrepancy between the total time in the summary and the sum of the file-level timing data (see the sketch below for one way to compare the two)
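
For that last item, a starting point could be something like the following; note that the per-file "scan_timings" key is an assumption and should be checked against actual --timing output:

```python
# Sketch for comparing the header-level duration with the sum of per-file
# timings from a --timing scan. The "scan_timings" key is an assumption --
# check the actual per-file field name in the JSON output.
import json

with open("scan-result.json") as f:
    result = json.load(f)

header_duration = result["headers"][0].get("duration")
per_file_total = sum(
    sum(entry.get("scan_timings", {}).values())
    for entry in result.get("files", [])
)
print(f"header duration:     {header_duration}s")
print(f"sum of file timings: {per_file_total:.1f}s")
```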

mjherzog avatar Jun 18 '25 00:06 mjherzog

I started a repo for the benchmarking of scancode-toolkit: https://github.com/aboutcode-org/scancode-benchmark

JonoYang avatar Jul 09 '25 19:07 JonoYang