fast-scan: Plan and create benchmarks for performance improvements

Open pombredanne opened this issue 10 months ago • 6 comments

ScanCode is accurate, but it could be made much faster. Feedback from community users often includes complaints about the speed of scans. This issue addresses those concerns with a focused initiative to improve performance for both ScanCode Toolkit and ScanCode.io.

To scan faster, we need to start by measuring. This means creating benchmarks to establish a baseline for each scanner and tool. We should publish that baseline benchmark and profile scan performance hotspots as part of the effort to improve the overall performance of ScanCode Toolkit and ScanCode.io.
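
As a rough sketch of what such a baseline harness could look like (the scan targets, repeat count, and output path below are placeholders, not a final design), something along these lines would shell out to an installed scancode and record median wall-clock times per scanner:

```python
# Hypothetical baseline harness, not a final design: time repeated scancode
# runs per scanner and report the median wall-clock time.
import statistics
import subprocess
import time

TARGETS = ["samples/"]  # placeholder: codebases to benchmark
REPEATS = 3             # repeat runs to smooth out run-to-run variance

SCANS = {"license": ["-l"], "copyright": ["-c"], "package": ["-p"]}

def time_scan(target, options):
    start = time.perf_counter()
    subprocess.run(
        ["scancode", *options, "--json-pp", "/dev/null", target],
        check=True,
    )
    return time.perf_counter() - start

for target in TARGETS:
    for name, options in SCANS.items():
        timings = [time_scan(target, options) for _ in range(REPEATS)]
        print(f"{target} {name}: {statistics.median(timings):.1f}s "
              f"(median of {REPEATS} runs)")
```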

  • [ ] https://github.com/aboutcode-org/scancode-toolkit/issues/4059
  • [ ] https://github.com/aboutcode-org/scancode-toolkit/issues/4060
  • [ ] https://github.com/aboutcode-org/scancode-toolkit/issues/4061
  • [ ] https://github.com/aboutcode-org/scancode-toolkit/issues/4062
  • [ ] https://github.com/aboutcode-org/scancode.io/issues/1499
  • [ ] https://github.com/aboutcode-org/scancode.io/issues/1500

pombredanne avatar Jan 05 '25 19:01 pombredanne

From the initial measurement of separate License, Copyright and Package scans for:

  • Android: https://github.com/aosp-mirror/platform_frameworks_base/archive/refs/tags/android-15.0.0_r23.tar.gz
  • Kernel: https://github.com/torvalds/linux/archive/refs/tags/v6.13.tar.gz
  • SCTK: https://github.com/aboutcode-org/scancode-toolkit/releases/download/v32.3.3/scancode-toolkit-v32.3.3_py3.10-linux.tar.gz

My initial observations are:

  • The results are extremely variable (more than I expected).
  • The scans with the --timing option are all slower (a Heisenberg effect?).
  • The License scan of SCTK was stuck for more than 24 hours, likely because of the very high volume of license-related data in /licensedcode/data/. This is a predictable Heisenberg effect, so we should pick a different AboutCode project for the benchmarking (at least until /licensedcode/data/ is packaged separately).

Two next steps:

  • Replace SCTK with another codebase, AboutCode or otherwise
  • Run scans with -clp for comparison (see the sketch below)
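
For the --timing slowdown specifically, one quick way to quantify the overhead would be to time the same -clp scan both ways; a hedged sketch (the target path is a placeholder):

```python
# Rough check of the observed --timing overhead: run the same -clp scan
# with and without --timing and compare wall-clock times.
import subprocess
import time

TARGET = "linux-6.13/"  # placeholder: an extracted benchmark codebase

def wall_time(extra_options):
    start = time.perf_counter()
    subprocess.run(
        ["scancode", "-clp", *extra_options, "--json-pp", "/dev/null", TARGET],
        check=True,
    )
    return time.perf_counter() - start

plain = wall_time([])
timed = wall_time(["--timing"])
print(f"without --timing: {plain:.1f}s")
print(f"with --timing:    {timed:.1f}s ({100 * (timed - plain) / plain:+.1f}%)")
```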

mjherzog avatar Apr 25 '25 22:04 mjherzog

The benchmark samples are uploaded: https://drive.google.com/drive/folders/1qOfE7kqfsyT5LTPdGIGMOYyB26dW0Ltu

chinyeungli avatar Apr 30 '25 23:04 chinyeungli

@chinyeungli we need to run the benchmarks... is that what you did above? Could we put the data and results in a git repo instead?

pombredanne avatar Jun 12 '25 15:06 pombredanne

@pombredanne Chin Yeung created sample data for review with a small set. We need to discuss what a benchmark looks like in much more detail, including:

  • How many different types of scans? E.g., wouldn't a -clp scan be a primary use case?
  • What is the list of target projects to scan? SCTK was not a good choice because of the Heisenberg problem with its license data. Do we have a list of projects somewhere that we otherwise use for SCTK testing?

I can work with Chin Yeung to rerun a better sample set with your input on these questions.

mjherzog avatar Jun 12 '25 15:06 mjherzog

> Could we put the data and results in a git repo instead?

Where do we want to put it?

As Michael said, it's probably a better idea to re-run with your input for a better sample set.

chinyeungli avatar Jun 12 '25 23:06 chinyeungli

@chinyeungli Please take a look at: https://github.com/aboutcode-org/popular-package-purls/blob/main/popular-purls.json which is a very large list of popular PURLs in JSON format. Please convert the data to a simple XLSX format so that we can slice and dice it to get a representative set of packages for the benchmark. This will be a very long list, but it seems to be a useful starting point. The data appears to be PURLs and the count of dependents; it does not tell us anything about how interesting these packages are from a license/copyright scan perspective. Philippe also suggested that we should include a kernel codebase and some Docker images in the benchmark, which means SCIO in addition to SCTK.
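
For the conversion itself, a minimal sketch using pandas could look like the following; the "purl" and "dependents" field names are assumptions about the JSON structure, so adjust them to the actual file:

```python
# Hypothetical conversion sketch: the "purl" and "dependents" field names
# are assumptions about popular-purls.json -- adjust to the actual structure.
import json

import pandas as pd
from packageurl import PackageURL  # packageurl-python

with open("popular-purls.json") as f:
    records = json.load(f)

df = pd.DataFrame(records)
# Derive the package type (cargo, npm, pypi, ...) from each purl for slicing.
df["type"] = df["purl"].map(lambda p: PackageURL.from_string(p).type)
df.to_excel("popular-purls.xlsx", index=False)  # needs openpyxl installed
```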

mjherzog avatar Jun 13 '25 01:06 mjherzog

Converted the JSON to XLSX and uploaded it to the aboutcode.org Google Drive (https://docs.google.com/spreadsheets/d/1dooZIvOw-IkZy62WDWWuv64EakUCDUgx/edit?usp=drive_link&ouid=111292578521869099532&rtpof=true&sd=true).

It has 249,999 entries:

  • cargo: 49,999
  • golang: 50,000
  • maven: 50,000
  • npm: 50,000
  • pypi: 50,000

@mjherzog let me know what the next step is.

chinyeungli avatar Jun 16 '25 08:06 chinyeungli

I updated the popular-purls spreadsheet to v0.10 and added a sheet that reduces the list to packages with 1,000 or more dependents, which cuts the count to 9,839. This is somewhat arbitrary, but we generally want to use more popular / well-known packages for the benchmark (and that will be easiest to explain). I cannot think of any specific way to determine the size or complexity of the packages, so let's start from the top and work our way down. As a start, please scan the source for the top 5 of each PURL type on deja08 and see what we get.
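
One way to pick that set, assuming the same "purl"/"dependents"/"type" columns as in the conversion sketch above, is a quick pandas filter-and-group:

```python
# Hedged selection sketch: keep packages with >= 1000 dependents, then take
# the top 5 per purl type. Column names match the conversion sketch above.
import pandas as pd

df = pd.read_excel("popular-purls.xlsx")
popular = df[df["dependents"] >= 1000]  # 9,839 rows per the v0.10 sheet
top5 = (
    popular.sort_values("dependents", ascending=False)
    .groupby("type")
    .head(5)  # top 5 of each purl type
)
print(top5[["type", "purl", "dependents"]].to_string(index=False))
```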

mjherzog avatar Jun 17 '25 01:06 mjherzog

If you have time, please also run scans for the corresponding distribution packages.
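
For the pypi entries at least, the distribution archive can be located from the public PyPI JSON API given a purl; a hedged sketch (the example purl is a placeholder):

```python
# Hedged sketch: resolve the sdist download URL for a pypi purl via the
# public PyPI JSON API. The example purl is a placeholder.
import requests
from packageurl import PackageURL

purl = PackageURL.from_string("pkg:pypi/requests@2.32.3")  # placeholder
data = requests.get(
    f"https://pypi.org/pypi/{purl.name}/{purl.version}/json", timeout=30
).json()
sdist = next(u for u in data["urls"] if u["packagetype"] == "sdist")
print(sdist["url"])  # download and scan this archive as the "distribution"
```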

mjherzog avatar Jun 17 '25 02:06 mjherzog

So our reset items are to:

  • Update the existing Android, Kernel and popular-purls files to correct errors (do not worry about adding more data to popular-purls)
  • Rerun the Android and Kernel scans with -clip, with and without the --timing option
  • Report an issue to either correct the Scan Speed label in the summary to resources/sec, or change the calculation to files/sec instead of resources/sec
  • Investigate the discrepancy between the total time in the summary and the sum of the file-level timing data (see the sketch below for one way to compare the two)
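
For that last item, a starting point could be something like the following; note that the per-file "scan_timings" key is an assumption and should be checked against actual --timing output:

```python
# Sketch for comparing the header-level duration with the sum of per-file
# timings from a --timing scan. The "scan_timings" key is an assumption --
# check the actual per-file field name in the JSON output.
import json

with open("scan-result.json") as f:
    result = json.load(f)

header_duration = result["headers"][0].get("duration")
per_file_total = sum(
    sum(entry.get("scan_timings", {}).values())
    for entry in result.get("files", [])
)
print(f"header duration:     {header_duration}s")
print(f"sum of file timings: {per_file_total:.1f}s")
```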

mjherzog avatar Jun 18 '25 00:06 mjherzog

I started a repo for the benchmarking of scancode-toolkit: https://github.com/aboutcode-org/scancode-benchmark

JonoYang avatar Jul 09 '25 19:07 JonoYang