scancode-toolkit
fast-scan: Plan and create benchmarks for performance improvements
ScanCode is accurate, but it could be made much faster. Feedback from community users often includes complaints about the speed of scans. This issue addresses those concerns with a focused initiative to improve performance for both ScanCode Toolkit and ScanCode.io.
To scan faster, we need to start by measuring. This means creating benchmarks to establish a baseline for each scanner and tool. We should establish and publish a baseline benchmark and profile scan performance hotspots as part of the effort to improve the overall performance of ScanCode Toolkit and ScanCode.io.
- [ ] https://github.com/aboutcode-org/scancode-toolkit/issues/4059
- [ ] https://github.com/aboutcode-org/scancode-toolkit/issues/4060
- [ ] https://github.com/aboutcode-org/scancode-toolkit/issues/4061
- [ ] https://github.com/aboutcode-org/scancode-toolkit/issues/4062
- [ ] https://github.com/aboutcode-org/scancode.io/issues/1499
- [ ] https://github.com/aboutcode-org/scancode.io/issues/1500
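To make the baseline measurement concrete, a timing harness could be as small as the sketch below. This is only an illustration under assumptions: the target paths, output directory, and helper names are hypothetical, and a real benchmark would also need to control for hardware, caching, and repeated runs.

```python
#!/usr/bin/env python3
"""Minimal sketch of a baseline timing harness for separate scancode scans.

The target paths and output locations are placeholders; adjust them to the
extracted codebases actually used for the benchmark.
"""
import json
import subprocess
import time
from pathlib import Path

# Hypothetical extracted codebases to benchmark.
TARGETS = {
    "android": Path("targets/platform_frameworks_base-android-15.0.0_r23"),
    "kernel": Path("targets/linux-6.13"),
}

# One scanner per run, matching the separate License, Copyright and Package scans.
SCANS = {"license": "--license", "copyright": "--copyright", "package": "--package"}


def run_scan(scan_name: str, option: str, target: Path, out_dir: Path) -> float:
    """Run one scancode scan and return its wall-clock duration in seconds."""
    out_file = out_dir / f"{target.name}-{scan_name}.json"
    start = time.perf_counter()
    subprocess.run(
        ["scancode", option, "--json-pp", str(out_file), str(target)],
        check=True,
    )
    return time.perf_counter() - start


def main() -> None:
    out_dir = Path("benchmark-results")
    out_dir.mkdir(exist_ok=True)
    timings = {}
    for target_name, target in TARGETS.items():
        for scan_name, option in SCANS.items():
            seconds = run_scan(scan_name, option, target, out_dir)
            timings[f"{target_name}/{scan_name}"] = round(seconds, 2)
    (out_dir / "timings.json").write_text(json.dumps(timings, indent=2))


if __name__ == "__main__":
    main()
```

Profiling hotspots (for example with cProfile or py-spy on a single slow scan) would then sit on top of a harness like this rather than replace it.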
From the initial measurement of separate License, Copyright and Package scans for:
- Android: https://github.com/aosp-mirror/platform_frameworks_base/archive/refs/tags/android-15.0.0_r23.tar.gz
- Kernel: https://github.com/torvalds/linux/archive/refs/tags/v6.13.tar.gz
- SCTK: https://github.com/aboutcode-org/scancode-toolkit/releases/download/v32.3.3/scancode-toolkit-v32.3.3_py3.10-linux.tar.gz
My initial observations are:
- The results are extremely variable (more than I expected)
- The scans with the --timing option are all slower (Heisenberg effect?)
- The license scan of SCTK was stuck for more than 24 hours - likely because of the very high volume of license-related data in /licensedcode/data/. This is a predictable Heisenberg effect, so we should pick a different AboutCode project for the benchmarking (at least until /licensedcode/data/ is packaged separately).
Two next steps:
- Replace SCTK with another codebase, AboutCode or other
- Run scans with -clp for comparison
The benchmark samples are uploaded: https://drive.google.com/drive/folders/1qOfE7kqfsyT5LTPdGIGMOYyB26dW0Ltu
@chinyeungli we need to run the benchmarks... is that what you did above? Could we put the data and results in a git repo instead?
@pombredanne Chin Yeung created sample data for review with a small set. We need to discuss what a benchmark looks like in much more detail, including:
- How many different types of scans - e.g. wouldn't a -clp scan be a primary use case?
- What is the list of target projects to scan? SCTK was not a good choice because of the Heisenberg problem with license data. Do we have a list of projects somewhere that we otherwise use for SCTK testing? I can work with Chin Yeung to rerun a better sample set with your input on these questions.
Could we put the data and results in a git repo instead? Where do we want to put it?
As Michael said, it's probably a better idea to re-run with your input for a better sample set.
@chinyeungli Please take a look at: https://github.com/aboutcode-org/popular-package-purls/blob/main/popular-purls.json which is a very large list of popular PURLs in JSON format. Please convert the data to a simple XLSX format so that we can slice and dice it to get a representative set of packages for the benchmark. This will be a very long list but it seems to be a useful starting point. The data seems to be PURLs and the count of dependents. It does not tell us anything about how interesting these packages are from a license/copyright scan perspective. Philippe also suggested that we should include a kernel codebase and some Docker images in the benchmark, which means SCIO in addition to SCTK.
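For reference, the conversion itself is small and could be scripted along these lines. The sketch below assumes pandas and openpyxl are available and that each record in popular-purls.json carries a purl string and a dependents count; those field names are an assumption about the file layout, not a documented schema.

```python
# Hypothetical sketch: convert popular-purls.json to XLSX for slicing in a spreadsheet.
# Assumes each record has a "purl" and a dependents count; adjust field names to the real file.
import json

import pandas as pd

with open("popular-purls.json") as f:
    records = json.load(f)

df = pd.DataFrame(records)

# Derive the PURL type (cargo, golang, maven, npm, pypi, ...) from the purl string,
# e.g. "pkg:npm/lodash" -> "npm".
df["type"] = df["purl"].str.extract(r"^pkg:([^/]+)/", expand=False)

df.to_excel("popular-purls.xlsx", index=False)
print(df["type"].value_counts())
```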
Converted the JSON to XLSX and uploaded it to aboutcode.org's gdrive (https://docs.google.com/spreadsheets/d/1dooZIvOw-IkZy62WDWWuv64EakUCDUgx/edit?usp=drive_link&ouid=111292578521869099532&rtpof=true&sd=true)
It has 249,999 entries:
- cargo: 49,999
- golang: 50,000
- maven: 50,000
- npm: 50,000
- pypi: 50,000
@mjherzog let me know what's the next step
I updated the popular-purls spreadsheet to v0.10 and added a sheet reducing the list to packages with 1,000 or more dependents, which reduces the count to 9,839. This is somewhat arbitrary, but we generally want to use more popular / well-known packages for the benchmark (and that will be easiest to explain). I cannot think of any specific way to determine the size or complexity of the packages, so let's start from the top and work our way down. As a start, please scan the source for the top 5 of each PURL type on deja08 and see what we get.
If you have time, please also run scans for the corresponding distribution packages.
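The selection itself could be reproduced with a short script like the sketch below, assuming the spreadsheet loads into pandas with purl, type, and dependents columns; the column names are assumptions about the sheet, not its actual headers.

```python
# Hypothetical sketch: reduce the popular-purls list to packages with 1,000+ dependents
# and take the top 5 of each PURL type as the first benchmark sample.
# Column names ("purl", "type", "dependents") are assumptions about the spreadsheet.
import pandas as pd

df = pd.read_excel("popular-purls.xlsx")

popular = df[df["dependents"] >= 1000]
top5 = (
    popular.sort_values("dependents", ascending=False)
    .groupby("type", group_keys=False)
    .head(5)
)
print(top5[["type", "purl", "dependents"]])
```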
So our reset items are to:
- Update existing Android, Kernel and popular-purls files to correct errors (do not worry about adding more data to popular-purls)
- Rerun the Android and Kernel scans with -clip, with and without the --timing option
- Report an issue to either correct the Scan Speed label in the Summary to resources/sec or change the calculation to files/sec instead of resources/sec
- Investigate the discrepancy between the total time reported in the summary and the sum of the file-level timing data (see the sketch below)
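For that last item, a rough comparison could look like the following. The field names used here ("duration" in the headers, a per-file "scan_timings" mapping) are assumptions about the JSON output produced with --timing enabled and may differ between ScanCode versions; adjust them to what the real output holds.

```python
# Hypothetical sketch: compare the overall duration reported in the scan headers
# with the sum of the per-file timings collected with --timing.
# Field names ("duration", "scan_timings") are assumptions and may differ by version.
import json

with open("scan-result.json") as f:
    scan = json.load(f)

header_duration = scan["headers"][0].get("duration")

per_file_total = sum(
    sum(f.get("scan_timings", {}).values()) for f in scan.get("files", [])
)

print(f"summary duration:    {header_duration}")
print(f"sum of file timings: {per_file_total:.2f}")
# Any gap likely covers resource inventory, post-scan plugins and output writing.
```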
I started a repo for benchmarking scancode-toolkit: https://github.com/aboutcode-org/scancode-benchmark