Multi-gigabyte memory spikes when processing large files
Description
Normally scancode is quite predictable in its memory use. It sits at around 1GB per process (as set by --processes), with the vast majority of that memory occupied by the rule index. However, when large files (1MB+) are encountered, memory consumption becomes much less predictable and can spike to more than twice that amount per process (in rare cases it can explode to 12GB for a single file!).
I cannot judge whether this is due to some algorithmic inefficiency or whether it would require a new approach such as chunked processing of files. Either way, it would be nice to have more predictable memory consumption, as these spikes have been found to result in out-of-memory kills.
Normal memory use can be illustrated by running scancode against go-git, which only contains small files (<10KB). As can be seen in the graph below, memory use is very stable:
cd /tmp
git clone https://github.com/go-git/go-git && cd go-git
git checkout -b v5_0_0 tags/v5.0.0
scancode --json-pp scan.json -n 0 --timeout 600 --license --license-text --license-references /tmp/go-git/
However, when large files (1MB+) are encountered, memory consumption becomes much less predictable and can spike to more than twice that amount (update: in rare cases I've seen it spike at almost 12GB on a single file!). The drivers/gpu/drm/amd/include/asic_reg directory from the Linux kernel is a good example of where scancode struggles to be memory-efficient. Notably, dcn_3_2_0_sh_mask.h, a 22MB file, will cause scancode to spike at 2.2GB.
cd /tmp
git clone https://github.com/torvalds/linux && cd linux
git checkout -b v6_7 tags/v6.7
scancode --json-pp scan.json -n 0 --timeout 600 --license --license-text --license-references /tmp/linux/drivers/gpu/drm/amd/include/asic_reg
So that was with one single process. Now, consider running with --processes=8 or similar and encountering many files like these. You can imagine the wild total memory spikes you might end up with.
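As a rough back-of-the-envelope sketch of that worst case, the snippet below multiplies the per-process peaks reported in this issue by the number of workers. It assumes, pessimistically, that every worker hits a large file at the same moment; the function name is hypothetical and only there for illustration.

```python
def worst_case_total_gb(processes: int, peak_per_process_gb: float) -> float:
    """Upper bound on total memory if every worker peaks at the same time."""
    return processes * peak_per_process_gb


if __name__ == "__main__":
    # Per-process figures taken from this issue, in GB:
    # ~1.0 baseline, 2.2 for dcn_3_2_0_sh_mask.h, 11.8 for the worst file seen.
    for label, peak_gb in [("baseline", 1.0), ("22MB header", 2.2), ("worst seen", 11.8)]:
        total = worst_case_total_gb(8, peak_gb)
        print(f"{label}: {total:.1f} GB across 8 processes")
```

In practice the workers are unlikely to all peak simultaneously, so this is an upper bound rather than a prediction, but it shows why a machine sized for 8 x 1GB can still get OOM-killed.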
~~One shouldn't rule out there also being memory leaks, but at least it seems like most of the memory spikes get reclaimed (although memory never seems to drop all the way back to 1.0GB).~~ (update: in the rare cases that spiked at around 12GB, the memory did not appear to be reclaimed, suggesting a memory leak)
How To Reproduce
To reproduce one case where memory balloons, try the following:
python -m venv .venv
. .venv/bin/activate
pip install scancode-toolkit==32.0.8
wget https://raw.githubusercontent.com/torvalds/linux/v6.7/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_2_0_sh_mask.h
scancode --json-pp scan.json -n 0 --timeout 600 --license --license-text --license-references dcn_3_2_0_sh_mask.h
You should observe scancode consuming close to 2.2GB of memory for the single process that is working.
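One way to put a number on that observation without watching a process monitor is to let a small wrapper run the command and read the peak resident set size afterwards. The following is a minimal sketch, assuming Linux (where `ru_maxrss` is reported in kilobytes; macOS reports bytes); `peak_rss_mb` is a hypothetical helper name, not part of scancode.

```python
import resource
import subprocess
import sys


def peak_rss_mb(cmd):
    """Run cmd to completion and return the peak RSS of child processes in MB.

    RUSAGE_CHILDREN reports the largest peak among all children this
    interpreter has waited for, so use a fresh interpreter per measurement.
    Assumes Linux, where ru_maxrss is expressed in kilobytes.
    """
    subprocess.run(cmd, check=False)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return usage.ru_maxrss / 1024


if __name__ == "__main__":
    # Demo with a trivial child process; swap in the scancode command
    # from the steps above (as a list of arguments) to measure a real scan.
    print(f"peak RSS: {peak_rss_mb([sys.executable, '-c', 'pass']):.0f} MB")
```

Passing the scancode command from the steps above as an argument list (with `-n 0`, so that a single process does the work) should report a figure comparable to the one noted here if the observation holds. With several worker processes the value reflects the largest single process, not the total.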
System configuration
- What OS are you running on? (Windows/MacOS/Linux)
Linux.
- What version of scancode-toolkit was used to generate the scan file?
32.0.8
- What installation method was used to install/run scancode? (pip/source download/other)
pip
Update: I have found a couple of files from the Apache Camel project that really push scancode to its limits.
- camel-sbom.json: this took 6m27s for scancode to scan and used about 6GB of memory.
- camel-sbom.xml: this took 7m17s for scancode to scan and used a whopping 11.8GB(!) of memory.
(note: I had to bump the timeout from the default 120 to --timeout 600)
These are admittedly cruel examples, but they go to show that there are edge cases like this that scancode cannot reasonably handle. And ideally you'd like to just throw scancode at any code base and expect it to complete (without blowing up), even in the presence of files like these.
What's worse is that in both these cases the memory was not reclaimed, and scancode continued to work with a high memory footprint, so there may indeed be a memory leak involved, contrary to what I first noted in the description.
I've updated the issue description to incorporate these new observations.
It should be noted that the https://github.com/nexB/scancode-plugins/tree/main/misc/scancode-ignore-binaries plugin (offering an --ignore-binaries CLI flag) appears quite useful for avoiding some of the problematic files mentioned above.
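For large text files such as the header and SBOM files mentioned above, a size-based pre-check may be a useful complement. The sketch below is not part of scancode or the plugin; it simply lists files at or above a size threshold so they can be scanned separately or excluded. The 1MB default mirrors the "1MB+" figure from the description, and `files_larger_than` is a hypothetical name.

```python
import os
from pathlib import Path


def files_larger_than(root, min_bytes=1_000_000):
    """Yield (path, size_in_bytes) for regular files under root >= min_bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                if path.is_symlink():
                    continue  # skip symlinks, which may be dangling
                size = path.stat().st_size
            except OSError:
                continue  # file vanished or is unreadable; ignore it
            if size >= min_bytes:
                yield path, size


if __name__ == "__main__":
    # List the largest offenders first, with sizes in MB.
    hits = sorted(files_larger_than("."), key=lambda item: item[1], reverse=True)
    for path, size in hits:
        print(f"{size / 1_000_000:8.1f} MB  {path}")
```

Run against a checkout before scanning, this gives a quick list of candidates to watch. How to exclude them is left open here; an ignore pattern on the scancode command line is one possibility if the installed version offers one.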