
multi-gigabyte memory-spikes when processing large files

Open petergardfjall opened this issue 1 year ago • 2 comments

Description

Normally scancode is quite predictable in its memory use. It sits around 1GB per process used (--processes), the absolute majority of the memory being occupied by the rule index. However, when large files (1MB+) are encountered memory consumption becomes much less predictable and can spike to more than twice that amount per process (in rare cases it can explode to 12GB for one individual file!).

I cannot judge whether this is due to some algorithmic inefficiency or would require a new approach such as chunked processing of files. Either way it would be nice to have memory consumption more predictable, as it has been found to result in out-of-memory kills.

Normal memory use can be illustrated by running scancode against go-git, which only contains small files (<10KB). As can be seen in the graph below, memory use is very stable:

cd /tmp
git clone https://github.com/go-git/go-git && cd go-git
git checkout -b v5_0_0 tags/v5.0.0
scancode --json-pp scan.json -n 0 --timeout 600 --license --license-text --license-references /tmp/go-git/

[Graph: go-git-scan]

However, when large files (1MB+) are encountered, memory consumption becomes much less predictable and can spike to more than twice that amount (update: in rare cases I've seen it spike to almost 12GB on a single file!). This directory from the Linux kernel is a good example of where scancode struggles to be memory-efficient. Notably, dcn_3_2_0_sh_mask.h, a 22MB file, will cause scancode to spike at 2.2GB.

cd /tmp
git clone https://github.com/torvalds/linux && cd linux
git checkout -b v6_7 tags/v6.7
scancode --json-pp scan.json -n 0 --timeout 600 --license --license-text --license-references /tmp/linux/drivers/gpu/drm/amd/include/asic_reg

[Graph: linux-scan]

So that was with one single process. Now, consider running with --processes=8 or similar and encountering many files like these. You can imagine the wild total memory spikes you might end up with.
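To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes the worst case, where every worker process hits its peak at the same moment, and it uses the per-process figures reported in this issue (1GB baseline, 2.2GB for the kernel header, 12GB for the rare extreme). The function name is made up for illustration and is not part of scancode.

```python
def worst_case_total_gb(processes, peak_per_process_gb):
    """Upper bound on total memory if all workers peak simultaneously."""
    return processes * peak_per_process_gb


# Per-process figures taken from the observations in this issue.
observed_peaks_gb = {
    "baseline": 1.0,        # typical footprint, mostly the rule index
    "22MB header": 2.2,     # dcn_3_2_0_sh_mask.h
    "rare extreme": 12.0,   # worst single-file spike seen
}

for label, peak in observed_peaks_gb.items():
    total = worst_case_total_gb(8, peak)
    print(f"{label}: up to {total:.1f} GB with --processes=8")
```

In practice the workers are unlikely to all peak at once, so this is an upper bound rather than a prediction.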

~~One shouldn't rule out there also being memory leaks, but at least it seems like most of the memory spikes get reclaimed (although memory never seems to drop all the way back to 1.0GB).~~ (update: in rare cases, such as the files that spiked at 12GB, the memory did not appear to be reclaimed, suggesting a memory leak)

How To Reproduce

To reproduce one case where memory balloons, try the following:

python -m venv .venv
. .venv/bin/activate
pip install scancode-toolkit==32.0.8

wget https://raw.githubusercontent.com/torvalds/linux/v6.7/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_2_0_sh_mask.h
scancode --json-pp scan.json -n 0 --timeout 600 --license --license-text --license-references dcn_3_2_0_sh_mask.h

You should observe scancode consuming close to 2.2GB of memory for the single process that is working.
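One way to capture the peak without watching a process monitor is to wrap the command in a small Python helper. This is a minimal sketch, assuming Linux (where `ru_maxrss` is reported in kilobytes). The helper name `peak_child_rss_mb` is hypothetical and not part of scancode.

```python
import resource
import subprocess


def peak_child_rss_mb(cmd):
    """Run cmd to completion and return the peak resident set size in MB.

    Uses RUSAGE_CHILDREN, which reports the largest RSS among child
    processes that have been waited for. It is a per-process maximum,
    not a sum across processes.
    """
    subprocess.run(cmd, check=False)
    # Assumption: Linux, where ru_maxrss is in kilobytes (macOS uses bytes).
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return peak_kb / 1024


# Example call, using the same command line as above:
# peak_child_rss_mb([
#     "scancode", "--json-pp", "scan.json", "-n", "0", "--timeout", "600",
#     "--license", "--license-text", "--license-references",
#     "dcn_3_2_0_sh_mask.h",
# ])
```

Because the value is a maximum over individual processes, it fits the single-process run above. For multi-process runs it would only show the largest worker.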

System configuration

  • What OS are you running on? (Windows/MacOS/Linux)

Linux.

  • What version of scancode-toolkit was used to generate the scan file?

32.0.8

  • What installation method was used to install/run scancode? (pip/source download/other)

pip

petergardfjall, Mar 28 '24 14:03

Update: I have found a couple of files from the Apache Camel project that really push scancode to its limits.

  • camel-sbom.json: this took 6m27s for scancode to scan and it used about 6GB of memory.
  • camel-sbom.xml: this took 7m17s for scancode to scan and it used a whopping 11.8GB (!) of memory.

(note: I had to bump the timeout from the default of 120 seconds to --timeout 600)

These are admittedly cruel examples, but they go to show that there are edge cases like this that scancode cannot reasonably handle. And ideally you'd like to just throw scancode at any code base and expect it to complete (without blowing up), even in the presence of files like these.

What's worse is that in both these cases the memory was not reclaimed and scancode continued to work with a high memory footprint, so there may indeed be a memory leak involved, contrary to what I first noted in the description.

I've updated the issue description to incorporate these new observations.

petergardfjall, Apr 11 '24 11:04

It should be noted that the https://github.com/nexB/scancode-plugins/tree/main/misc/scancode-ignore-binaries plugin (offering an --ignore-binaries cli flag) appears quite useful to avoid some of the problematic files mentioned above.
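Independently of that plugin, it can help to know up front which files in a tree cross the size threshold where spikes were observed (1MB+). The following is a minimal Python sketch of such a pre-scan check. It is not a scancode feature, and the function name and the 1,000,000-byte default are assumptions chosen for illustration.

```python
import os


def find_large_files(root, min_bytes=1_000_000):
    """Yield (path, size_in_bytes) for regular files of at least min_bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so targets are not counted twice
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable
            if size >= min_bytes:
                yield path, size


# Example: list candidates, largest first.
# for path, size in sorted(find_large_files("/tmp/linux"), key=lambda p: -p[1]):
#     print(f"{size / 1e6:8.1f} MB  {path}")
```

The resulting list could be used to scan those files separately with fewer processes, or to decide which ones to exclude.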

petergardfjall, Apr 12 '24 13:04