scancode-toolkit
Feature Request: Parallelize all steps of scancode if possible
Short Description
In all steps except file scanning, scancode uses only one core, resulting in low CPU load and a long runtime.
Select Category
- [X] Enhancement
How This Feature will help you/your organization
It should reduce the overall runtime substantially.
Possible Solution/Implementation Details
I started scancode on a large codebase with the parameter `-n 30`. After two hours (on a 32-thread CPU), it was still in the state `Collect file inventory...`. The entire code base is on an M.2 SSD. The task manager shows little CPU, memory, and I/O activity.
Please parallelize all steps of scancode, including the file collection step.
Hardware / Software used
AMD Ryzen 9 3950X (16 cores, 32 threads), 32 GB RAM, M.2 SSD 970 EVO Plus 1 TB, Windows 10 Enterprise LTSC (1809), Python 3.9, ScanCode 30.0.0
@FrankHeimes Thank you... yes, this makes perfect sense, in particular when scanning whole devices, where there can be a lot of files and even the basic directory listing could benefit from speedups. Do you have experience with this kind of parallelization, for what could be described as a parallelized file system walk?
@pombredanne No, I don't. But I just did a quick test: walking all files and folders recursively using PowerShell took 11 seconds, and enumerating only the folders took 5 seconds, both for the large codebase mentioned above. That is fast enough that the walk itself doesn't require parallelization. Apparently, the processing of each individual file (reading, classification, etc.) is what takes most of the time.
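A quick way to reproduce such a measurement in Python (rather than PowerShell) is to time a bare recursive walk that records nothing but paths. This is a minimal sketch; `walk_paths` is an illustrative helper, not part of ScanCode:

```python
import os
import time

def walk_paths(root):
    """Yield the full path of every file under root, recording nothing else."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

# Time the bare enumeration of the current directory tree.
start = time.perf_counter()
count = sum(1 for _ in walk_paths("."))
elapsed = time.perf_counter() - start
print(f"enumerated {count} files in {elapsed:.2f}s")
```

If this completes in seconds while the inventory step takes hours, the bottleneck is the per-file work, not the walk.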
So I'd do it like this:
- Enumerate all files recursively, recording only their full paths in an in-memory list.
- Distribute the paths across all available cores for further processing in parallel, i.e., classification or whatever else needs to be done.
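The two steps above can be sketched with `multiprocessing.Pool`. This is a hedged illustration, not ScanCode's actual implementation; `scan_one` is a placeholder for the real per-file work (here it just returns the file size):

```python
import os
from multiprocessing import Pool

def walk_paths(root):
    """Step 1: enumerate all files recursively, recording only full paths."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths

def scan_one(path):
    """Placeholder for the per-file work (reading, classification, etc.)."""
    return path, os.path.getsize(path)

def scan_tree(root, processes=None):
    """Step 2: distribute the recorded paths across worker processes."""
    paths = walk_paths(root)
    with Pool(processes=processes) as pool:
        # chunksize batches paths per worker to amortize IPC overhead.
        return pool.map(scan_one, paths, chunksize=64)
```

With `processes=None`, `Pool` defaults to the machine's CPU count, so all 32 hardware threads would be used on the machine described above.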
A slightly different approach would be to start scanning immediately, while the file inventory is still being built. So even before the inventory is complete, worker threads would already begin processing files. This would alleviate the pain somewhat (unless inventorying a single file takes longer than scanning that file).
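This overlapped approach can be sketched by feeding a lazy path generator into `Pool.imap_unordered`, which pulls from the generator incrementally instead of requiring the full list up front. Again a hypothetical sketch, with `scan_one` standing in for the real per-file scan:

```python
import os
from multiprocessing import Pool

def iter_paths(root):
    """Lazily yield file paths while the directory walk is still in progress."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

def scan_one(path):
    """Placeholder for the real per-file scan."""
    return path, os.path.getsize(path)

def scan_overlapped(root, processes=None):
    """Yield scan results as they arrive; workers start scanning files
    before the inventory (the walk) has finished."""
    with Pool(processes=processes) as pool:
        # imap_unordered consumes the generator incrementally, so the walk
        # and the per-file scans run concurrently.
        yield from pool.imap_unordered(scan_one, iter_paths(root), chunksize=16)
```

`imap_unordered` also returns results in completion order, so progress reporting can start as soon as the first files finish.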
I also agree that as little work as possible should be done in the inventory phase. Combining both approaches (moving processing out of the file inventory step and starting to process during inventory) would probably make the best use of resources.