scancode-toolkit
Feature Request: Parallelize all steps of scancode if possible
Short Description
In all steps except file scanning, scancode uses only one core, resulting in low CPU load and a long runtime.
Select Category
- [X] Enhancement
How This Feature will help you/your organization
It should reduce the overall runtime substantially.
Possible Solution/Implementation Details
I started scancode on a large codebase with the parameter `-n 30`. After two hours (on a 32-thread CPU), it was still in the state `Collect file inventory...`. The entire code base is on an M.2 SSD. The task manager shows little CPU, memory, and I/O activity.
Please parallelize all steps of scancode, including the file collection step.
Hardware / Software used
AMD Ryzen 9 3950X (16 cores, 32 threads), 32 GB RAM, M.2 SSD 970 EVO Plus 1 TB, Windows 10 Enterprise LTSC (1809), Python 3.9, ScanCode 30.0.0
@FrankHeimes Thank you... yes, this makes perfect sense, in particular when scanning whole devices, where there can be a lot of files and even the basic directory listing could benefit from speedups. Do you have experience with this kind of parallelization, for what could be described as a parallelized file system walk?
@pombredanne No, I don't. But I just did a quick test: walking all files and folders recursively using PowerShell took 11 seconds, and enumerating only the folders took 5 seconds, both for the large codebase mentioned above. That is fast enough that the walk itself doesn't require parallelization. Apparently, the processing of each individual file (reading, classification, etc.) is what takes most of the time.
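A quick way to reproduce such a measurement in Python (rather than PowerShell) is to time a bare recursive walk that records nothing but paths. This is a minimal sketch; `walk_paths` is an illustrative helper, not part of ScanCode:

```python
import os
import time

def walk_paths(root):
    """Yield the full path of every file under root, recording nothing else."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

# Time the bare enumeration of the current directory tree.
start = time.perf_counter()
count = sum(1 for _ in walk_paths("."))
elapsed = time.perf_counter() - start
print(f"enumerated {count} files in {elapsed:.2f}s")
```

If this completes in seconds while the inventory step takes hours, the bottleneck is the per-file work, not the walk.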
So I'd do it like this:
- Enumerate all files recursively, recording only their full paths in an in-memory list.
- Distribute the paths across all available cores for further processing in parallel, i.e., classification or whatever else needs to be done.
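The two steps above can be sketched with `multiprocessing.Pool`. This is a hedged illustration, not ScanCode's actual implementation; `scan_one` is a placeholder for the real per-file work (here it just returns the file size):

```python
import os
from multiprocessing import Pool

def walk_paths(root):
    """Step 1: enumerate all files recursively, recording only full paths."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths

def scan_one(path):
    """Placeholder for the per-file work (reading, classification, etc.)."""
    return path, os.path.getsize(path)

def scan_tree(root, processes=None):
    """Step 2: distribute the recorded paths across worker processes."""
    paths = walk_paths(root)
    with Pool(processes=processes) as pool:
        # chunksize batches paths per worker to amortize IPC overhead.
        return pool.map(scan_one, paths, chunksize=64)
```

With `processes=None`, `Pool` defaults to the machine's CPU count, so all 32 hardware threads would be used on the machine described above.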
A slightly different approach would be to start scanning immediately, while the file inventory is still being built. So even before the inventory is complete, worker threads would already begin processing files. This would alleviate the pain somewhat (unless inventorying a single file takes longer than scanning that file).
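This overlapped approach can be sketched by feeding a lazy path generator into `Pool.imap_unordered`, which pulls from the generator incrementally instead of requiring the full list up front. Again a hypothetical sketch, with `scan_one` standing in for the real per-file scan:

```python
import os
from multiprocessing import Pool

def iter_paths(root):
    """Lazily yield file paths while the directory walk is still in progress."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

def scan_one(path):
    """Placeholder for the real per-file scan."""
    return path, os.path.getsize(path)

def scan_overlapped(root, processes=None):
    """Yield scan results as they arrive; workers start scanning files
    before the inventory (the walk) has finished."""
    with Pool(processes=processes) as pool:
        # imap_unordered consumes the generator incrementally, so the walk
        # and the per-file scans run concurrently.
        yield from pool.imap_unordered(scan_one, iter_paths(root), chunksize=16)
```

`imap_unordered` also returns results in completion order, so progress reporting can start as soon as the first files finish.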
I also agree that as little work as possible should be done in the inventory phase. Combining both approaches (moving processing out of the file inventory step and starting to process during inventory) would probably make the best use of resources.