CLI Option `processes`, what is a good value?

Open Jeeppler opened this issue 2 years ago • 3 comments

Description

The documentation contains very little information about the effect of the --processes parameter. It mentions the default value, but not much more.

I am struggling to understand how many processes would be a sensible value to improve the speed of the scan. The default is 1; I guess that is a conservative value to ensure ScanCode Toolkit works even in resource-restricted environments (e.g. a low-powered VM).

Should the number of processes equal the number of CPU cores? Is there a number of processes that is simply too many, given that not every part of the code can be parallelized?

While it is difficult to give a general recommendation, it would be beneficial to have more information so that users can find their individual "good" value.

Link to Documentation Page

https://scancode-toolkit.readthedocs.io/en/stable/cli-reference/list-options.html#all-core-scan-options

Select Category

  • [ ] Inconsistency
  • [ ] New Section Request
  • [x] General Improvement
  • [ ] Typo/Mistakes
  • [ ] Other

For more information have a look at: #608

Jeeppler avatar Jun 03 '22 20:06 Jeeppler

Excellent point!

A good rule of thumb for maxing out CPU usage is the number of CPUs minus one, to yield some to other things for sanity. I once got a report of someone running a scan https://github.com/rbarrois/python-semanticversion/issues/100 on a whole Debian packages pool on a small server with 192 cores, and there using 192 worked nicely. There is usually not much benefit from using a process count higher than the number of cores (I am thinking virtual cores here).

In general, ScanCode is mostly CPU-bound and will always benefit from more CPUs!

I routinely run scans on my quad-core/eight-thread laptop with -n7, and that's the best compromise if I do nothing else. Otherwise I use -n4 or -n5 when I am also working on something else. I do not see speed improvements from running more than 7 processes there, hence the rule of thumb of number of CPUs minus one. I like to think of this "minus one" as the "one for the teapot".

pombredanne avatar Jun 04 '22 10:06 pombredanne
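
The rule of thumb above (logical CPUs minus one, never below one) can be sketched in a few lines of Python. This is only an illustration, not part of ScanCode: the helper name `suggested_processes` and its `reserve` parameter are hypothetical, and `os.cpu_count()` reports the logical CPUs of the machine, which may be more than the CPUs actually available to the process (for example inside a container).

```python
import os


def suggested_processes(logical_cpus=None, reserve=1):
    """Return a value for --processes: logical CPUs minus `reserve`, at least 1.

    Hypothetical helper illustrating the "number of CPUs minus one" rule.
    """
    if logical_cpus is None:
        # os.cpu_count() counts logical CPUs (hardware threads); it may return None.
        logical_cpus = os.cpu_count() or 1
    # Never go below one process, even on a single-core machine.
    return max(1, logical_cpus - reserve)


if __name__ == "__main__":
    n = suggested_processes()
    # -n is the short form of --processes used in the comment above.
    print(f"suggested flag: -n{n}")
```

On an eight-thread laptop this yields 7, matching the -n7 mentioned above; passing a larger `reserve` leaves more room for other work on the same machine.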

@pombredanne I would like to contribute the explanation to the documentation. Where is the appropriate place/file to write this in?

Jeeppler avatar Jun 07 '22 16:06 Jeeppler

We would want to add documentation for this option, similar to the existing section at https://scancode-toolkit.readthedocs.io/en/stable/cli-reference/core-options.html#timeout-option . We could also expand this part: https://github.com/nexB/scancode-toolkit/blob/develop/docs/source/rst_snippets/core_options.rst

AyanSinhaMahapatra avatar Jun 07 '22 20:06 AyanSinhaMahapatra

In any case, a default of 1 process is counter-productive, as it will make ScanCode look much slower than it could be for starters. I agree that the number of available CPU cores - 1 (or at least the number of available CPU cores / 2) should be a good default value.

sschuberth avatar Jul 18 '23 07:07 sschuberth
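
To see how the two proposed defaults differ, the short sketch below evaluates both for a few core counts. The function names are hypothetical, and two details are assumptions not stated in the thread: the halving uses integer division, and both results are clamped to a minimum of one process.

```python
def cores_minus_one(cores):
    # Proposed default: leave one core free, but always run at least one process.
    return max(1, cores - 1)


def half_of_cores(cores):
    # More conservative proposal: half the cores (integer division assumed).
    return max(1, cores // 2)


if __name__ == "__main__":
    for cores in (1, 2, 4, 8, 192):
        print(f"{cores:>3} cores: cores-1 = {cores_minus_one(cores):>3}, "
              f"cores/2 = {half_of_cores(cores):>3}")
```

Both candidates fall back to the current default of 1 on a single-core machine and diverge more as the core count grows.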