LightGBM
Dataset construction uses all threads on the machine
Description
Passing nthreads to the lightgbm.Dataset constructor (via the params parameter) doesn't seem to be taken into account: construct seems to use all cores on the machine in some phases. I would expect construct to be bound by the maximum number of threads specified.
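For readers trying to follow along, here is a minimal sketch of the kind of call being described. It is not the reporter's actual code: the helper names `dataset_params` and `construct_with_thread_cap`, the random NumPy data, and the default sizes are all assumptions made for illustration. It uses `num_threads`, of which `nthreads` is a documented alias.

```python
def dataset_params(num_threads):
    """Build Dataset params that request a cap on the number of threads.

    Hypothetical helper, for illustration only.
    """
    return {"num_threads": num_threads}


def construct_with_thread_cap(num_threads=2, n_rows=10_000, n_cols=20):
    """Construct a Dataset from random data with a requested thread cap.

    Imports are local so the helper above can be used without
    lightgbm or numpy installed.
    """
    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.random((n_rows, n_cols))  # synthetic features (assumption)
    y = rng.random(n_rows)            # synthetic regression target (assumption)

    ds = lgb.Dataset(X, label=y, params=dataset_params(num_threads))
    ds.construct()  # the step reported to use all cores in some phases
    return ds
```

Calling `construct_with_thread_cap(num_threads=2)` while watching CPU usage would be one way to check whether the cap is respected for in-memory data; whether it reproduces the behavior seen with a Sequence input is not established here.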
Reproducible example
Loading large dataset via a hand-crafted Sequence object.
Environment info
LightGBM version or commit hash: 3.2.1
Thanks for using LightGBM! We need some more information from you before we can help.
- Are you able to provide a minimal, reproducible example that demonstrates this behavior?
- "Loading large dataset via a hand-crafted Sequence object" is not sufficient information for maintainers here to understand what you did and offer a suggestion without significant guessing.
- Can you please provide some of the other information that was requested in the issue template when you clicked "new issue"? Like:
  - what programming language are you using?
  - how did you install LightGBM?
- Can you try to install the latest version of LightGBM from source in this repo, or at least the latest released version (v3.3.2), and let us know if you still see this behavior?
I think this issue and #4598 have the same root cause.
Investigating #4598, I found substantial evidence that passing num_threads through Dataset parameters should correctly result in changing the number of threads used in Dataset construction: https://github.com/microsoft/LightGBM/issues/4598#issuecomment-1094194477.
I really think we need a reproducible example to be able to investigate this report further. Otherwise, solving this conclusively will require significant research and guessing to try to figure out what combination of parameters, LightGBM version, and Python code reproduces this behavior.
#4598 seems to investigate whether or not parallelism is enabled. The intended claim of this issue is that during some stages of the dataset construction ALL threads on the machine are used, ignoring the actual num_threads. The dataset doesn't matter much; what matters is the behavior of the parallelism. At best, I can provide you with a screenshot of htop during the dataset construction.
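As a possible alternative to an htop screenshot, the sketch below records the peak number of OS-level threads in the current process while a callable runs. It uses only the standard library and reads `/proc/self/status`, so it assumes Linux; the function names are hypothetical. Thread count is only a rough proxy for what htop shows: it says how many threads exist, not how busy they are, and the sampler thread itself is included in the count.

```python
import threading
import time


def os_thread_count():
    """Return the number of OS threads in this process, or None if unavailable.

    Reads the "Threads:" field of /proc/self/status (Linux only).
    """
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("Threads:"):
                    return int(line.split()[1])
    except OSError:
        return None
    return None


def peak_threads_during(fn, interval=0.01):
    """Run fn() and return (result, peak OS thread count seen while it ran).

    The peak is 0 when the thread count cannot be read on this platform.
    """
    peak = [os_thread_count() or 0]
    stop = threading.Event()

    def sample():
        # Poll the thread count until the main call finishes.
        while not stop.is_set():
            n = os_thread_count()
            if n is not None:
                peak[0] = max(peak[0], n)
            time.sleep(interval)

    sampler = threading.Thread(target=sample, daemon=True)
    sampler.start()
    try:
        result = fn()
    finally:
        stop.set()
        sampler.join()

    # Take one last reading in case fn() finished between samples.
    final = os_thread_count()
    if final is not None:
        peak[0] = max(peak[0], final)
    return result, peak[0]
```

Wrapping the construct call, for example `peak_threads_during(lambda: dataset.construct())` with `dataset` being an unconstructed Dataset, and comparing the peak against the requested num_threads would give a number that can be pasted into the issue.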