[RFC] make `deterministic` parameter more thorough?
Proposal
When `deterministic = true` in parameters, other configuration values should be overwritten with values that will reduce randomness, even if those values make LightGBM slower or more resource-intensive.
Specifically, `deterministic = true` should result in:
- `force_row_wise = true`
- `gpu_use_dp = true`
- `seed = 708` (or some other fixed, positive, non-0 number)
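For concreteness, here is a minimal sketch of what that override could look like on the C++ side. The `Config` struct and the function name `ApplyDeterministicOverrides` are placeholders for illustration, not LightGBM's actual internals:

```cpp
// Hypothetical sketch only; field names mirror the parameters above, but
// this is not LightGBM's real Config code.
struct Config {
  bool deterministic = false;
  bool force_row_wise = false;
  bool gpu_use_dp = false;
  int seed = 0;
};

// Apply determinism-related overrides after raw parameters are parsed,
// before any of these fields are consumed by training.
void ApplyDeterministicOverrides(Config* config) {
  if (!config->deterministic) return;
  config->force_row_wise = true;  // skip the row-wise/col-wise auto-selection
  config->gpu_use_dp = true;      // use double precision in GPU code paths
  config->seed = 708;             // fixed, positive, non-0 seed
}
```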
Motivation
In my experience responding to issues here and on Stack Overflow, LightGBM's sources of non-determinism are often misunderstood. It'd be simpler (for both users and maintainers) if it were possible to get deterministic results by just setting `{"deterministic": true}`.
Examples of the discussions I'm referring to:
- #6964
- #6790
- https://stackoverflow.com/questions/79212471/why-do-i-get-different-performance-on-different-runs-on-my-ml-model
- #6683
- #6671
- #6604
- #6203
- #6094
- #6069
- https://github.com/microsoft/LightGBM/issues/5887#issuecomment-1545935847
- #4722
- #4562
- #3761
- #3654
- #3372 (the PR that led to `deterministic` being introduced in #3494)
- #2085
Description
There are plenty of implementation details that would need to be worked out, but in short, I'm envisioning that this overriding would happen only on the C++ side, not in the wrappers.
And that a new documentation section should be added (separate from the `deterministic` parameter's entry in https://lightgbm.readthedocs.io/en/latest/Parameters.html), explaining the limitations of this approach and how to further improve reproducibility.
Other Questions
How to handle multithreading with OpenMP?
Proposing we do the following:
- audit all OpenMP pragmas and confirm that any where ordering matters (e.g. gradient accumulation) turn off parallelism when `deterministic = true` (as in #3494); see the sketch after this list
- add test(s) that `deterministic = true` is sufficient to get completely deterministic results for the CPU version
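As an illustration of the first option: OpenMP's `if` clause can switch off parallelism at runtime, so ordering-sensitive loops could fall back to a fixed sequential order when determinism is requested. A minimal sketch, using a contrived gradient-summing loop as a stand-in for LightGBM's real pragmas:

```cpp
#include <omp.h>
#include <vector>

// Contrived example of an ordering-sensitive reduction: summing float
// gradients. With the OpenMP `if` clause, the loop runs sequentially
// (and therefore in a fixed order) whenever `deterministic` is true.
double AccumulateGradients(const std::vector<float>& gradients,
                           bool deterministic) {
  double total = 0.0;
#pragma omp parallel for reduction(+ : total) if (!deterministic)
  for (int i = 0; i < static_cast<int>(gradients.size()); ++i) {
    total += gradients[i];
  }
  return total;
}
```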
But alternatively, we could do something simpler:
- set `num_threads = 1` whenever `deterministic = true`
Which is preferable?
Reviewers
@shiyu1994 @guolinke @StrikerRUS @borchero @jmoralez @btrotta I'd like to hear your thoughts on this. Thanks for your consideration!
I generally like this! In my experience, it is always slightly annoying to get ML libraries to behave purely deterministically. Having this feature built-in would be pretty nice.
That being said, I find it slightly weird to fix seed to a certain value inside the library code. I don't see how to achieve the desired behavior without it though...
I agree that setting `deterministic = true` should implicitly set the other params required for reproducibility. But I'm not sure which particular approach would be better...
> I'm not sure which particular approach would be better...

Does this comment just refer to the 2 different approaches I mentioned for dealing with multi-threading? If so, I think we should start with the other sources of non-determinism (like setting `force_row_wise = true`) and come back to multi-threading separately.
> I find it slightly weird to fix seed to a certain value inside the library code

To be clear, I'm proposing that passing `deterministic = true` set `seed = 708` (or some other non-0 value) only if `params` does not already set `seed` to some non-0 value.
That's the simplest way I can think of to achieve this behavior, and I strongly suspect that most users setting deterministic = true would not care about which particular random seed is used, just that the same one is used on every run.
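A minimal sketch of that check, assuming the raw key-value parameters are still available at override time (the function name `MaybeFixSeed` and the map lookup are my assumptions, not existing LightGBM code):

```cpp
#include <string>
#include <unordered_map>

// Pin the seed only when the user has not already chosen a non-0 one.
// Inspecting the raw key-value parameters (rather than the parsed Config
// field) distinguishes "user passed seed=0" from "seed left at its default".
void MaybeFixSeed(const std::unordered_map<std::string, std::string>& raw_params,
                  int* seed) {
  auto it = raw_params.find("seed");
  bool user_set_nonzero_seed =
      (it != raw_params.end()) && (std::stoi(it->second) != 0);
  if (!user_set_nonzero_seed) {
    *seed = 708;  // arbitrary fixed, non-0 value
  }
}
```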
@jameslamb
> Does this comment just refer to the 2 different approaches I mentioned for dealing with multi-threading?

Yes, my comment was about those two approaches.
Fully agree with this proposal. But for the CUDA version, we may still encounter nondeterminism due to the nondeterministic order of atomic operations when adding up the histograms. So far, I have no idea how to solve this case without sacrificing too much training speed.
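For context on why the ordering of atomics matters: floating-point addition is not associative, so the same set of histogram contributions can round to different totals depending on the order in which they are added. A small CPU-only demonstration of the underlying effect (not LightGBM or CUDA code):

```cpp
#include <cstdio>

int main() {
  // The same three values summed in two different orders give different
  // results, because float addition rounds after every step. A
  // nondeterministic order of atomicAdd calls on the GPU has the same
  // effect on histogram bins.
  float a = 1e8f, b = -1e8f, c = 1.0f;
  float left_to_right = (a + b) + c;  // 1.0f
  float other_order = a + (b + c);    // 0.0f: c is absorbed by b's magnitude
  std::printf("%g vs %g\n", left_to_right, other_order);
  return 0;
}
```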
Makes sense to me, I definitely think it is acceptable for the CUDA version to have weaker guarantees about determinism in exchange for being much faster.