[RFC] make `deterministic` parameter more thorough?
Proposal
When `deterministic = true` in parameters, other configuration values should be overwritten with values that will reduce randomness, even if those values make LightGBM slower or more resource-intensive.
Specifically, `deterministic = true` should result in:
- `force_row_wise = true`
- `gpu_use_dp = true`
- `seed = 708` (or some other fixed, positive, non-0 number)
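For concreteness, here is a minimal sketch of what that override could look like on the C++ side. The `Config` struct and the function name `ApplyDeterministicOverrides` are placeholders for illustration, not LightGBM's actual internals:

```cpp
// Hypothetical sketch only; field names mirror the parameters above, but
// this is not LightGBM's real Config code.
struct Config {
  bool deterministic = false;
  bool force_row_wise = false;
  bool gpu_use_dp = false;
  int seed = 0;
};

// Apply determinism-related overrides after raw parameters are parsed,
// before any of these fields are consumed by training.
void ApplyDeterministicOverrides(Config* config) {
  if (!config->deterministic) return;
  config->force_row_wise = true;  // skip the row-wise/col-wise auto-selection
  config->gpu_use_dp = true;      // use double precision in GPU code paths
  config->seed = 708;             // fixed, positive, non-0 seed
}
```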
Motivation
In my experience responding to issues here and on Stack Overflow, LightGBM's sources of non-determinism are often misunderstood. It'd be simpler (for both users and maintainers) if it were possible to get deterministic results by just setting `{"deterministic": true}`.
Examples of the discussions I'm referring to:
- #6964
- #6790
- https://stackoverflow.com/questions/79212471/why-do-i-get-different-performance-on-different-runs-on-my-ml-model
- #6683
- #6671
- #6604
- #6203
- #6094
- #6069
- https://github.com/microsoft/LightGBM/issues/5887#issuecomment-1545935847
- #4722
- #4562
- #3761
- #3654
- #3372 (the PR that led to `deterministic` being introduced in #3494)
- #2085
Description
There are plenty of implementation details that would need to be worked out, but in short, I'm envisioning that this overriding would happen only on the C++ side, not in the wrappers.
And that a new documentation section should be added (separate from the `deterministic` parameter's entry in https://lightgbm.readthedocs.io/en/latest/Parameters.html), explaining the limitations of this approach and how to further improve reproducibility.
Other Questions
How to handle multithreading with OpenMP?
Proposing we do the following:
- audit all OpenMP pragmas and confirm that any where ordering matters (e.g. gradient accumulation) turn off parallelism when `deterministic = true` (as in #3494); see the sketch after this list
- add test(s) that `deterministic = true` is sufficient to get completely deterministic results for the CPU version
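As an illustration of the first option: OpenMP's `if` clause can switch off parallelism at runtime, so ordering-sensitive loops could fall back to a fixed sequential order when determinism is requested. A minimal sketch, using a contrived gradient-summing loop as a stand-in for LightGBM's real pragmas:

```cpp
#include <omp.h>
#include <vector>

// Contrived example of an ordering-sensitive reduction: summing float
// gradients. With the OpenMP `if` clause, the loop runs sequentially
// (and therefore in a fixed order) whenever `deterministic` is true.
double AccumulateGradients(const std::vector<float>& gradients,
                           bool deterministic) {
  double total = 0.0;
#pragma omp parallel for reduction(+ : total) if (!deterministic)
  for (int i = 0; i < static_cast<int>(gradients.size()); ++i) {
    total += gradients[i];
  }
  return total;
}
```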
But alternatively, we could do something simpler:
- set `num_threads = 1` whenever `deterministic = true`
Which is preferable?
Reviewers
@shiyu1994 @guolinke @StrikerRUS @borchero @jmoralez @btrotta I'd like to hear your thoughts on this. Thanks for your consideration!
I generally like this! In my experience, it is always slightly annoying to get ML libraries to behave purely deterministically. Having this feature built-in would be pretty nice.
That being said, I find it slightly weird to fix seed to a certain value inside the library code. I don't see how to achieve the desired behavior without it though...
I agree that setting `deterministic = true` should implicitly set the other params required for reproducibility. But I'm not sure which particular approach would be better...
> I'm not sure which particular approach would be better...

Does this comment just refer to the 2 different approaches I mentioned for dealing with multi-threading? If so, I think we should start with the other sources of non-determinism (like setting `force_row_wise = true`) and come back to multi-threading separately.
> I find it slightly weird to fix seed to a certain value inside the library code

To be clear, I'm proposing that passing `deterministic = true` set `seed = 708` (or some other non-0 value) only if `params` does not already set `seed` to some non-0 value.
That's the simplest way I can think of to achieve this behavior, and I strongly suspect that most users setting deterministic = true would not care about which particular random seed is used, just that the same one is used on every run.
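A minimal sketch of that check, assuming the raw key-value parameters are still available at override time (the function name `MaybeFixSeed` and the map lookup are my assumptions, not existing LightGBM code):

```cpp
#include <string>
#include <unordered_map>

// Pin the seed only when the user has not already chosen a non-0 one.
// Inspecting the raw key-value parameters (rather than the parsed Config
// field) distinguishes "user passed seed=0" from "seed left at its default".
void MaybeFixSeed(const std::unordered_map<std::string, std::string>& raw_params,
                  int* seed) {
  auto it = raw_params.find("seed");
  bool user_set_nonzero_seed =
      (it != raw_params.end()) && (std::stoi(it->second) != 0);
  if (!user_set_nonzero_seed) {
    *seed = 708;  // arbitrary fixed, non-0 value
  }
}
```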
@jameslamb
> Does this comment just refer to the 2 different approaches I mentioned for dealing with multi-threading?

Yes, my comment was about those two approaches.
Fully agree with this proposal. But for the CUDA version, we may still encounter nondeterminism due to the nondeterministic order of atomic operations when adding up the histograms. So far, I have no idea how to solve this case without sacrificing too much training speed.
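For context on why the ordering of atomics matters: floating-point addition is not associative, so the same set of histogram contributions can round to different totals depending on the order in which they are added. A small CPU-only demonstration of the underlying effect (not LightGBM or CUDA code):

```cpp
#include <cstdio>

int main() {
  // The same three values summed in two different orders give different
  // results, because float addition rounds after every step. A
  // nondeterministic order of atomicAdd calls on the GPU has the same
  // effect on histogram bins.
  float a = 1e8f, b = -1e8f, c = 1.0f;
  float left_to_right = (a + b) + c;  // 1.0f
  float other_order = a + (b + c);    // 0.0f: c is absorbed by b's magnitude
  std::printf("%g vs %g\n", left_to_right, other_order);
  return 0;
}
```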
Makes sense to me, I definitely think it is acceptable for the CUDA version to have weaker guarantees about determinism in exchange for being much faster.