scikit-learn
scikit-learn copied to clipboard
Using `rng=` keyword argument for NumPy randomness
In SPEC7 https://github.com/scientific-python/specs/pull/180, has two goals:
- Deprecate the use of
RandomStateandnp.random.seed - Standardize the usage of
rngfor setting seeding.
For 1, according to NEP19, I do not think NumPy wants to deprecate np.random.seed because they see valid use cases.
For 2, the primary reason around using rng instead of random_state is that it is a "better name" for NumPy's Random Generator. I am okay with keeping random_state and not have users go the pain of changing their code.
Currently, scikit-learn does not support generators because we tied it to https://github.com/scikit-learn/enhancement_proposals/pull/88. We wanted to use generators to cleanly switch to a different RNG behavior compared to RandomState. For me, I think they can be decoupled. If we tackle https://github.com/scikit-learn/enhancement_proposals/pull/88, we can fix it for both RandomState and Generators.
@scikit-learn/core-devs What do you think of SPEC7's proposal?
For 2, the primary reason around using rng instead of random_state is that it is a "better name" for NumPy's [3]Random Generator. I am okay with keeping random_state and not have users go the pain of changing their code.
+1
It was interesting to me to witness the reactions to numpy 2.0. Senior and visible people who had been vocal to criticize historical choices in numpy and slowness of numpy to move forward where also the same to complain about breakage across their stacks and environments.
It was interesting to me to witness the reactions to numpy 2.0. Senior and visible people who had been vocal to criticize historical choices in numpy and slowness of numpy to move forward where also the same to complain about breakage across their stacks and environments.
Oh well, I guess it's not that easy to make people happy :stuck_out_tongue_closed_eyes:
The random_state discussion got quite convoluted, and the last meeting that we talked about it a while ago, I didn't get the impression that we have a way forward.
I personally don't mind renaming random_state to rng IF it unlocks some backward compatibility concerns with fixing random_state. New name, new behavior, otherwise I don't care much about changing the name because of the SPEC.
I also wouldn't deprecate RandomState's usage, but happy to move all our own code / examples to the generators instead.
I agree that we should provide a benefit to the user if we make them change from random_state to rng. I don't know what that benefit would be though. Especially given that Numpy seems to have no plan to remove the old random number infrastructure. Basically, I am uneducated about the advantages and not aware of one but dismissive of it.
I like the idea of using the switch from random_state= to rng= to also fix the randomness ambiguities. However I also think that fixing both at the same time has a lower chance of happening than if we tackle them independently. The price to tackle them independently is more user upheaval, I think.
I am okay with keeping
random_stateand not have users go the pain of changing their code.
Thomas, is this a proposal to add rng= without removing/deprecating random_state=?
Thomas, is this a proposal to add rng= without removing/deprecating random_state=?
No. That proposal is to not use rng at all and continue to use random_state even when we support generators.
This means we'd also accept a random generator as argument to random_state? What problem do we solve with that? One that I can think of is people who use generators and expect them to work across the eco-system will have a good experience.
That seems like a good enough reason to allow people to pass a generator as random_state=.
For 1, according to NEP19, I do not think NumPy wants to deprecate
np.random.seedbecause they see valid use cases.
To be quite clear, in NEP 19, I am very much staking the ground that scikit-learn is not one of the valid use cases for np.random.seed(). It is a blight upon you and your users (to be clear, one that I inflicted). As the NumPy project, per se, we are not going to remove np.random.seed(), but as the author of np.random, I vigorously encourage you to pretend that it does not exist.
For 2, the primary reason around using
rnginstead ofrandom_stateis that it is a "better name" for NumPy's Random Generator. I am okay with keepingrandom_stateand not have users go the pain of changing their code.
SPEC7 does not recommend a name change just because we are dealing with Generator objects rather than RandomState objects. SPEC7 is recommending changing the semantics of what the default random_state=None does; therefore, one should probably go through a deprecation cycle with a name change to reflect the semantics change. SPEC7 recommends (in the eventual state) that one use rng=None keyword arguments, then doing rng = np.random.default_rng(rng) to coerce the argument to a Generator. default_rng() is analogous to check_random_state() except, crucially, that there is no default global Generator like there is a default global RandomState. np.random.default_rng(None) creates a completely new Generator. If there had been no semantics change, we would have recommended retaining the random_state name despite dropping RandomState instances.
The deprecation lifecycle proposed by SPEC7 has an intermediate state that is somewhat along the lines of "allow people to pass a Generator but don't forbid RandomState or break current code" but it does involve an rng= argument.
I personally don't mind renaming
random_statetorngIF it unlocks some backward compatibility concerns with fixingrandom_state.
Yes, this is the case.
As for what features using Generators (and Generators alone) can buy you: https://github.com/scikit-learn/scikit-learn/issues/16988#issuecomment-1518037853