[ENH] `random_state` for `BaseDistribution.sample`
It might be good to have a random_state parameter in the sample method of BaseDistribution and all children, to ensure that sampling with a random seed is possible.
Open questions:
- should the parameter be there, in `sample`? Alternative locations are the config via `set_config`, or the `__init__`.
- how to handle composite distributions? They would need to generate dependent `random_state` for components, how should that be done in a reliable way? Sample the `random_seed` uniformly from [0,1] or the full domain?
Before implementing, I would appreciate input on the design questions.
FYI @joshdunnlime, @SimonBlanke, @tingiskhan
Hey @fkiraly , is it okay if I take this on? Plan would be:
- add `random_state=None` to `BaseDistribution.sample` (and every override), normalize it via a helper that returns a `np.random.Generator`, and swap all direct `np.random` calls over to that generator;
- extend stochastic helper methods (`mean`, `var`, `energy`, `cdf`, `pdfnorm`, etc.) with the same optional `random_state`, forwarding it to the internal `sample` calls;
- give composite/delegate distributions (mixtures, IID, transforms, TF adapters) a small utility to spawn child generators so component samples stay deterministic.
Please let me know if you’re happy with that approach, and I’ll put a PR together.
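A minimal sketch of the first bullet, assuming a hypothetical `_coerce_random_state` helper (all names here are illustrative, not the final API):

```python
import numpy as np


def _coerce_random_state(random_state=None):
    """Coerce None, an int seed, or a Generator into a np.random.Generator."""
    if isinstance(random_state, np.random.Generator):
        return random_state
    # np.random.default_rng accepts None (fresh entropy) or an int seed
    return np.random.default_rng(random_state)


class ToyNormal:
    """Stand-in for a BaseDistribution child, for illustration only."""

    def __init__(self, mu=0.0, sigma=1.0):
        self.mu = mu
        self.sigma = sigma

    def sample(self, n_samples=1, random_state=None):
        rng = _coerce_random_state(random_state)
        # generator-based draw replaces a direct np.random.normal call
        return rng.normal(self.mu, self.sigma, size=n_samples)


d = ToyNormal()
a = d.sample(3, random_state=42)
b = d.sample(3, random_state=42)
# same seed, same draws
```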
Hi @fkiraly, I've tried to resolve the issue.
If possible, can you check: https://github.com/sktime/skpro/pull/662
> extend stochastic helper methods (`mean`, `var`, `energy`, `cdf`, `pdfnorm`, etc.) with the same optional `random_state`, forwarding it to the internal `sample` calls;
I see, the issue is that many methods have to be extended if we use a Monte Carlo approximation.
Would, in this case, a single `random_state` in the `__init__` not be a better design, since it avoids too many changes? This is also closer to how `random_state` is managed in scikit-learn.
Please note that above I said: "Before implementing, I would appreciate input on the design questions." - of course your prototype is appreciated, @Aymuos22, but we cannot guarantee that it would be the final design.
Hi @fkiraly
Got it - adding `random_state` to every Monte Carlo call was getting noisy. I'm good with the sklearn-style approach: accept `random_state` in `__init__`, keep an internal generator, and allow optional per-call overrides. If you approve it, I'll update the prototype, or hold off until we lock in the final design.
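As one sketch of that sklearn-style approach (toy class, illustrative names only, not the final API):

```python
import numpy as np


class ToyDist:
    """Sketch: random_state in __init__, with an optional per-call override."""

    def __init__(self, random_state=None):
        self.random_state = random_state
        # internal generator, advanced by seedless sample() calls
        self._rng = np.random.default_rng(random_state)

    def sample(self, n_samples=1, random_state=None):
        # a per-call seed wins; otherwise fall back to the instance generator
        if random_state is not None:
            rng = np.random.default_rng(random_state)
        else:
            rng = self._rng
        return rng.uniform(size=n_samples)


# per-call override makes draws independent of the instance seed
y = ToyDist(random_state=0).sample(2, random_state=123)
z = ToyDist(random_state=999).sample(2, random_state=123)
```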
I've modified the prototype.
Feedback needed.
Following sklearn closely seems sensible. Although the `__init__` would be great for controlling randomness on the estimator itself, it's quite plausible that a user or dev would want to set the training seed separately from the Monte Carlo sample seed. I can think of plenty of use cases, especially during debugging.
Perhaps this should live in the distribution `__init__`? I guess that would mean passing `random_state` to `predict_proba`. Then again, you will probably want control of the seed without having to retrain the model, which implies just passing it as a kwarg to the `sample` method.
I also really like the idea of being able to set a global config for each of:
- `estimator_seed`
- `sample_seed`
- `training_seed`: maybe there is a better name, but for things such as `RandomGridSearch` and CV splitters (if/when needed)

I appreciate that this might extend the scope of this ENH, but it would be worth keeping in mind.
@joshdunnlime, good point, we need to consider the situation where we
- fit/`predict_proba` with an estimator that may itself be random
- the `predict_proba` returns a distribution

where the distribution in step 2 gets produced by the estimator in step 1. If the `random_seed` is in the `__init__`, the user does not get to set it.
That is a very important observation imo, although I am not sure how to best take it into account.
Global configs are a good idea which has been previously discussed; I would very much appreciate comments on how it could look - as speculative code - for distributions, and in general.
I have opened an issue in scikit-base since that is a general topic.
https://github.com/sktime/skbase/issues/478
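As purely speculative code for how a global config with the three seed fields proposed above could look (all names are placeholders, not an existing skbase API):

```python
# speculative sketch only: a module-level config with three seed fields
_GLOBAL_CONFIG = {
    "estimator_seed": None,  # randomness of the estimator itself
    "sample_seed": None,     # randomness of distribution sampling
    "training_seed": None,   # e.g. random search, CV splitters
}


def set_global_config(**fields):
    """Set one or more global seed fields, rejecting unknown keys."""
    unknown = set(fields) - set(_GLOBAL_CONFIG)
    if unknown:
        raise ValueError(f"unknown config fields: {sorted(unknown)}")
    _GLOBAL_CONFIG.update(fields)


def get_global_config(key):
    """Read back a single global seed field."""
    return _GLOBAL_CONFIG[key]


set_global_config(sample_seed=42)
```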
Regarding the highlighted scenario "user does not get to set random_seed for a distribution returned by predict_proba", I note - interestingly - that two of the three designs do not have this problem:
- `random_seed` in `sample` and other method args. This may require adding it to a lot of methods, as @Aymuos22 has indicated, but we do not necessarily need to?
- `random_seed` passed in the `set_config` method
In the cases where mean etc is getting sampled, the randomisation is not actually a property of the sampler, but of the approximation method. Maybe a good compromise is:
- having `random_seed` in `sample` only
- for the approximation methods, `random_seed` can be set in config, like the other approximation parameters such as the sample size of the Monte Carlo sample - perhaps as a config field `"approx_random_seed"`
this feels like consistent design, also see https://github.com/sktime/skpro/pull/656.
`random_seed` in `sample` also feels more stringent design-wise, since the seed is not an intrinsic property of the object being modelled - the distribution; whereas in scikit-learn, the `random_seed` is an intrinsic property of the algorithm that gets run, specifically, the instance
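A sketch of what this compromise could look like, using a toy distribution (the config field name `approx_random_seed` follows the text above; everything else is illustrative):

```python
import numpy as np


class ToyDist:
    """Sketch: seed in sample args, approximation seed in the config."""

    def __init__(self):
        self._config = {"approx_n_samples": 1000, "approx_random_seed": None}

    def set_config(self, **fields):
        self._config.update(fields)
        return self

    def sample(self, n_samples=1, random_state=None):
        rng = np.random.default_rng(random_state)
        return rng.exponential(size=n_samples)

    def mean(self):
        # Monte Carlo approximation; its seed is a config field, not a method arg
        n = self._config["approx_n_samples"]
        seed = self._config["approx_random_seed"]
        return self.sample(n, random_state=seed).mean()


d = ToyDist().set_config(approx_random_seed=7)
# with a fixed approx seed, repeated calls give identical approximations
```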
Hi @fkiraly
Right now `sample` only uses the RNG I set in `__init__`, so if a forecaster returns a distribution from `predict_proba`, the caller can't tweak the seed.
Plan is:
- give `sample` (and every override) a `random_state=None` kwarg again; normalize it on each call, falling back to the instance RNG if the user doesn't pass one;
- for composites like Mixture/IID/transforms/TF adapters, take that per-call generator and spawn child generators so component draws stay deterministic;
- add a config knob (e.g. `approx_random_seed`) for the Monte Carlo defaults, so `mean`/`var`/`energy` can stay reproducible without threading the kwarg everywhere;
- update docs/tests/templates so users see how to pass the seed.
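The composite step could be sketched via `np.random.SeedSequence.spawn`, which derives independent but reproducible child streams (toy class, illustrative only):

```python
import numpy as np


class ToyMixtureOfNormals:
    """Sketch: composite distribution spawning one child generator per component."""

    def __init__(self, locs):
        self.locs = locs

    def sample(self, n_samples=1, random_state=None):
        # one spawned SeedSequence per component keeps component draws deterministic
        ss = np.random.SeedSequence(random_state)
        children = ss.spawn(len(self.locs))
        draws = [
            np.random.default_rng(child).normal(loc=loc, size=n_samples)
            for child, loc in zip(children, self.locs)
        ]
        return np.stack(draws)


m = ToyMixtureOfNormals([0.0, 5.0])
x1 = m.sample(4, random_state=1)
x2 = m.sample(4, random_state=1)
# same parent seed reproduces every component's draws
```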