
[ENH] `random_state` for `BaseDistribution.sample`

Open fkiraly opened this issue 1 month ago • 10 comments

It might be good to have a `random_state` parameter in the `sample` method of `BaseDistribution` and all children, to ensure that reproducible, seeded sampling is possible.

Open questions:

  • should the parameter be there, in `sample`? Alternative locations are the config via `set_config`, or the `__init__`.
  • how to handle composite distributions? They would need to generate dependent `random_state` values for their components - how should that be done reliably? Sample the random seed uniformly from [0, 1] or from the full domain?

Before implementing, I would appreciate input on the design questions.

fkiraly avatar Nov 27 '25 22:11 fkiraly

FYI @joshdunnlime, @SimonBlanke, @tingiskhan

fkiraly avatar Nov 27 '25 22:11 fkiraly

Hey @fkiraly, is it okay if I take this on? The plan would be:

  • add `random_state=None` to `BaseDistribution.sample` (and every override), normalize it via a helper that returns an `np.random.Generator` (see the sketch after this list), and swap all direct `np.random` calls over to that generator;

  • extend stochastic helper methods (mean, var, energy, cdf, pdfnorm, etc.) with the same optional random_state, forwarding it to the internal sample calls;

  • give composite/delegate distributions (mixtures, IID, transforms, TF adapters) a small utility to spawn child generators so component samples stay deterministic.
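
For concreteness, a minimal sketch of the helper from the first bullet - the name `_coerce_random_state` is hypothetical, not existing skpro API:

```python
import numpy as np

def _coerce_random_state(random_state=None):
    """Coerce seed-like input to an np.random.Generator.

    Accepts None (fresh entropy), an int seed, or an existing Generator,
    in the spirit of sklearn.utils.check_random_state.
    """
    if isinstance(random_state, np.random.Generator):
        return random_state
    # default_rng accepts None, int seeds, and SeedSequence objects
    return np.random.default_rng(random_state)

# all direct np.random calls would then route through such a generator
rng = _coerce_random_state(42)
x = rng.normal(size=3)
```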

Please let me know if you’re happy with that approach, and I’ll put a PR together.

Aymuos22 avatar Nov 28 '25 06:11 Aymuos22

Hi @fkiraly, I've tried to resolve the issue.

If possible, can you check: https://github.com/sktime/skpro/pull/662

Aymuos22 avatar Nov 28 '25 07:11 Aymuos22

> extend stochastic helper methods (mean, var, energy, cdf, pdfnorm, etc.) with the same optional random_state, forwarding it to the internal sample calls;

I see, the issue is that many methods have to be extended if we use a Monte Carlo approximation.

Would a single `random_state` in the `__init__` not be a better design in this case, since it avoids so many changes? This is also closer to how `random_state` is managed in scikit-learn.
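
For illustration, a minimal sketch of the sklearn-style alternative - a toy class with hypothetical names, not actual skpro code:

```python
import numpy as np

class NormalSketch:
    """Toy distribution with sklearn-style seeding in the constructor."""

    def __init__(self, mu, sigma, random_state=None):
        self.mu = mu
        self.sigma = sigma
        self.random_state = random_state
        # one generator per instance; Monte Carlo approximations of
        # mean, var, energy, ... would all reuse it, so no per-method
        # random_state kwargs are needed
        self._rng = np.random.default_rng(random_state)

    def sample(self, n_samples=1):
        return self._rng.normal(self.mu, self.sigma, size=n_samples)

d = NormalSketch(0.0, 1.0, random_state=42)
x = d.sample(5)  # reproducible across fresh instances with the same seed
```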

Please note that above I said: "Before implementing, I would appreciate input on the design questions." - of course your prototype is appreciated, @Aymuos22, but we cannot guarantee that it would be the final design.

fkiraly avatar Nov 28 '25 08:11 fkiraly

Hi @fkiraly

Got it - adding `random_state` to every Monte Carlo call was getting noisy. I'm good with the sklearn-style approach: accept `random_state` in `__init__`, keep an internal generator, and allow optional per-call overrides. If you approve, I'll update the prototype; otherwise I'll hold off until we lock in the final design.

Aymuos22 avatar Nov 28 '25 09:11 Aymuos22

I've modified the prototype; feedback needed.

Aymuos22 avatar Nov 28 '25 10:11 Aymuos22

Following sklearn closely seems sensible. Although the `__init__` would be great for controlling randomness on the estimator itself, it's quite plausible that a user or dev would want to set the training seed separately from the Monte Carlo sample seed. I can think of plenty of use cases, especially during debugging.

Perhaps this should live in the distribution `__init__`? I guess that would mean passing `random_state` through to `predict_proba`. Then again, you will probably want control of the seed without having to retrain the model, which implies just passing it as a kwarg to the `sample` method.

I also really like the idea of being able to set a global config for each of:

  • estimator_seed
  • sample_seed
  • training_seed: maybe there is a better name, but for things such as RandomGridSearch and CV splitters (if/when needed)

I appreciate this might be extending the scope of this ENH, but it would be worth keeping in mind.
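
As speculative code only - neither the function nor the config fields exist in skpro or skbase, all names are invented for discussion - the suggestion could look roughly like this:

```python
# purely speculative sketch of a process-wide seed registry
_GLOBAL_SEEDS = {"estimator_seed": None, "sample_seed": None, "training_seed": None}

def set_global_config(**seeds):
    """Set process-wide default seeds; rejects unknown fields."""
    for key, value in seeds.items():
        if key not in _GLOBAL_SEEDS:
            raise ValueError(f"unknown config field: {key}")
        _GLOBAL_SEEDS[key] = value

# usage as proposed above
set_global_config(estimator_seed=42, sample_seed=7, training_seed=0)
```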

joshdunnlime avatar Nov 28 '25 21:11 joshdunnlime

@joshdunnlime, good point, we need to consider the situation where we

  1. fit/predict_proba with an estimator that may itself be random
  2. the predict_proba returns a distribution

where the distribution in step 2 gets produced by the estimator in step 1. If the `random_seed` is in the `__init__`, the user does not get to set it.

That is a very important observation imo, although I am not sure how to best take it into account.

Global configs are a good idea that has been previously discussed; I would very much appreciate comments on what this could look like - as speculative code - for distributions, and in general.

I have opened an issue in scikit-base since that is a general topic. https://github.com/sktime/skbase/issues/478

fkiraly avatar Nov 29 '25 09:11 fkiraly

Regarding the highlighted scenario "user does not get to set random_seed for a distribution returned by predict_proba", I note - interestingly - that two of the three designs do not have this problem:

  • `random_seed` in `sample` and other method args. This may require adding it to a lot of methods, as @Aymuos22 has indicated - but perhaps we do not necessarily need to?
  • `random_seed` passed via the `set_config` method

In the cases where `mean` etc. is computed by sampling, the randomisation is not actually a property of the sampler, but of the approximation method. Maybe a good compromise is:

  • having `random_seed` in `sample` only
  • for the approximation methods, `random_seed` can be set in the config, like other approximation parameters such as the Monte Carlo sample size - perhaps as a config field `approx_random_seed`

This feels like a consistent design; also see https://github.com/sktime/skpro/pull/656.

`random_seed` in `sample` also feels more stringent design-wise, since the seed is not an intrinsic property of the object being modelled - the distribution - whereas in scikit-learn, the `random_state` is an intrinsic property of the algorithm that gets run, specifically, the instance.
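
As a speculative sketch of that compromise - `DistSketch` is a toy stand-in, and `approx_random_seed` is a proposed config field, not an existing skpro option:

```python
import numpy as np

class DistSketch:
    """Toy stand-in for BaseDistribution, showing the two proposed seams."""

    def __init__(self):
        self._config = {"approx_random_seed": None, "num_samples": 1000}

    def set_config(self, **config):
        self._config.update(config)
        return self

    def sample(self, n_samples=1, random_state=None):
        # the only method-level seed under the compromise design
        rng = np.random.default_rng(random_state)
        return rng.standard_normal(n_samples)

    def mean(self):
        # MC approximation seeded from the config, like other
        # approximation parameters such as the sample size
        x = self.sample(
            self._config["num_samples"],
            random_state=self._config["approx_random_seed"],
        )
        return x.mean()

d = DistSketch().set_config(approx_random_seed=42)
assert d.mean() == d.mean()  # reproducible approximation
```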

fkiraly avatar Nov 29 '25 09:11 fkiraly

Hi @fkiraly

Right now `sample` only uses the RNG set in `__init__`, so if a forecaster returns a distribution from `predict_proba`, the caller can't tweak the seed.

Plan is:

  • Give `sample` (and every override) a `random_state=None` kwarg again; normalize it on each call, falling back to the instance RNG if the user doesn't pass one.

  • For composites like Mixture/IID/transforms/TF adapters, take that per-call generator and spawn child generators so component draws stay deterministic (see the sketch after this list).

  • Add a config knob (e.g. approx_random_seed) for the Monte Carlo defaults so mean/var/energy can stay reproducible without threading the kwarg everywhere.

  • Update docs/tests/templates so users see how to pass the seed.
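
A minimal sketch of the first two bullets, using toy classes rather than skpro API; note `Generator.spawn` needs numpy >= 1.25, older versions would spawn via `np.random.SeedSequence` instead:

```python
import numpy as np

class ComponentSketch:
    """Toy component distribution."""

    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def sample(self, n_samples, rng):
        return rng.normal(self.mu, self.sigma, size=n_samples)

class MixtureSketch:
    """Toy mixture: per-call override plus child-generator spawning."""

    def __init__(self, components, weights, random_state=None):
        self.components = components
        self.weights = weights
        self._rng = np.random.default_rng(random_state)  # instance RNG

    def sample(self, n_samples=1, random_state=None):
        # per-call override, falling back to the instance generator
        rng = self._rng if random_state is None else np.random.default_rng(random_state)
        # one child generator per component keeps component draws
        # deterministic given the parent seed (needs numpy >= 1.25)
        children = rng.spawn(len(self.components))
        idx = rng.choice(len(self.components), size=n_samples, p=self.weights)
        draws = [c.sample(n_samples, g) for c, g in zip(self.components, children)]
        return np.choose(idx, draws)

mix = MixtureSketch([ComponentSketch(0, 1), ComponentSketch(5, 2)], [0.3, 0.7])
a = mix.sample(4, random_state=1)
b = mix.sample(4, random_state=1)
assert np.allclose(a, b)  # same per-call seed, same draws
```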

Aymuos22 avatar Nov 29 '25 13:11 Aymuos22