
About the `future.seed` default

Open pat-s opened this issue 4 years ago • 1 comments

Quoting from our mail conversation

The current default with future.seed=FALSE is not set in stone but I need to think about it more. Ideally, if we could detect whether parallel RNG is needed or not, it could be set automatically. But I doubt that will ever be possible - it would require annotating all functions to specify whether they use RNGs or not. There was also the discussion of detecting when RNGs were indeed used even if future.seed=FALSE. If detected, a warning or even an error could be produced. This would not be too hard to implement and the overhead would be minimal. This should prevent calling future_lapply() et al. without future.seed=TRUE when truly needed. (This has been on my radar for a while)
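A minimal sketch of the detection idea, assuming that "an RNG was used" can be inferred from a change in the global `.Random.seed`. The helper `uses_rng()` is hypothetical and not part of future or future.apply; a real implementation would presumably run such a check on the worker and signal a warning or error when `future.seed = FALSE`.

```r
# Hypothetical helper: evaluate fun() and report whether the global RNG
# state changed while it ran.
uses_rng <- function(fun) {
  # `$` on an environment returns NULL if .Random.seed does not exist yet,
  # so an RNG that is initialised during fun() also counts as a change.
  before <- globalenv()$.Random.seed
  value <- fun()
  after <- globalenv()$.Random.seed
  list(value = value, rng_used = !identical(before, after))
}

res_plain  <- uses_rng(function() sum(1:10))  # no random numbers drawn
res_random <- uses_rng(function() rnorm(1))   # draws one random number

res_plain$rng_used
res_random$rng_used
```

The cost is two reads of `.Random.seed` and one comparison per evaluation, which fits the remark that the overhead would be minimal.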

Regarding a dynamic setting of future.seed: Would it make sense to check which future::plan is requested? If it is a parallel one, turn it on internally by default - if not, leave it off.

This way users would use the default RNG kind when using plan(sequential) and the "L'Ecuyer-CMRG" one in parallel scenarios. Due to future.seed = TRUE, both would magically work with just set.seed() and there is no overhead when it is not needed.
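For reference, this is how things behave today when `future.seed = TRUE` is passed explicitly: the same `set.seed()` call is expected to give the same draws under a sequential and a parallel plan. The sketch assumes the future and future.apply packages are installed. The dynamic default proposed here does not exist, so the argument is set by hand.

```r
library(future)
library(future.apply)

draw <- function(i) rnorm(1)

# Sequential backend
plan(sequential)
set.seed(42)
seq_res <- future_lapply(1:4, draw, future.seed = TRUE)

# Parallel backend with two background R sessions
plan(multisession, workers = 2)
set.seed(42)
par_res <- future_lapply(1:4, draw, future.seed = TRUE)

plan(sequential)  # shut the workers down again

# Compare the draws from the two backends
identical(seq_res, par_res)
```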

For scenarios where speed matters and reproducibility does not, it would be great if users could turn future.seed off on their side rather than rely on what a package developer set it to within the package. Hence I'd like an option to override future.seed at the user level when setting the future::plan(). This is unrelated to the other ideas in here.

With all the options from above, practical scenarios could look as follows:

  • Parallel processes are reproducible by default because "L'Ecuyer-CMRG" is used via a dynamic future.seed argument
  • No overhead for sequential runs (future.seed = FALSE always). If a sequential plan detected future.seed = TRUE, a warning could be issued
  • If speed matters more than reproducibility, users can turn the latter off by setting future::plan(<plan>, future.seed = FALSE), which will take precedence over any settings downstream in any future_*apply() call
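The three scenarios above could be sketched as a single resolution rule. Everything here is hypothetical: `resolve_future_seed()` and its arguments are illustrative names only, and `plan()` currently has no `future.seed` argument.

```r
# Hypothetical resolution rule for the proposal; not an existing API.
#   plan_is_parallel: TRUE if the active plan runs futures in parallel
#   call_value:       future.seed as passed to future_*apply(), or NULL
#   plan_override:    future.seed as passed to plan(), or NULL
resolve_future_seed <- function(plan_is_parallel, call_value = NULL,
                                plan_override = NULL) {
  # 1. A user-level override on the plan wins over everything downstream.
  if (!is.null(plan_override)) return(plan_override)

  # 2. Sequential plans never pay for parallel RNG; warn if it was requested.
  if (!plan_is_parallel) {
    if (isTRUE(call_value)) {
      warning("future.seed = TRUE has no effect with a sequential plan")
    }
    return(FALSE)
  }

  # 3. Parallel plans honour an explicit value and default to TRUE otherwise.
  if (!is.null(call_value)) call_value else TRUE
}

resolve_future_seed(plan_is_parallel = TRUE)    # parallel, nothing set
resolve_future_seed(plan_is_parallel = FALSE)   # sequential, nothing set
resolve_future_seed(plan_is_parallel = TRUE,    # user override on the plan
                    call_value = TRUE, plan_override = FALSE)
```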

pat-s • Mar 18 '20 13:03

Some quick comments:

  1. The overall design objective is that futures should give the exact same results regardless of backend. I believe that is one of the core strengths of the Future API. It minimizes surprises and helps developers and users to focus on the task/analysis at hand without having to worry about various ifs and whats. This should also explain why I'm hesitant/conservative in introducing features to plan() where the user can potentially break the intention that the developer had in mind. Having said that, I'm constantly trying to figure out ways to allow for adjustments without breaking this objective.

  2. Regarding RNG in the map-reduce pattern: it is known that the current, very conservative approach that future.apply takes, which pre-generates an RNG seed for each element processed, is time consuming. This overhead can be ignored in very long-running tasks, but for quicker ones it becomes a show stopper. There is an open future.apply issue (https://github.com/HenrikBengtsson/future.apply/issues/20) which would open up for producing a single RNG seed per future. This would break perfect reproducibility, but would still be statistically sound. This is what parallel::mclapply() does by default. I believe this approach is safe to introduce, because it is in the control of the developer and not the user. This should solve your slowness issues.
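The trade-off in point 2 can be illustrated with base R's parallel package alone. This is a sketch of the idea, not future.apply's actual implementation: `stream_seeds()`, `draw_with_seed()` and `chunked()` are hypothetical helpers, and the chunk sizes are chosen arbitrarily.

```r
library(parallel)

# Note: this switches the session's RNG kind to L'Ecuyer-CMRG.
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
first <- .Random.seed

# Build n independent L'Ecuyer-CMRG stream seeds, starting from `first`.
stream_seeds <- function(n, first) {
  stopifnot(n >= 1)
  seeds <- vector("list", n)
  seeds[[1]] <- first
  for (i in seq_len(n)[-1]) seeds[[i]] <- nextRNGStream(seeds[[i - 1]])
  seeds
}

# Install a stream seed as the global RNG state, then draw n numbers.
draw_with_seed <- function(seed, n) {
  assign(".Random.seed", seed, envir = globalenv())
  rnorm(n)
}

# One seed per element: 8 seeds for 8 elements. Element i always uses
# stream i, so the result does not depend on how the work is chunked.
per_element <- vapply(stream_seeds(8, first), draw_with_seed,
                      numeric(1), n = 1)

# One seed per future: only n_chunks seeds are generated, but the draws
# now depend on how the 8 elements are split into chunks.
chunked <- function(n_chunks, n_total = 8) {
  seeds <- stream_seeds(n_chunks, first)
  unlist(lapply(seeds, draw_with_seed, n = n_total / n_chunks))
}
two_chunks  <- chunked(2)
four_chunks <- chunked(4)

identical(two_chunks, four_chunks)  # chunking changes the per-future draws
```

Every draw still comes from a proper, non-overlapping stream, so the per-future variant remains statistically sound. It gives up only the guarantee that the numbers are the same for every chunking.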

HenrikBengtsson • Mar 30 '20 19:03