RANDOMNESS: L'Cuyer RNG streams instead of ad-hoc random seeds?
Both makeRegistry() and `makeExperimentRegistry() uses:
seed: [integer(1)] Start seed for jobs. Each job uses the (seed + job.id) as seed. Default is a random number in the range [1, .Machine$integer.max/2].
I'm not claiming to be an expert in RNGs, but I believe these type of ad hoc set.seed() approaches is not considered correct by those who do know more about this. Another way to put it is, if it would be this simple, why do RNG methods like the L'Cuyer RNG streams even exist? So, my best guess is that the current approach is likely to produce suboptimal random numbers and in worst case highly correlated random numbers.
I'd like to propose that you look at L'Cuyer RNG streams, which is implemented in the parallel package, meaning you don't need to add extra dependencies. The idea is that your seeds are generated as:
RNGkind("L'Ecuyer-CMRG")
if (!is.null(seed)) set.seed(seed) ## Set initial seed?
sample.int(n = 1L) ## Create `.Random.seed` in case missing
seed <- .Random.seed ## Initial L'Ecuyer-CMRG seed
seeds <- list()
for (kk in seq_along(jobs)) {
seed <- nextRNGStream(seed)
seeds[[kk]] <- seed
}
Then export the individual seeds[[job]] values to each job.
To improve on this and allow for nested generation of such seeds, I think one should do something like:
seeds <- list()
for (kk in seq_len(njobs)) {
seed <- nextRNGStream(seed)
seeds[[kk]] <- nextRNGSubStream(seed)
}
By using nextRNGSubStream(), my understanding is that process no 3, which receives seed seeds[[3]], can now in turn generate a new stream of seeds from seeds[[3]], without risking generating seeds seeds[4:njobs]. (PS. This is the approach I'm taking in developer's version of the future package.)
If you're not comfortable of making use of L'Cuyer RNG streams right now, may I propose to allow for the option of the user to control this, e.g. seed = myNextSeed where myNextSeed can be a function.
I'm not claiming to be an expert in RNGs, but I believe these type of ad hoc set.seed() approaches is not considered correct by those who do know more about this. Another way to put it is, if it would be this simple, why do RNG methods like the L'Cuyer RNG streams even exist? So, my best guess is that the current approach is likely to produce suboptimal random numbers and in worst case highly correlated random numbers.
Not an RNG export, neither. As far as I understand it, the current approach picks different states from the mersenne twister which has 2^19937 - 1 states (~4.3*10^6001). I don't know how R translates the integer values to a RNG state, this is buried somewhere in the C code. Assuming that the integers lead to states that are evenly distributed in the period, you would need to draw ~2*10^5992 / jobs numbers to get an overlap with the next stream. This is the best case analysis, of course. I guess L'Cuyer and other parallel RNGs exist so that you can get something which is guaranteed to not overlap, although it is very unlikely to happen if you use an ordinary RNG.
I'd like to propose that you look at L'Cuyer RNG streams, which is implemented in the parallel package, meaning you don't need to add extra dependencies. The idea is that your seeds are generated as: [...]
I like the idea, that would probably be the right thing to implement. However, I'm worried about the runtime if you work on large experiments. I've tried to pre-compute the streams for 1e6 jobs (which is not unlikely to have if you use BatchExperiments) and stopped after some minutes. This would slow down batchMap/addExperiments significantly.
A more reasonable approach would be to ask the job to get the i-th stream where i depends on the job number. Unfortunately, you need to create all other preceding streams to get to the right one. So job number 1e6 would need to compute all streams which again would take far too long. I've also looked into rlecluyer and rstreams, but I was unable to find a way to get the i-th stream efficiently.
Suggestions welcome, of course. Maybe I've overlooked something important.
If you're not comfortable of making use of L'Cuyer RNG streams right now, may I propose to allow for the option of the user to control this, e.g. seed = myNextSeed where myNextSeed can be a function.
This is something I'll look into. I guess this would be an extra argument for makeRegistry?
If you're not comfortable of making use of L'Cuyer RNG streams right now, may I propose to allow for the option of the user to control this, e.g. seed = myNextSeed where myNextSeed can be a function. This is something I'll look into. I guess this would be an extra argument for makeRegistry?
Is this really reasonable? I don't think so. My arguments goes like this:
- Either our current approach is OK or it is not.
- If it is OK, that new option is not useful (at least now)
- It it isn't OK, who do you think will use this new option? Especially as we 3 guys (me included) don't know what is best...
My argument is: the RNG handling needs to be foolproof and correct in its default behavior. That's important. Imagine a user using our defaults, then figuring out that his results are flawed. Would you think an answer ala "hey you could have avoided this problem by using this pretty technical extra option..." would be acceptable? I would be quite furious.
@mllg I am happy to talk about this, probably on hangout is best. We could also ask Uwe for a further opinion.
(EDITED typos)
If this is not critical for the JOSS, I'd like to postpone this. Okay with you @HenrikBengtsson ?
Certainly fine with me. This is a thing that shouldn't be rushed.
i have started to read up on this. and i am getting more and more worried. can we please talk about this? this concerns the validity of the results of everybody who computes with this.
https://www.johndcook.com/blog/2016/01/29/random-number-generator-seed-mistakes/
https://rpubs.com/Jouni_Helske/225931
we are currently testing this out, but i am quite sure we will see what helske saw after he said:
quote start: Okay, what if we just forget the whole random seeding, and use consecutive seeds for example based on the thread id? This is what John suggests in his blog, and which I have seen suggested elsewhere as well. Problem with this approach is that if the different seed values are of form seedi=a+bseedi−1seedi=a+bseedi−1, the generated samples are not necessarily completely independent. I am no expert in random number generation, but my understanding is that most if not all the test regarding the quality of the RNG algorithms are concerned what happens in a single stream of random numbers. Instead of diving into theory, let us make a simple test in R. quote end
i am running this now, which is the example in helske. and, naturally, i get the same results
load_all()
RNGkind("Marsaglia-Multicarry")
n = 10000
y1 = runif(n)
print(acf(y1))
makeRegistry(seed = 123L, file.dir = NA)
batchMap(function(x) {
RNGkind("Marsaglia-Multicarry")
runif(1)
}, 1:n)
submitJobs()
y2 = as.numeric(reduceResultsList())
print(acf(y2))
plots look really like in helske's link. i can see no worrysome auto-correlation for the mersenne EDIT: IN THIS VERY SIMPLE EXAMPLE. WHICH SAYS NOTHING ABOUT OTHER SITUATIONS! , which is the default in R, but for Marsaglia-Multicarry it looks very bad.
IMHO we really need to do something here, soon. This is completely intransparatrent to the user, and most people do not know enough about this, potentially including us. But nobody can debug this and discover potential problems if they appear on this front.
I'm using the L'Cuyer RNG stream approach, which, honestly, I use based on a fairly large portion of trust in how it's done in / chosen for the parallel package.
FYI, in R Journal Volume 8/1, August 2016, there's article 'CryptRndTest: An R Package for Testing the Cryptographic Randomness' [pdf] by Haydar Demirhan and Nihan Bitirim, which might be useful. Their CryptRndTest package is on CRAN. I haven't more than skimmed through that paper.
I'm using the L'Cuyer RNG stream approach, which, honestly, I use based on a fairly large portion of trust in how it's done in / chosen for the parallel package.
i looked at your code example above, again. thats what you are referring to. am I making a mistake, or are you doing EXACTLY the same as done in batchtools, with the only exception that you generate the seeds in a special way?
EDIT: how many typos can one fix in 3 lines of text .....?
(and i am pretty sure they have thought about that pretty hard for the parallel package, not distrusting that at all)
When iterating over elements x, you can do something like:
RNGkind("L'Ecuyer-CMRG")
seed <- .GlobalEnv$.Random.seed
for (ii in seq_along(x)) {
seed <- nextRNGStream(seed)
job.seed <- nextRNGSubStream(seed)
create_job_ii_with_seed(id = ii, x[[ii]], ..., seed = job.seed)
}
You don't want to just use job.seed <- seed, because then if job ii would call seed_ii_next <- nextRNGStream(seed_ii) in recursive calls, then seed_ii_next == seed_jj where jj = ii + 1 (== the seed for job ii+1). At least this is my interpretation on how to use nextRNGStream() and nextRNGSubStream().
I'm basically using this approach in my (upcoming) future_lapply() which you can find in https://github.com/HenrikBengtsson/future/blob/986eca/R/future_lapply.R, but it's a bit convoluted with other code too.
Forgot to say, I don't see any usage of nextRNGStream() in batchtools, which uses https://github.com/mllg/batchtools/blob/master/R/helpers.R#L145-L150
As far as I understood it, nextRNGStream() should be sufficient for the job seed. If a job needs to start its own parallelization on the slave, it's spawned threads/forks/workers are then supposed to use nextRNGSubStream().
I'll start a first implementation next week.
ok, very cool. we can surely test from Munich and help with checking (as this is somewhat subtle and complicated)
Modulo my lack of deep understanding the streams and substreams, I would say using nextRNGSubStream() removes the risk of correlated streams. It's not obvious to someone that they should call .Random.seed <- nextRNGSubStream(.Random.seed) the first thing they do in their job. Some also don't know that downstream packages / function call may rely on RNG. If missed, I think there's a risk for two jobs' RNG streams being correlated. This is why I'd say it's better to just do that automatically. However, it could be there's a downside to this approach that I'm not aware of.
I think there's a risk for two jobs' RNG streams being correlated
thats basically what i can show empirically with my code now, in a trivial case, see my post above.
This is why I'd say it's better to just do that automatically
and of course when cannot shift the responsibility of this onto the user. i am pretty sure in realistic setup, no major problems should have occurred so far, but as you are saying we REALLY want to get this right here, in a convenient fashion, for everybody
I'm not sure we're talking about the same problem here. Below is a minimal example showing the risk of using just nextRNGStream() versus nextRNGStream() + nextRNGSubstream().
library("parallel")
RNGkind("L'Ecuyer-CMRG")
set.seed(0xBEEF)
## A function that does NOT forward RNG state
## Imagine this lives in deep down in a package dependency.
calc_a <- function(kk) {
rnorm(2)
}
## A function that DOES forward RNG state
## Imagine this lives in deep down in a package dependency.
calc_b <- function(kk) {
if (kk == 1) {
.GlobalEnv$.Random.seed <- nextRNGStream(.GlobalEnv$.Random.seed)
}
rnorm(2)
}
seeds <- list()
seeds[[1]] <- .GlobalEnv$.Random.seed
seeds[[2]] <- nextRNGStream(seeds[[1]])
y <- lapply(1:2, FUN = function(kk) {
.GlobalEnv$.Random.seed <- seeds[[kk]]
calc_a(kk)
})
str(y)
## List of 2
## $ : num [1:2] 1.072 -0.202
## $ : num [1:2] 1.027 0.407
y2 <- lapply(1:2, FUN = function(kk) {
.GlobalEnv$.Random.seed <- seeds[[kk]]
calc_b(kk)
})
str(y2)
## List of 2
## $ : num [1:2] 1.027 0.407
## $ : num [1:2] 1.027 0.407
## The following is not good
stopifnot(identical(y2[[2]], y2[[1]]))
I'm still not convinced that using substreams per default is the right approach. I've looked into the parallel package to find out what mclapply does internally:
- In the first iteration, the package ensures that the RNG in initialized and set to L'Ecuyer, and uses the current state
- In each subsequent iteration, advances the stream with
nextRNGStream().
nextRNGSubStream() is not used at all. I'll investigate how other packages deal with this....
Bringing this back up as I stumbled across the default seeds last week (seed + job.id). I am not an expert on rng seeds, so I was wondering whether there is an update to this issue or recommended workflow for setting the seeds in batchtools?