cmdstanr icon indicating copy to clipboard operation
cmdstanr copied to clipboard

Unexpected non-reproducibility issue when setting initial conditions

Open andreashandel opened this issue 2 years ago • 7 comments

Describe the bug I'm running a model with user-specified initial conditions, and despite setting the random seed, things are not reproducible in the way I would expect them to be.

To Reproduce See the code example at the end. That code shows a full example of what works and what doesn't work (and how I can get it to work, but I don't understand why).

Expected behavior I would expect to get the same results for each of the 3 scenarios shown below.

Operating system Ubuntu 18.04

CmdStanR version number 0.5.2 (with CmdStan 2.30.1)

Code to reproduce issue:

library(cmdstanr)

# data
N = 80
dat <- list(ID = 1:N, C = runif(N,10,1000),Y = runif(N,10,100), Nobs = N)

# stan code
stan_model <- "
data {
  int<lower = 1> Nobs; //number of observations
  vector[Nobs] Y; //outcome of interest, here inhibition
  vector[Nobs] C; //concentration
}
parameters {
  real<lower = 0> sigma; //uncertainty
  real<lower = 0> m; //max effect level at high concentration
  real<lower = 0> k; //slope parameter
  real<lower = 0> h; //50 percent effect parameter
}
model{
  vector[Nobs] e; 
  m ~ lognormal(5, 0.2); 
  k ~ lognormal( 0, 0.5);
  h ~ lognormal( 5, 1);
  sigma ~ cauchy(0, 5);
  for ( i in 1:Nobs) {
      e[i] = m * C[i]^k / (h^k + C[i]^k); //main deterministic model
  }
  Y ~ normal(e, sigma); //distribution of outcome
}"
sfile <- write_stan_file(stan_model)

# settings for fitting
# initial values are specified as function 
# so a different sample will be drawn for each chain
fs1 = list(warmup = 400,
          sampling = 600, 
          chains = 2,
          cores  = 2,
          init = function(chain_id) list(m = runif(1,80,120), 
                                                  k = runif(1,1,2), 
                                                  h = runif(1,50,100), 
                                                  sigma =  runif(1,4,6)),
          seed = 1234)


#####################################################
# This setup below leads to reproducible results
# In this setup, cmdstanr/Stan is generating initial values.
#####################################################
stanmod <- cmdstanr::cmdstan_model(sfile, force_recompile=TRUE)

res1 <- stanmod$sample(data = dat,
                       chains = fs1$chains,
                       seed = fs1$seed,
                       parallel_chains = fs1$cores,
                       iter_warmup = fs1$warmup,
                       iter_sampling = fs1$sampling
)

res1a <- stanmod$sample(data = dat,
                        chains = fs1$chains,
                        seed = fs1$seed,
                        parallel_chains = fs1$cores,
                        iter_warmup = fs1$warmup,
                        iter_sampling = fs1$sampling
)
print(res1)
print(res1a)




#####################################################
# This setup below only differs by supplying initial values
# It should (in my opinion) produce the same outcomes for the 2 runs
# but it does not do that
#####################################################
stanmod <- cmdstanr::cmdstan_model(sfile, force_recompile=TRUE)
res1 <- stanmod$sample(data = dat,
                        chains = fs1$chains,
                        init = fs1$init,
                        seed = fs1$seed,
                        parallel_chains = fs1$cores,
                        iter_warmup = fs1$warmup,
                        iter_sampling = fs1$sampling
)
res1a <- stanmod$sample(data = dat,
                       chains = fs1$chains,
                       init = fs1$init,
                       seed = fs1$seed,
                       parallel_chains = fs1$cores,
                       iter_warmup = fs1$warmup,
                       iter_sampling = fs1$sampling
)
print(res1)
print(res1a)

#####################################################
# This setup below leads to reproducible results
# but only if the same seed is set before each call to cmdstan_model()
# otherwise results are not reproducible
# It is not clear to me why cmdstan_model() would care about the random seed
# and why I need to set the same seed and recompile the model each time
# to ensure reproducibility
#####################################################
set.seed(12345) 
stanmod <- cmdstanr::cmdstan_model(sfile, force_recompile=TRUE)
res1 <- stanmod$sample(data = dat,
                       chains = fs1$chains,
                       init = fs1$init,
                       seed = fs1$seed,
                       parallel_chains = fs1$cores,
                       iter_warmup = fs1$warmup,
                       iter_sampling = fs1$sampling
)
set.seed(12345) #reproducible only if this is turned on
stanmod1a <- cmdstanr::cmdstan_model(sfile, force_recompile=TRUE)
res1a <- stanmod1a$sample(data = dat,
                        chains = fs1$chains,
                        init = fs1$init,
                        seed = fs1$seed,
                        parallel_chains = fs1$cores,
                        iter_warmup = fs1$warmup,
                        iter_sampling = fs1$sampling
)
print(res1)
print(res1a)

andreashandel avatar Aug 16 '22 20:08 andreashandel

I think it's because we don't want to change the internal state of R by changing the seed value for a single run of a stan program. I recommend doing the set.seed(12345) version of your above. @rok-cesnovar maybe we could have an r_init_seed parameter that actually does do set.seed()? I'm personally fine with setting the seed outside of cmdstanr

SteveBronder avatar Aug 16 '22 23:08 SteveBronder

Thanks for the feedback. I'm still puzzled why the cmdstan_model() call would invoke the random number generator. At that stage, I haven't even specified yet if I plan to supply my own initial conditions (and if they are random or not), so this part is confusing to me. Could you explain (just in high-level terms) what happens "under the hood" that cmdstan_model() would need to generate random numbers? Thanks!

andreashandel avatar Aug 17 '22 11:08 andreashandel

The thing that happens is that cmdstan_model calls your init() function in order to generate the init values for all required chains.

The init function calls runif in 4 places.

function(chain_id) list(m = runif(1,80,120), 
                                                  k = runif(1,1,2), 
                                                  h = runif(1,50,100), 
                                                  sigma =  runif(1,4,6))

If you replace those with

For full reproducibility, you need to either generate those initial values beforehand and supply them to cmdstan_model or call set.seed(). You can try replacing the runif calls with something that does not require generating random numbers and you will see that the runs will be reproducible.

rok-cesnovar avatar Aug 17 '22 12:08 rok-cesnovar

I just noticed that your function doesn't really use the chain_id arg, so what you need is to just create a list of lists.

init_vals_1chain <- list(m = runif(1,80,120), k = runif(1,1,2), h = runif(1,50,100), sigma =  runif(1,4,6))
init_vals <- list(init_vals_1chain, init_vals_1chain)

and you then pass that init_vals to cmdstan_model(), assuming you wanted to init all chains to the same values.

rok-cesnovar avatar Aug 17 '22 12:08 rok-cesnovar

Thanks! If I change my code to what you suggested instead of using the function() approach for the inits - which is in fact how I coded it initially :) - it works.

I'm still a bit confused why the call to cmdstan_model() without the following call to $sample should already invoke the random generator, it seems when I do the initial cmdstan_model(), I have not told cmdstanr any information about the initial conditions, that only shows up in the $sample statement. Maybe that has to do with that - still mysterious to me - R6 object.

Specifically, I'm still confused that in this snippet of code, the random numbers differ (which means cmdstan_model() must have triggered some random numbers).

set.seed(12345) 
runif(5)
set.seed(12345) 
stanmod <- cmdstanr::cmdstan_model(sfile, force_recompile=TRUE)
runif(5)

andreashandel avatar Aug 17 '22 12:08 andreashandel

Cmdstan uses the rng for creating temporary files when compiling a model

SteveBronder avatar Aug 17 '22 14:08 SteveBronder

Ah ok, that explains this! I didn't expect it to use the random number generator. (Maybe worth considering resetting, or telling user that this happens. Was unexpected for me.) Anyways, my immediate issue is resolved, thanks for the help! (Feel free to close issue, just leaving it open in case you think addressing this behavior in the future is worth it).

andreashandel avatar Aug 17 '22 14:08 andreashandel