
Likelihood parallelisation for SHAW hydrological model

Open florianhartig opened this issue 4 years ago • 9 comments

Question from a user:

I'm aiming to use BayesianTools for a hydrologic model parameterization. I have the MCMC setup implemented with my model (the SHAW model), and I'd like to run it in parallel. SHAW is a Fortran executable with file input/output, so I think I need to use parallel = "external". I've been struggling with this, so I tried to modify the VSEM example to use parallel = "external" to help me understand it. When I do so, I get a rejection rate of 1 and an error that makes me think the likelihood function isn't actually being passed a matrix by the runMCMC() function.

I'm sharing my code for the VSEM example; if you have any advice about why it doesn't work, I would really appreciate it, and I think it would help me get my own model running in parallel. Alternatively, if you would prefer that I post this as an issue on GitHub, or if there's other documentation you think I should be looking at, I'll take any advice you have!

# Running VSEM example in parallel. 
library(BayesianTools)
library(furrr)
library(future) # provides plan(); attached by furrr anyway, but being explicit doesn't hurt


# Create input data for the model
PAR <- VSEMcreatePAR(1:1000)
plot(PAR, main = "PAR (driving the model)", xlab = "Day")

# load reference parameter definition (upper, lower prior)
refPars <- VSEMgetDefaults()
# this adds one additional parameter for the likelihood standard deviation (see below)
refPars[12,] <- c(2, 0.1, 4) 
rownames(refPars)[12] <- "error-sd"
head(refPars)

# create some simulated test data 
# generally recommended to start with simulated data before moving to real data
referenceData <- VSEM(refPars$best[1:11], PAR) # model predictions with reference parameters  
referenceData[,1] = 1000 * referenceData[,1]  # is this just units? (AM)
# this adds the error - needs to conform to the error definition in the likelihood
obs <- referenceData + rnorm(length(referenceData), sd = refPars$best[12]) # fake observations.
oldpar <- par(mfrow = c(2,2))
for (i in 1:4) plotTimeSeries(observed = obs[,i], 
                              predicted = referenceData[,i], 
                              main = colnames(referenceData)[i]) # plot "observed" and predicted. 

# Best to program in a way that we can choose easily which parameters to calibrate
parSel = c(1:6, 12)

# here is the parallel likelihood - par should be a matrix with rows = cases; columns = parameters. 
likelihood_parallel <- function(par, sum = TRUE){
  print(class(par))
  par2 <- asplit(par, 1) # list of parameter sets
  
  plan(multisession)
  ans1 <- future_map_dbl(par2, function(par1){
    # set parameters that are not calibrated on default values 
    x = refPars$best # full set of parameters
    x[parSel] = par1 # parameters to be calibrated. 
    predicted <- VSEM(x[1:11], PAR) # replace here VSEM with your model 
    predicted[,1] = 1000 * predicted[,1] # this is just rescaling
    diff <- c(predicted[,1:4] - obs[,1:4]) # difference between observed and predicted (model residuals)
    
    # univariate normal likelihood. Note that there is a parameter involved here that is fit (x[12])
    llValues <- dnorm(diff, sd = x[12], log = TRUE)  # dnorm gives the density...
    if (sum == FALSE) return(llValues)
    else return(sum(llValues))
  }) # end future map
  
  return(ans1) # return to likelihood_parallel function
} # end  likelihood function

# optional, you can also directly provide lower, upper in the createBayesianSetup, see help
prior <- createUniformPrior(lower = refPars$lower[parSel], 
                            upper = refPars$upper[parSel], 
                            best = refPars$best[parSel])

bayesianSetup <- createBayesianSetup(likelihood_parallel, 
                                     prior, 
                                     names = rownames(refPars)[parSel],
                                     parallel =  "external")

# settings for the sampler, iterations should be increased for a real application
settings <- list(iterations = 10, nrChains = 2, message = T)

out <- runMCMC(bayesianSetup = bayesianSetup, sampler = "DEzs", settings = settings)

plot(out)
summary(out)
marginalPlot(out)
gelmanDiagnostics(out) # should be below 1.05 for all parameters to demonstrate convergence 

florianhartig avatar May 12 '20 16:05 florianhartig

Hi,

OK, first a few comments very similar to #201:

  • Do you have a non-parallel likelihood written for your model? If not, I would definitely start with that before parallelising.
  • Note the comments in our vignette on how to interface an existing model with R / BayesianTools: https://cran.r-project.org/web/packages/BayesianTools/vignettes/InterfacingAModel.html
  • Yes, if you interface via file I/O, you will probably have to program your own (= external) parallelisation, unless (very unlikely) your file I/O is thread-safe

In general, before I would invest the time in parallelising, I would consider whether this is really necessary or helpful. Note that you will usually run at least 3 independent MCMCs for an analysis, and often people want to run several analyses in parallel. All those runs could be started in independent file directories by independent R sessions on a computer / cluster, without any coding effort. Parallelisation of the likelihood is only useful if you need speedup beyond that and have spare cores available. I should add some notes on this in the vignette.

I have to look at your example in more detail. The likelihood makes sense in principle; I don't know why it doesn't work. Note, though, that this wouldn't work for SHAW, because you have to make sure that the workers don't write to the same file. This is usually done by having a different folder for each worker, which I think is easier to achieve via a traditional (cluster) parallelisation than via future.
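
To illustrate what I mean by a different folder per worker, here is a minimal sketch with a traditional socket cluster (the directory scheme and the use of tempdir() are just placeholders, not a recommendation for your specific setup):

library(parallel)

cl <- makeCluster(3) # e.g. one core per internal DEzs chain

# give every worker its own working directory, keyed by its process id,
# so that file I/O from different workers can never collide
clusterEvalQ(cl, {
  workerDir <- file.path(tempdir(), paste0("worker_", Sys.getpid()))
  dir.create(workerDir, recursive = TRUE, showWarnings = FALSE)
  setwd(workerDir)
})

# ... distribute likelihood evaluations over cl with parLapply / parApply ...

stopCluster(cl)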

florianhartig avatar May 12 '20 17:05 florianhartig

Hi Florian,

Thanks for the comments!

I do have a non-parallel likelihood written that seems to work okay. I have HPC access and the model runs are fairly slow (e.g., ~3 min per run), so I have pretty high incentive to parallelize if possible.

The likelihood function I've written is an R wrapper around the actual model. To get around the I/O issue, I've set it up to do the following (see the sketch after this list):

  1. Copy the model files from a main model directory to a new, temporary directory, named after the parameter set passed to the wrapper.
  2. Update the parameter input file in this temporary directory.
  3. Use system commands to run the model in that temporary directory.
  4. Read the outputs from that directory and calculate a likelihood.
  5. Delete the temporary directory.
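
Schematically, the wrapper looks roughly like this (a sketch only; all file names, the executable call, and the output-reading step are placeholders rather than my actual code):

shaw_likelihood <- function(par) {
  # 1. copy the model files to a temporary directory named after the parameters
  runDir <- file.path(tempdir(), paste(signif(par, 6), collapse = "_"))
  dir.create(runDir, showWarnings = FALSE)
  file.copy(list.files("mainModelDir", full.names = TRUE), runDir)
  
  # 2. update the parameter input file in the temporary directory
  writeLines(as.character(par), file.path(runDir, "params.txt"))
  
  # 3. run the model executable in that directory via a system call
  system(paste("cd", runDir, "&& ./shaw.exe"))
  
  # 4. read the outputs and calculate a log-likelihood
  predicted <- read.table(file.path(runDir, "output.txt"))
  ll <- sum(dnorm(obs - predicted[, 1], sd = par[length(par)], log = TRUE))
  
  # 5. delete the temporary directory
  unlink(runDir, recursive = TRUE)
  return(ll)
}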

I've used this method successfully with a manually coded GLUE-style sampler. I tried this with parallel = T and it seems to have run for several hours, then errored out. Do you have any feedback on whether this approach seems reasonable? Maybe this would work if I switch to a traditional cluster parallelization and write it with parallel = "external", as you suggested. To be honest, I don't have a great understanding of what's going on under the hood with different parallelization methods.

When you describe 3 independent MCMCs, you mean 3 chains, correct? As opposed to using the results of one to define starting points of the others, or something? I agree, a poor-woman's parallelization as you described could work well for multiple chains.

I've looked at the documentation you linked, but I'll spend a little more time with it. And I don't know if you should spend too much time looking at my VSEM example. One thing I realized: the documentation says that when parallel = "external", the likelihood function should accept a matrix or a vector; I think the function I wrote in the VSEM example only accepts a matrix.

Thank you again for the help and for developing BT! -Adrienne

adrienne-marshall avatar May 12 '20 23:05 adrienne-marshall

Hi Adrienne,

about your likelihood: this sounds reasonable; it is roughly what I have always done when dealing with models with file I/O.

From there, note the following steps:

  1. If the single likelihood is programmed correctly, it should not be too much work to put it into a parallel loop, and then you should set parallel = "external". If you set parallel = T instead, BT will wrap the likelihood in a parallel loop itself, but for this the likelihood has to be thread-safe, and again, I'm not sure how you can prevent your workers from randomly accessing the same file location. If you're absolutely sure the likelihood is thread-safe, you can also use parallel = T. In general, you should maybe consider whether it's worth getting rid of the file output (which is the problem; file input is fine) and instead returning results directly by coupling the Fortran library to R. See an example here: https://github.com/trotsiuk/r3PG. This may be a bit more work in the beginning, but it is faster and more stable in the long run, and it allows you to simply set parallel = T.

  2. About the comments on several jobs: note that if you set nrChains = 3, the three MCMCs will still be calculated sequentially. BT uses parallelisation only MCMC-internally. The rationale for us was that doing both parallelisations is technically challenging, and we thought the outer parallelisation could always be done by the user by hand if needed. So if you run the DEzs / DREAMzs algorithms, for example, each MCMC internally runs 3 chains, and those are parallelized. But if you run nrChains = 3 (which I would recommend), I would definitely either start 3 jobs independently and then join them with createMcmcSamplerList (see the sketch on joining runs below), or (a bit hacky) you could wrap this in another parallelisation (in R it is possible, though not officially recommended, to create nested socket clusters; this will probably not work with MPI, though).

  3. About your example: yes, I noted the issue with the vector and fixed it (see the sketch below), but it still didn't work. Still, I think it's more productive to move forward with your example.
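
For reference, this is roughly what I mean by accepting both input types (a sketch along those lines, not a tested fix; as said, the example still failed for me even with this change):

# note: plan(multisession) should be set once, outside the likelihood,
# so the workers are not re-created on every call
likelihood_parallel <- function(par) {
  if (is.null(dim(par))) par <- matrix(par, nrow = 1) # promote a single parameter set
  parList <- asplit(par, 1) # list of parameter sets, one per row
  future_map_dbl(parList, function(par1) {
    x <- refPars$best # full set of parameters
    x[parSel] <- par1 # overwrite the calibrated ones
    predicted <- VSEM(x[1:11], PAR)
    predicted[, 1] <- 1000 * predicted[, 1] # rescaling, as in the example
    diff <- c(predicted[, 1:4] - obs[, 1:4])
    sum(dnorm(diff, sd = x[12], log = TRUE))
  })
}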
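
And on joining runs: if you start the 3 jobs independently (e.g. as separate R sessions on a cluster) and save each result, combining them for joint diagnostics would look roughly like this (file names are placeholders):

out1 <- readRDS("run1.rds")
out2 <- readRDS("run2.rds")
out3 <- readRDS("run3.rds")
outAll <- createMcmcSamplerList(list(out1, out2, out3))
gelmanDiagnostics(outAll) # convergence check across the independent runs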

florianhartig avatar May 13 '20 08:05 florianhartig

p.s. thanks for your questions - I realize the documentation on this is not ideal and I should update it, but it's been hard to find the time lately!

florianhartig avatar May 13 '20 08:05 florianhartig

Hi Florian,

Thanks for the comments. I think the documentation is pretty good - I know it must be a lot of work to keep this package up!

I know you've mentioned at a few points in the documentation that parallel = "external" tends to be error-prone; do you have any recommendations about what types of issues tend to create errors with parallel = "external"? And good point about coupling the Fortran library to R; that's a bit beyond my current knowledge, but I'll take a look at r3PG and think about it.

My method to avoid multiple threads writing to the same file is that each model iteration writes to a temporary directory, named after the current set of parameters. I think this should be thread-safe, unless multiple iterations have the exact same set of parameters. Do you know if that's likely/possible with BT?
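
(One way to side-step the duplicate-parameter question entirely might be to also include the worker's process id in the directory name; a sketch of what I have in mind:)

# sketch: directory name built from the parameter values plus the worker's
# process id, so identical parameter sets on different workers can't collide
makeRunDir <- function(par) {
  name <- paste0("run_", paste(signif(par, 8), collapse = "_"), "_", Sys.getpid())
  runDir <- file.path(tempdir(), name)
  dir.create(runDir, showWarnings = FALSE)
  return(runDir)
}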

Based on your comments here, I'll work on getting these running in parallel (likely with parallel = "external") and let you know how it goes or if I have more questions.

Thanks again very much for the help!

adrienne-marshall avatar May 14 '20 20:05 adrienne-marshall

Have you considered using PEST? In my opinion, MCMC is more suitable for fast, low-dimensional models. PEST was specifically designed for larger, slower models with large numbers of parameters and file I/O, and it also supports parallelisation via clusters.

woodwards avatar May 14 '20 20:05 woodwards

I have only minimally looked into using PEST, but I wondered if it might be a good idea. I have my model runs down to about 1 min each by adjusting some modeling choices, and I'm only varying 10 parameters. I feel like I'm close enough to having this working in R with BT that I'd like to stick with that if possible (currently running; seems to be working with parallel = T). I'll look into PEST more, though - parallelisation via cluster would certainly be nice.

Florian, I apologize if this is somewhere in the documentation but I haven't been able to find it - do you have any information about how many cores the algorithms that support parallel computing can make use of? I'm currently using DEzs.

Thanks again!

adrienne-marshall avatar May 20 '20 20:05 adrienne-marshall

Hi Adrienne,

the algorithms can generally use as many cores as you have internal chains; for DEzs, the default is 3!

About your other points: yes, if you create your directories according to the parameters, it should be reasonably thread-safe; the chance of having the same parameters twice seems slim to me. In this case, you should indeed be able to use parallel = T (as you have noted).

About your question of why we say that parallel = "external" is error-prone: there is no particular reason, except for our observation that users tend to make programming errors when trying to code the parallelisation themselves ;)

florianhartig avatar May 20 '20 20:05 florianhartig

I see - thanks!

And good to know about parallel = "external". I have certainly made some coding errors setting this up myself. It's helpful to know that it's likely user error, though - that actually makes me feel more empowered to figure out where I'm going wrong.

For now, this seems to be running okay with parallel = T; we'll see if the run completes.


adrienne-marshall avatar May 20 '20 21:05 adrienne-marshall