bridgesampling Parallelize sampling for gen_samples and q22 with multiple repetitions

Sampling from a multivariate normal distribution can take a long time. When running bridge sampling with multiple repetitions, the sampling that constructs gen_samples and q22 currently runs serially, even if cores > 1. This pull request modifies bridge_sampler_normal.R and bridge_sampler_warp3.R to draw these samples in parallel when repetitions > 1 and cores > 1.
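A minimal sketch of the idea, not the code in the pull request: one set of multivariate normal draws per repetition, spread over a cluster when both cores and repetitions exceed 1. The names draw_mvn and sample_repetitions are hypothetical, and draw_mvn is a base-R stand-in for mvtnorm::rmvnorm so that the sketch has no package dependencies. RNG seeding across workers is not handled here.

```r
library(parallel)

# Hypothetical stand-in for mvtnorm::rmvnorm(): n draws from N(mu, Sigma)
# using a Cholesky factor, so only base R is needed.
draw_mvn <- function(n, mu, Sigma) {
  R <- chol(Sigma)                               # Sigma = t(R) %*% R
  z <- matrix(rnorm(n * length(mu)), nrow = n)   # n x d standard normals
  sweep(z %*% R, 2, mu, `+`)                     # add mu to each column
}

# Hypothetical helper: one matrix of draws per repetition.
sample_repetitions <- function(repetitions, n, mu, Sigma, cores = 1) {
  if (cores > 1 && repetitions > 1) {
    cl <- makeCluster(min(cores, repetitions))
    on.exit(stopCluster(cl), add = TRUE)
    # Each element of X becomes the n argument of draw_mvn on a worker.
    parLapply(cl, rep(n, repetitions), draw_mvn, mu = mu, Sigma = Sigma)
  } else {
    lapply(rep(n, repetitions), draw_mvn, mu = mu, Sigma = Sigma)
  }
}

# Usage: 4 repetitions of 1000 bivariate draws on 2 workers.
samples <- sample_repetitions(4, n = 1000, mu = c(0, 0), Sigma = diag(2), cores = 2)
```

A PSOCK cluster from makeCluster is used because it works on Windows as well as Unix-based systems; parallel::mclapply would avoid the cluster setup cost on Unix-based systems but cannot fork on Windows.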

Some preliminary testing indicated that this could save a lot of time when running several repetitions. I used a brm fit of a fairly complex model, with 20k samples/chain and 4 chains. Here's the model formula, in case it matters:

RT ~ Verb.Type.N * Sentence.Voice.N + me(s.voice.log.odds, se.log.odds) +
    (1 + Verb.Type.N * Sentence.Voice.N + me(s.voice.log.odds, se.log.odds) | Subject) +
    (1 + Verb.Type.N * Sentence.Voice.N + me(s.voice.log.odds, se.log.odds) | Stimulus.Number)

I measured how long it took to run 4 repetitions of the bridge sampler (method = "normal") using Sys.time(). The current serial implementation took 33.16 minutes, while the parallelized version took 14.36 minutes, a difference of 18.8 minutes. I don't have local resources available to run many more repetitions than that with a model of this size, but I imagine it might scale nicely, since there's a noticeable time savings even with only 4 repetitions.
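For reference, a small sketch of this kind of Sys.time() timing, together with the arithmetic on the two figures reported above. The helper time_minutes is hypothetical and only illustrates the measurement; the two timings are the ones quoted in this comment.

```r
# Hypothetical helper: wall-clock time of an expression, in minutes.
time_minutes <- function(expr) {
  start <- Sys.time()
  force(expr)   # the lazily passed expression is evaluated here
  as.numeric(difftime(Sys.time(), start, units = "mins"))
}

serial_min   <- 33.16   # reported serial run, 4 repetitions
parallel_min <- 14.36   # reported parallelized run, 4 repetitions

saved_min <- serial_min - parallel_min   # minutes saved
speedup   <- serial_min / parallel_min   # roughly 2.3x
```

A speedup of roughly 2.3x on 4 repetitions suggests that the draws are a large share of the run time for this model, though not all of it.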

I will note that I ran this on Windows using parallel::makeCluster rather than parallel::mclapply, after bypassing the check against running multicore on Windows. I'm therefore not sure whether the code works as is on Unix-based systems, and I don't have an estimate of the time savings there. Someone with a Unix-based system should do some testing if this change turns out to be of interest.

In addition, I don't know whether there is some potential concern about RNG seeding across the cores that might compromise the results; if so, then just eating the time cost for running multiple repetitions may be the way to go. If not though, parallelizing these draws might result in significant time savings for people running multiple repetitions.
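On the seeding question, one standard option in the parallel package is clusterSetRNGStream(), which gives each worker its own L'Ecuyer-CMRG random number stream. The sketch below shows only that mechanism; whether and how it should be wired into the bridge sampler is an open question, and the seed value is arbitrary.

```r
library(parallel)

cl <- makeCluster(2)

# Give each worker a separate L'Ecuyer-CMRG stream derived from one seed.
clusterSetRNGStream(cl, iseed = 123)
draws_a <- parLapply(cl, 1:4, function(i) rnorm(3))

# Resetting the streams with the same seed reproduces the same draws,
# as long as the number of workers and the task split are unchanged.
clusterSetRNGStream(cl, iseed = 123)
draws_b <- parLapply(cl, 1:4, function(i) rnorm(3))

stopCluster(cl)

same_draws <- identical(draws_a, draws_b)
```

Without a call like this, each worker seeds itself independently, so the draws are still random but a run cannot be reproduced from a single set.seed() call in the main session.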

Apr 12 '23 14:04 mawilson1234

Hi Michael,

Increasing the speed of bridgesampling by getting faster multivariate normal samples sounds like a good idea. Maybe an alternative approach that could work might be replacing the call to rmvnorm with a call to mvnfast. If the bottleneck is really sampling that might just solve it. IIRC, mvnfast supports multicore out of the box without the need to specify a cluster.

Might make sense to consider our options before committing to this specific addition.

Cheers, Henrik

Apr 12 '23 19:04 singmann

Hi Henrik,

I had tried using rmvn/dmvn from mvnfast as well, but it didn't seem to be helping (at least, with my setup). Trying to run it serially with the same setup was taking much longer than mvtnorm. Using the default behavior with mvtnorm took ~33 minutes (as above), but I had to manually quit when trying mvnfast after ~2.5 hours on the loop that gets gen_samples and q22. Not sure if it's just something about my setup that's causing issues; I did make sure to turn off BLAS parallelism as mentioned here.

I also see that the merge checks are failing with the error:

Error: Unable to resolve action `r-lib/actions@master`, unable to find version `master`.

Not sure what is causing that.

At any rate, it's not a big deal if you decide to take a different approach (or leave things as is, since they're working right now and I certainly wouldn't want to break anything accidentally). I had just been messing around with the code since I was trying to run several repetitions, and noticed that there might be some possible time savings in that area, so I figured I'd go ahead and put in the pull request.

Thanks for your response! Michael

Apr 12 '23 19:04 mawilson1234

Some further testing on a smaller model showed that (for 4 repetitions on 4 cores) the current implementation took 1.15 minutes, replacing the calls to mvtnorm with the corresponding calls to mvnfast took 1.17 minutes, and the changes in the pull request took 0.82 minutes. Though none of these are very long, this was a much smaller model than the previous one (10k samples * 4 chains instead of 40k samples * 4 chains previously), and it was much less complex, since the larger set of samples were coming from a measurement error model. (Though it's possible mvnfast could scale better, despite the apparent bottleneck I had run into before.)

However, I was running into errors saying that the variables rcppFile and "stanfit" were undefined when they were exported to the cluster. I got past them by setting varlist and rcppFile to NULL in the arguments to .bridge.sampler.normal, but I'm not sure what the implications of that might be.
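Errors of this kind can come up because parallel::clusterExport() looks up every name in varlist and fails if one is missing from the given environment. A hedged sketch of a guard that exports only the names that exist; export_existing is a hypothetical helper, not part of bridgesampling, and it has not been tried against .bridge.sampler.normal:

```r
library(parallel)

# Hypothetical helper: export only the names in varlist that can be found
# from envir, and return the names that were skipped.
export_existing <- function(cl, varlist, envir = parent.frame()) {
  force(envir)
  present <- Filter(function(v) exists(v, envir = envir), varlist)
  if (length(present) > 0) {
    clusterExport(cl, varlist = present, envir = envir)
  }
  invisible(setdiff(varlist, present))
}

# Usage: "a" exists in the environment, the second name does not.
e <- new.env()
assign("a", 1, envir = e)
cl <- makeCluster(1)
skipped <- export_existing(cl, c("a", "no_such_object_xyz"), envir = e)
a_on_worker <- clusterEvalQ(cl, a)[[1]]
stopCluster(cl)
```

Skipping a missing name avoids the error, but any worker code that actually needs that object would still fail later, so this is a diagnostic aid rather than a fix.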

Apr 12 '23 21:04 mawilson1234
