
Summary of issues with ODE Bayesian parameter estimation benchmarks

nsiccha opened this issue on Sep 06 '22 • 18 comments

For now, I'd take the current ODE Bayesian parameter estimation benchmarks offline and stop quoting them as up-to-date evidence that Julia is currently 3-5 times faster than Stan for ODE problems. If I knew Julia better I'd fix them, but I don't, so here are the issues with them that I currently know:

  • The Lorenz benchmark fails to sample from the correct posterior for either library
  • The Lorenz benchmark uses 500+1000 warmup+samples for Turing, but 1000+1000 for Stan
  • The Lotka-Volterra benchmark samples from two different posteriors (parameters = sigma1.1, sigma1.2, theta_1, theta_2, theta_3, theta_4 for Stan and parameters = theta[1], theta[2], theta[3], theta[4], σ[1] for Turing)
  • Neither benchmark makes it clear what Stan's actual sampling time is, because the time reported through @btime includes compilation
  • DynamicHMC only reports runtime and nothing else that is relevant
  • AFAIK, in the benchmarks Stan always uses adapt_delta=.85 (its default) while Turing uses adapt_delta=.65, which potentially lowers runtime but also ESS
  • Runtime isn't a good metric anyway; I would probably display per-parameter ESS/s histograms/boxplots over several runs with fewer (post-warm-up) samples
  • The interesting problems with ODE models tend to only come up with non-toy problems, but they also tend to take much longer to run, making it difficult (I assume) to include them in automatic benchmarks

nsiccha · Sep 06 '22 06:09

Also, any benchmark should probably include multiple chains run in parallel and combined to estimate mixing/ESS.
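For illustration, a minimal sketch of what that could look like with Turing.jl and MCMCChains.jl (the model below is a throwaway stand-in, not one of the benchmark models):

using Turing, MCMCChains

# Throwaway stand-in model, only so the snippet runs; the benchmarks would use
# the Lotka-Volterra / Lorenz models instead.
@model function toy()
    μ ~ Normal(0, 1)
    σ ~ truncated(Normal(0, 1), 0, Inf)
    1.5 ~ Normal(μ, σ)
end

# Run 4 chains on separate threads and combine them into one Chains object,
# so that R-hat and ESS are estimated across chains rather than from a single chain.
chains = sample(toy(), NUTS(0.65), MCMCThreads(), 1_000, 4)

# Per-parameter ESS and R-hat over the combined chains.
ess_rhat(chains)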

nsiccha · Sep 06 '22 06:09

We do not take benchmarks offline when someone does not like the results. That's not how open science works. They are open for everyone to edit. If you want to see a change, please edit them. As of right now, they are the best measurements we know of. If you don't like them, please submit a pull request. We have ongoing pull requests such as https://github.com/SciML/SciMLBenchmarks.jl/pull/495. We do all of our development in the open for everyone to see: taking down the benchmarks is tantamount to stopping development.

The point of open benchmarks is to keep improving them, not to hide them. We don't do that when we look good, and we do not do that when we look bad. We share as much information as possible about the accuracy with the timings so that people can make educated statements about the results given what they see. Given the remarks you have been able to make just by reading the results, it's clear that we have shared enough information for people to see the good and the flaws of the current status of these benchmarks. That means that it's doing its job of remarking on the current status sufficiently well, and if anything more should be added to further characterize the landscape, not less.

The Lorenz benchmark fails to sample from the correct posterior for either library

That sounds like good information to share online to everyone and to let people see if they can improve it. If there is no setup for these samplers to obtain this posterior for this relatively simple ODE, that's good information to have!

The Lorenz benchmark uses 500+1000 warmup+samples for Turing, but 1000+1000 for Stan

The Lotka-Volterra benchmark samples from two different posteriors (parameters = sigma1.1, sigma1.2, theta_1, theta_2, theta_3, theta_4 for Stan and parameters = theta[1], theta[2], theta[3], theta[4], σ[1] for Turing)

Thanks for pointing that out. Can you submit a PR?

Neither benchmark makes it clear what Stan's actual sampling time is, because the time reported through @btime includes compilation

We looked at that in detail. There were changes made to omit compile time:

  • https://github.com/SciML/DiffEqBayes.jl/pull/154
  • https://github.com/SciML/DiffEqBayes.jl/pull/164

Even without that, it's referenced in there that the compile times are around 10 seconds. For a 1000 second benchmark, I do not think the 10 seconds are the biggest deal. It's also not a huge deal for Turing either. I'm not sure why there is such a focus on the 1%.

DynamicHMC only reports runtime and nothing else that is relevant

That should improve, yes. If you have suggestions, please open a PR.

Runtime isn't a good metric anyways, I would probably display per parameter ESS/s histograms/boxplots over several runs with less (post warm-up) samples

Let's do both. That would be most informative. How would I make such ESS/s histograms? Could you share the code, or open a PR?

AFAIK, in the benchmarks Stan always uses adapt_delta=.85 (its default) while Turing uses "adapt_delta=.65", which potentially lowers runtime but also ESS

Thanks for noting this. How would we change this in Stan? It might be easiest to just change Turing.jl and DynamicHMC.jl, but it would be nice to make this explicit and try multiple values. Maybe this could help the Lorenz results as well.
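For reference, a sketch of where that setting lives on each side; on the Turing side it's just the NUTS constructor argument, while for Stan the raw cmdstan invocation is shown as a comment (file names are placeholders, and how best to pass this through the Julia Stan wrapper would need checking):

using Turing

# Turing side: the argument of NUTS is the target acceptance rate,
# i.e. what Stan calls adapt_delta; 0.85 here just mirrors the value quoted above.
sampler = NUTS(0.85)

# Stan side, at the cmdstan command line (shown for comparison):
#   ./lotka_volterra sample adapt delta=0.85 data file=lv_data.json
# From Julia this could be driven with run(...) or via whatever keyword the
# Stan wrapper package exposes for the adaptation target.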

The interesting problems with ODE models tend to only come up with non-toy problems, but they also tend to take much longer to run, making it difficult (I assume) to include them in automatic benchmarks

The benchmark computers have 32 cores and routinely run benchmarks that take upwards of 16 hours. There are plenty of other PDE benchmarks included in this setup, for example there's https://benchmarks.sciml.ai/html/MOLPDE/Filament.html, and then there's a stiff system with thousands of ODEs: https://benchmarks.sciml.ai/stable/Bio/BCR/ . There are big systems, and there are small systems. They use different methods. All matter in some way. If you only look at the ones on "toy" problems, then you will only see "toy" problems, but there are many more in this set if you click to the other pages.

That said, such "toy" problems are actually very similar to the PKPD models used to estimate dosing characteristics, like those used for the dose predictions for the COVID-19 vaccine. As such, doing big repeated estimations on small sets of ODEs is not something to scoff at; rather, it's a crucial part of many modern applications and should be optimized like any other (those PKPD estimations are routinely done on large datasets with many patients, so it may be <10 ODEs, but it's in some sense similar to doing tens of thousands of such estimations, so performance matters!)

ChrisRackauckas · Sep 06 '22 07:09

If I knew Julia better I'd fix them, but I don't, so here are the issues with them that I currently know:

That's fine. You can still contribute by sharing some Stan code. That's probably even more helpful since that's the part we know least about. If you know how to make Stan output the ESS/s, please share.

ChrisRackauckas · Sep 06 '22 07:09

You are of course free to do whatever you want, but I don't think it's reasonable to publish/keep online, link to and quote benchmarks which are known to have major and minor flaws and which tend to mislead non-experts. Even I had to rerun the code on my local machine and inspect the source code to understand what's going on, and I doubt many other people do that.

That being said, I don't know how to fix the benchmarks in Julia.

nsiccha · Sep 06 '22 07:09

These benchmarks are yours too. That is why they are free to be edited. What we need more than anything is probably more Stan code, so if you have Stan code for the things you're asking for then please share.

ChrisRackauckas · Sep 06 '22 08:09

Hopefully it should be straightforward to change adapt_delta in Stan. See below. #258 https://github.com/SciML/SciMLBenchmarks.jl/pull/497

EvoArt · Sep 06 '22 08:09

With:

  • https://github.com/SciML/SciMLBenchmarks.jl/pull/495
  • https://github.com/SciML/SciMLBenchmarks.jl/pull/497
  • https://github.com/SciML/SciMLBenchmarks.jl/pull/499

building in https://buildkite.com/julialang/scimlbenchmarks-dot-jl/builds/965, big chunks of this are handled. What I'm really missing are ways to get the ESS/s and do direct gradient timings. The Turing devs (@devmotion) can probably do the Turing side quite easily, though we might need some help on the Stan code.
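As a rough illustration of the shape of a direct gradient timing (on a stand-in log density, not the benchmark posteriors):

using BenchmarkTools, ForwardDiff

# Stand-in log density; the real comparison would time the gradient of the
# actual Turing/Stan log densities of the benchmark models.
logdensity(θ) = -0.5 * sum(abs2, θ)

θ = randn(6)
@btime ForwardDiff.gradient($logdensity, $θ)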

ChrisRackauckas · Sep 06 '22 18:09

For gradient comparisons one could use or build on https://github.com/torfjelde/TuringBenchmarking.jl (there was some discussion about moving it to TuringLang).

ess_rhat in MCMCChains already computes ESS/s if the computation time was tracked during sampling (which is the default in Turing): https://beta.turing.ml/MCMCChains.jl/dev/diagnostics/#MCMCDiagnosticTools.ess_rhat-Tuple{Chains} Since there is no completely standardized way to estimate ESS (MCMCDiagnosticTools includes three different algorithms which can sometimes yield quite different estimates, and they differ a bit from the algorithm in Stan, which we also want to add and which can yield yet another estimate), I think the fairest comparisons would be achieved by using exactly the same algorithm for all sampling algorithms. That could be achieved, e.g., if StanSample.jl tracked the computation time and saved it in the MCMCChains.Chains object when constructing it.
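To make that concrete, a minimal sketch, assuming chains is an MCMCChains.Chains object whose info carries the wall-clock sampling time (as Turing records by default; see the multi-chain sketch above):

# The summary produced by ess_rhat then includes an ESS/s column alongside
# ESS and R-hat:
ess_rhat(chains)

# A per-parameter ESS/s histogram could be built from that table, e.g.
# (assuming the summary converts to a DataFrame via its Tables.jl interface
# and that the column is named ess_per_sec):
# using DataFrames, StatsPlots
# df = DataFrame(ess_rhat(chains))
# histogram(df.ess_per_sec)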

devmotion · Sep 06 '22 18:09

The Lotka-Volterra benchmark samples from two different posteriors (parameters = sigma1.1, sigma1.2, theta_1, theta_2, theta_3, theta_4 for Stan and parameters = theta[1], theta[2], theta[3], theta[4], σ[1] for Turing)

@devmotion can you help with sorting this one out? Once that's done, we can close this since all that's left is a duplicate of https://github.com/SciML/SciMLBenchmarks.jl/issues/494 if I haven't missed anything.

We should also make DynamicHMC print out more, though I'm not sure how to do that.

For gradient comparisons one could use or build on https://github.com/torfjelde/TuringBenchmarking.jl (there was some discussion about moving it to TuringLang).

Cool! That would be good to include.

ChrisRackauckas · Sep 06 '22 18:09

I think the fairest comparisons would be achieved by using exactly the same algorithm for all sampling algorithms.

Yeah, that's what should be done.

What we need more than anything is probably more Stan code, so if you have Stan code for the things you're asking for then please share.

I think the Stan code/model is fine (where it defines the same posterior as the Turing code). I think what can be changed/improved is passing the right compile flags to Stan, but I wouldn't know how to do that from Julia.

Flags that seem to work and improve performance a bit (but not a lot) are:

STANCFLAGS+=--O1
STAN_CPP_OPTIMS=true
CXXFLAGS+= -O3

which should go into cmdstan's make/local. But I have no idea how this works using Julia and on the benchmarking machines :shrug:
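One possible route, sketched below: if the benchmark machine builds cmdstan from a local checkout (here assumed to be pointed to by a CMDSTAN environment variable, which is an assumption about the setup), the flags could be written to make/local from Julia before rebuilding:

cmdstan_home = ENV["CMDSTAN"]  # assumed location of the cmdstan checkout
open(joinpath(cmdstan_home, "make", "local"), "w") do io
    write(io, """
    STANCFLAGS+=--O1
    STAN_CPP_OPTIMS=true
    CXXFLAGS+= -O3
    """)
end
# followed by rebuilding cmdstan (e.g. `make clean-all && make build` in that
# directory) so the flags take effect for subsequently compiled models.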

The other thing that should be improved is postprocessing/reporting of results, but as @devmotion explained, there are several options to estimate ESS, and I simply don't know the Julia ecosystem well enough to know which implementation to use or how to use it.

nsiccha · Sep 07 '22 07:09

I think what can be changed/improved is passing the right compile flags to Stan.

IMO that would not make the benchmark fairer. I think the correct thing to do is to benchmark the Julia packages with the default compile flags (they are not optimized or tuned for the benchmarks) against Stan with the default compile flags. Of course, alternatively one could try to hyperoptimize everything as much as possible - but that's less realistic as users will just go with the default flags, I assume. If Stan's default compile flags are suboptimal they should be changed by Stan, but I don't think it's fair to tune them for this benchmark.

but as @devmotion explained, there are several options to estimate ESS, and I simply don't know the Julia ecosystem well enough to know which implementation to use or how to use it.

As I wrote, since StanSample.jl already reports results in the same output format as Turing (and other packages) and this format already supports ESS/s, the only missing piece is that StanSample.jl has to include the computation time when constructing its output. Then estimates would be based on the same algorithm and comparisons would be fair.
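A sketch of that missing piece (not StanSample.jl's actual code; the start_time/stop_time info fields are copied from what Turing-produced chains appear to carry and should be double-checked):

using MCMCChains

start = time()
draws = randn(1_000, 5, 4)  # stand-in for draws read back from Stan's CSV output
stop = time()

chn = Chains(draws, ["theta[1]", "theta[2]", "theta[3]", "theta[4]", "sigma[1]"])
# Attach the wall-clock sampling time so that ess_rhat can report ESS/s
# the same way it does for Turing chains.
chn = setinfo(chn, (start_time = start, stop_time = stop))
ess_rhat(chn)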

devmotion · Sep 07 '22 07:09

IMO that would not make the benchmark fairer.

I disagree. In my experience, people fitting ODE models generally care about performance, and this includes changing some easily accessible compiler flags for Stan.

Similarly, you could just let Julia fit the model that "the average user" would code up, but this would presumably be much slower. Likewise, disabling logging speeds up fitting considerably; should you do that for the benchmark or not?

Either of the choices for Julia will have a bigger impact on performance than enabling the "standard" Stan performance flags.

nsiccha · Sep 07 '22 08:09

FWIW, I tend to agree with @nsiccha. Playing with compiler flags seems to be the norm for aggressively optimised Stan models. Since Turing models are being optimised by actual Turing devs, this doesn't seem unreasonable.

EvoArt · Sep 07 '22 09:09

Surely the models (regardless of whether it's Stan or Turing or ...) should be optimized, and any help for optimizing the Stan models is very much appreciated (BTW I don't think the Turing models are completely optimized, definitely not if you're using DiffEqBayes - but that's a different topic). But there's no special compiler optimization or tricks going on in the Turing side - we just benchmark the package as it is and as users would install it. Similarly, maybe for Stan one should just install cmdstan via Conda? It seems that's what users on Windows, e.g., are supposed to do anyway: https://mc-stan.org/docs/cmdstan-guide/cmdstan-installation.html

devmotion · Sep 07 '22 09:09

Our benchmark rules are: https://github.com/SciML/SciMLBenchmarks.jl#rules-optimal-fair-and-reproducible

These benchmarks are meant to represent good optimized coding style. Benchmarks are preferred to be run on the provided open benchmarking hardware for full reproducibility (though in some cases, such as with language barriers, this can be difficult). Each benchmark is documented with the compute devices used along with package versions for necessary reproduction. These benchmarks attempt to measure in terms of work-precision efficiency, either timing with an approximately matching error or building work-precision diagrams for direct comparison of speed at given error tolerances.

So I think it's fair to use the better compiler flags. We try to set up everyone's "optimization tricks" wherever possible, but then also document their effects. So it would be good, for example, if we could have a version with the special compiler flags and one with the defaults: that would demonstrate the amount of performance improvement due to the flags. It wouldn't be too hard to just have two Stan binaries built and swap the call?

We normally do this kind of gradation for everything. Stan is a bit of a PITA here since it's an order of magnitude harder to install than other packages, but now that we have a basis to work from it shouldn't be too bad.

BTW I don't think the Turing models are completely optimized, definitely not if you're using DiffEqBayes - but that's a different topic

They are definitely not, which is why I added the direct version so they can start being optimized. They are definitely not close to optimized yet though.

ChrisRackauckas · Sep 07 '22 16:09

So it would be good for example if we could have a version with special compiler flags and the default

Yeah, I think it would also be a good idea to show Julia implementations which are optimized to different degrees, just so that people know what's possible and how big of an impact each step might have. But maybe this only encourages people to implement their model suboptimally...

nsiccha · Sep 07 '22 17:09

I'm guessing it can also happen that some "optimizations" can even have an adverse effect in some cases, which would also be interesting to see.

nsiccha · Sep 07 '22 17:09

I'm guessing it can also happen that some "optimizations" can even have an adverse effect in some cases, which would also be interesting to see.

Indeed, like how in https://benchmarks.sciml.ai/dev/MultiLanguage/ode_wrapper_packages/#Stiff-Problem-2:-HIRES static arrays are worse than using in-place, even though the ODE is small. That's why we just try to show it all: the main use case of the benchmarks is internal, to help find performance regressions and figure out what the optimal way to do things is so we can change the tutorials accordingly. For example, the current result of these Bayesian benchmarks has been to speed up the rate at which we deprecate the Turing wrapper, which is being made clear through these results.

But maybe this only encourages people to implent their model suboptimally...

This is a completely separate repo from the docs and tutorials, so hopefully people aren't using it as a tutorial for anything other than how to choose methods.

ChrisRackauckas · Sep 07 '22 18:09