Optional profiling
Profiling is a great tool for debugging and performance optimization, but it also adds a bit of overhead to the execution of the Stan program. So it would be nice to be able to skip the profiling statements whenever you don't need them and want the model to run as fast as possible - without having to manually remove them from the Stan program or keep two versions of the same program around (one with and one without profiling statements).
Ideal solution
Ideally, there would be an additional profile = FALSE/TRUE argument to the sample function of a compiled Stan model that would be passed on to CmdStan.
Workaround
As a workaround, one can automatically create a temporary copy of the Stan program with all profiling statements removed, compile this temporary copy, and sample from this model without profiling. This requires no changes to CmdStan. The disadvantage is, of course, that the Stan model has to be compiled a second time.
I have implemented this workaround as a wrapper around cmdstan_model in this repo. We have been using it during the development work for epinowcast and it has been quite robust. It also supports included files. It would also be possible to use this workaround to implement a profile = FALSE/TRUE argument in the sample function.
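To illustrate the stripping step (the wrapper in the repo is R code around cmdstan_model; the following is only a self-contained C++ sketch of the idea, which ignores `profile` tokens inside comments or string literals and assumes the profile name contains no closing parenthesis):

```cpp
// Sketch only: remove `profile("...") { ... }` wrappers from a Stan file
// while keeping the wrapped statements.
#include <cctype>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::string strip_profile_blocks(const std::string& src) {
  std::string out;
  std::vector<bool> drop_closing;  // one flag per currently open '{'
  std::size_t i = 0;
  while (i < src.size()) {
    // Try to match `profile ( ... ) {` starting at position i.
    if (src.compare(i, 7, "profile") == 0 &&
        (i == 0 || !(std::isalnum(static_cast<unsigned char>(src[i - 1]))
                     || src[i - 1] == '_'))) {
      std::size_t j = i + 7;
      while (j < src.size() && std::isspace(static_cast<unsigned char>(src[j]))) ++j;
      if (j < src.size() && src[j] == '(') {
        std::size_t close = src.find(')', j);
        if (close != std::string::npos) {
          std::size_t k = close + 1;
          while (k < src.size() && std::isspace(static_cast<unsigned char>(src[k]))) ++k;
          if (k < src.size() && src[k] == '{') {
            drop_closing.push_back(true);  // drop the matching '}' later
            i = k + 1;                     // skip the whole profile header
            continue;
          }
        }
      }
    }
    if (src[i] == '{') drop_closing.push_back(false);
    if (src[i] == '}' && !drop_closing.empty()) {
      bool drop = drop_closing.back();
      drop_closing.pop_back();
      if (drop) { ++i; continue; }
    }
    out += src[i++];
  }
  return out;
}

int main(int argc, char** argv) {
  if (argc != 3) {
    std::cerr << "usage: strip_profile <in.stan> <out.stan>\n";
    return 1;
  }
  std::ifstream in(argv[1]);
  std::stringstream buf;
  buf << in.rdbuf();
  std::ofstream(argv[2]) << strip_profile_blocks(buf.str());
  return 0;
}
```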
Context
Regarding the ideal solution of integrating this functionality with CmdStan, I lack the expertise to judge how difficult that would be and probably could not do it myself. If people think a workaround like the above is acceptable, I'm happy to help implement it for cmdstanr, but would need some guidance (and a few details still need to be sorted out).
As you've identified, this is a change that would have to be made in cmdstan (but I also see the benefit!). I'll transfer this issue to the cmdstan repo so that you can get feedback from the right people!
Hi, @adrian-lison. Is there a performance hit in Stan runs that don't include profiling statements, and if so, how much? If it's beyond a couple percent, then I'd be in favor of only conditionally including anything to do with profiling.
@bob-carpenter profiling is done with an object that uses its constructor and destructor to register some callbacks on the autodiff stack for timing, so it should be zero cost if you don't use it (minus the cost of creating an empty std::map at model class initialization).
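Roughly, the pattern is this (a minimal sketch with illustrative names, not the actual stan::math profile implementation, which also hooks the autodiff stack so reverse-mode sweeps are timed):

```cpp
#include <chrono>
#include <iostream>
#include <map>
#include <string>

// Illustrative RAII timer: the constructor marks the start of the profiled
// scope and the destructor accumulates the elapsed time into a map keyed by
// profile name. The cost is paid once per construction/destruction, i.e.
// once per pass through a profile block.
using profile_map = std::map<std::string, double>;

class scoped_profile {
 public:
  scoped_profile(const std::string& name, profile_map& profiles)
      : name_(name), profiles_(profiles),
        start_(std::chrono::steady_clock::now()) {}
  ~scoped_profile() {
    std::chrono::duration<double> elapsed
        = std::chrono::steady_clock::now() - start_;
    profiles_[name_] += elapsed.count();
  }

 private:
  std::string name_;
  profile_map& profiles_;
  std::chrono::steady_clock::time_point start_;
};

int main() {
  profile_map profiles;  // the one-off empty std::map at model initialization
  for (int i = 0; i < 1000; ++i) {
    // one construction + destruction per loop iteration
    scoped_profile p("predictor", profiles);
    volatile double mu = 0.5 + 1.5 * i;  // stand-in for the profiled work
    static_cast<void>(mu);
  }
  std::cout << "predictor: " << profiles["predictor"] << " s\n";
  return 0;
}
```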
@adrian-lison do you have an example model where profiling itself is a non-trivial overhead compared to the evaluation of the density itself? Glancing at the code, it seems like it should be very cheap, but I'm happy to be proven wrong and look into this feature request more.
Hey @WardBrian, thanks for your reply. I would agree that in many cases, the overhead is probably negligible compared to the density evaluation. It should only become an issue when lots of profiling callbacks are made. A very contrived example would be the following linear regression model.
```stan
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
transformed parameters {
  vector[N] mu;
  for (i in 1:N) {
    profile("predictor") { // comment out for no profiling
      mu[i] = alpha + beta * x[i];
    }
  }
}
model {
  alpha ~ std_normal();
  beta ~ std_normal();
  sigma ~ std_normal(); // half-normal by constraint
  y ~ normal(mu, sigma);
}
```
With this model, I get an approx. 5-6x runtime increase per chain when profiling (tested with 1000, 2500 and 5000 data points). Obviously, this model implementation is very stupid. The question is whether there are many realistic scenarios in which people would be forced to use a lengthy loop and want to profile a specific subsection of that loop. This could then be a relevant case where the profiling overhead starts to matter.
One potential example from my domain is infectious disease modeling, where we often require loops to model infection and ascertainment processes, e.g. here, and if the individual computations within the loop become more complicated, it might be interesting to profile them (we haven't specifically assessed profiling overhead there however). I also wonder how profiling within map(-reduce) function calls would play out.
I see two potential solutions to this:
- Wrap the body of the constructor of the `profile` class and the entire destructor in an `#ifdef` (see the sketch after this list).
- Add a flag to the Stan compiler to skip emitting the profile objects entirely.
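For concreteness, option 1 might look roughly like this (STAN_NO_PROFILING is a hypothetical macro name chosen by analogy with STAN_NO_RANGE_CHECKS, and the class is a simplified stand-in, not the actual stan::math sources):

```cpp
#include <chrono>
#include <map>
#include <string>

// Sketch of option 1: keep the profile class (so generated model code
// compiles unchanged), but compile the timing work out of the constructor
// body and the destructor when the hypothetical STAN_NO_PROFILING macro
// is defined.
using profile_map = std::map<std::string, double>;

class profile {
 public:
  profile(const std::string& name, profile_map& profiles)
      : name_(name), profiles_(profiles) {
#ifndef STAN_NO_PROFILING
    start_ = std::chrono::steady_clock::now();
    // the real constructor also registers callbacks on the autodiff stack
#endif
  }

  ~profile() {
#ifndef STAN_NO_PROFILING
    std::chrono::duration<double> elapsed
        = std::chrono::steady_clock::now() - start_;
    profiles_[name_] += elapsed.count();
#endif
  }

 private:
  std::string name_;
  profile_map& profiles_;
  std::chrono::steady_clock::time_point start_;
};
```

With option 1 the generated model C++ is identical whether or not profiling is enabled, which is exactly why the makefile has to notice when the macro changes; option 2 instead changes what stanc emits, which is why stanc would need to be re-run when the flag changes.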
The first seems more in line with how we currently do things (e.g. STAN_NO_RANGE_CHECKS); however, this also necessitates some makefile hackery to try to get the file to re-build if you change the setting, and it does not always work. You can see this now with something like STAN_THREADS:
```console
$ make clean
$ make examples/bernoulli/bernoulli
# builds
$ make STAN_THREADS=true examples/bernoulli/bernoulli
# triggers a re-build
$ make examples/bernoulli/bernoulli
make: 'examples/bernoulli/bernoulli' is up to date.
```
A compiler option would skip the object creation entirely and in theory avoid this re-compilation issue (which can actually be even more subtle than the above if `make clean` isn't run first), but in practice stanc is not re-run if the stancflags change.