
Stepsize no longer reset to 1 if term_buffer = 0 (Issue #3023)

Open bbbales2 opened this issue 4 years ago • 28 comments

Submission Checklist

  • [x] Run unit tests: ./runTests.py src/test/unit
  • [x] Run cpplint: make cpplint
  • [x] Declare copyright holder and open-source license: see below

Summary

Fix for #3023

Side Effects

I changed how the return codes from covar_adaptation::learn_covariance and var_adaptation::learn_variance work.

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Columbia University

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

  • Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
  • Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

bbbales2 avatar Mar 19 '21 15:03 bbbales2

@funko-unko want to verify this fixes the problem?

To get a copy of cmdstan with this patch do:

git clone --recursive https://github.com/stan-dev/cmdstan cmdstan-issue-3023
cd cmdstan-issue-3023/stan
git checkout bugfix/issue-3023
cd ../
make build

bbbales2 avatar Mar 19 '21 15:03 bbbales2


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.47 3.42 1.01 1.43% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.98 -2.46% slower
eight_schools/eight_schools.stan 0.11 0.11 0.96 -3.67% slower
gp_regr/gp_regr.stan 0.16 0.16 1.0 -0.5% slower
irt_2pl/irt_2pl.stan 5.31 5.23 1.01 1.42% faster
performance.compilation 91.43 88.53 1.03 3.17% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 9.01 8.89 1.01 1.3% faster
pkpd/one_comp_mm_elim_abs.stan 31.32 30.09 1.04 3.91% faster
sir/sir.stan 130.41 131.79 0.99 -1.06% slower
gp_regr/gen_gp_data.stan 0.04 0.04 0.97 -3.6% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 3.13 3.15 0.99 -0.61% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.37 0.39 0.94 -6.04% slower
arK/arK.stan 1.91 1.9 1.01 0.65% faster
arma/arma.stan 0.63 0.71 0.89 -12.22% slower
garch/garch.stan 0.52 0.51 1.02 2.17% faster
Mean result: 0.99084398021

Jenkins Console Log Blue Ocean Commit hash: 4b0ae48d60f96f6435902bb136fbb8c3e55e7de9


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU: Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++: Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1 Apple LLVM version 7.0.2 (clang-700.1.81) Target: x86_64-apple-darwin15.6.0 Thread model: posix

Clang: Apple LLVM version 7.0.2 (clang-700.1.81) Target: x86_64-apple-darwin15.6.0 Thread model: posix

stan-buildbot avatar Mar 19 '21 17:03 stan-buildbot

I'll try it out later.

funko-unko avatar Mar 19 '21 19:03 funko-unko

Changing the return code means that stepsize adaptation does not restart after the last metric window. The last window is usually longer than term_buffer (the defaults give last window size 500 compared to term_buffer=50). Does that have some bad side effects?

I think it would be more conservative to simply change the restart value of x_bar_. https://github.com/stan-dev/stan/blob/37ad8c36b83299d04fd9da824d324fa6517fc6e0/src/stan/mcmc/stepsize_adaptation.hpp#L49-L53 The first call to learn_stepsize() overwrites x_bar_ anyway so the initial value here is irrelevant unless you call complete_adaptation() immediately without learn_stepsize() in between (i.e. term_buffer = 0), in which case the stepsize becomes exp(x_bar_). The sampler sets mu_ = log(10*stepsize) right before restart() so the natural restart value for x_bar_ would be mu_ - log(10).
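The failure mode described above can be condensed into a toy model (member names mirror Stan's stepsize_adaptation for readability, but this is a sketch, not the real class): restart() zeroes x_bar_, so if complete_adaptation() runs before any learn_stepsize() call (i.e. term_buffer = 0), the reported stepsize is exp(0) = 1.

```cpp
#include <cassert>
#include <cmath>

// Toy model of the restart behaviour under discussion; not Stan's class.
struct toy_stepsize_adaptation {
  double mu_ = 0;
  double x_bar_ = 0;  // running average of log stepsize
  int counter_ = 0;

  void set_mu(double m) { mu_ = m; }
  void restart() {
    counter_ = 0;
    x_bar_ = 0;  // the reset in question; the proposal is mu_ - log(10) instead
  }
  void complete_adaptation(double& epsilon) { epsilon = std::exp(x_bar_); }
};

double stepsize_after_empty_term_buffer(double epsilon) {
  toy_stepsize_adaptation adapt;
  adapt.set_mu(std::log(10 * epsilon));  // sampler does this before restart()
  adapt.restart();
  adapt.complete_adaptation(epsilon);    // no learn_stepsize() in between
  return epsilon;                        // always 1, whatever epsilon was
}
```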

nhuurre avatar Mar 20 '21 16:03 nhuurre

Is the trick that:

x_bar_ = (1.0 - x_eta) * x_bar_ + x_eta * x;

1.0 - x_eta is always zero on the first iteration because counter_ is always 1?

I think the annoying thing here is that we deal with the magic constant 10 in two places. I wouldn't mind getting rid of the 10x and restarting with x_bar_ equal to mu_. Regarding this:

does not restart after the last metric window.

I think this should still restart the step size adaptation if there is a term buffer following. At least that was the intention. If there is no term buffer, there is no reset.

The logic to detect the end of the last window with a term buffer is:

void compute_next_window() {
    if (adapt_next_window_ == num_warmup_ - adapt_term_buffer_ - 1)
      return;

And the finished thing here is detecting instead:

bool finished() {
    if (adapt_window_counter_ + 1 >= num_warmup_) {
      return true;

bbbales2 avatar Mar 20 '21 17:03 bbbales2

1.0 - x_eta is always zero on the first iteration cause counter is always 1?

Yes

I think the annoying thing here is that we deal with the magic constant 10 in two places.

No, not two.

https://github.com/stan-dev/stan/blob/37ad8c36b83299d04fd9da824d324fa6517fc6e0/src/stan/mcmc/hmc/nuts/adapt_dense_e_nuts.hpp#L38 https://github.com/stan-dev/stan/blob/37ad8c36b83299d04fd9da824d324fa6517fc6e0/src/stan/mcmc/hmc/nuts/adapt_diag_e_nuts.hpp#L38 https://github.com/stan-dev/stan/blob/37ad8c36b83299d04fd9da824d324fa6517fc6e0/src/stan/services/sample/hmc_nuts_dense_e_adapt.hpp#L90 https://github.com/stan-dev/stan/blob/37ad8c36b83299d04fd9da824d324fa6517fc6e0/src/stan/services/sample/hmc_nuts_diag_e_adapt.hpp#L89

(And that's just nuts, there's also static_hmc)

I wouldn't mind getting rid of the 10x

Related to issue #2670

That 10x isn't the only janky thing about this. init_stepsize() seems to hardcode adapt_delta=0.8 https://github.com/stan-dev/stan/blob/37ad8c36b83299d04fd9da824d324fa6517fc6e0/src/stan/mcmc/hmc/base_hmc.hpp#L103 (and shouldn't that be abs(delta_H) > abs(std::log(0.8))?)

I think the "correct" stepsize multiplier when changing windows is probably something like the largest eigenvalue of (old metric / new metric).

I think this should still restart the step size adaptation if there is a term buffer following.

Ah, you're right. adapt_window_counter_ doesn't reach num_warmup_ if term_buffer_ > 0. Nevermind then.

nhuurre avatar Mar 20 '21 18:03 nhuurre

The current behavior is intended.

After the final windowed adaptation update there is little guidance on the step size for the terminal buffer window. The step size is initialized to 1, which is roughly near the optimal value for a low-dimensional multivariate normal target if the metric has been adapted perfectly. More generally, however, the optimal step size for sampling purposes is not simply related to the eigenvalues of M^{-1} Covar(target), and hence the scaling in the optimal step size is not simply related to the eigenvalues of M_{new}^{-1} M_{old}.

The dual averaging adaptation sets mu to 10 times the initial epsilon to encourage more aggressive adaptation early on; this is a common heuristic for setting mu when optimizing.

The objective of the step size initialization is only to find a rough value and the precise adaptation target is not especially important. In particular the initialization considers the acceptance probability at one step which is not characteristic of the acceptance probability averaged over the entire trajectory (either uniformly weighted as is currently implemented or properly weighted as it should be implemented).

Because there is no great way to predict a good step size after updating the inverse metric, a non-zero terminal buffer window was always intended to be run. Given that users are running without a terminal buffer window, a message warning users that the step size will not be adapted when term_buffer = 0 would certainly be helpful.

betanalpha avatar Mar 23 '21 19:03 betanalpha

I don't know whether I'm the only one doing this, but please don't issue a warning and please report the last used step size.

Thanks.

funko-unko avatar Mar 23 '21 21:03 funko-unko

The current behavior is intended.

Well that doesn't mean we can't change it.

I don't really see the importance of setting this to 1.0. When we continue adaptation we're setting it to 10x the old timestep anyway. If anyone wants to use the stepsize 1 for something, they can just use stepsize 1.

Sure, the adaptation might be bad, but the behavior is confusing, stepsize = 1 will probably be bad too, and this is easy to fix.

bbbales2 avatar Mar 23 '21 23:03 bbbales2

I agree with @bbbales2 & @funko-unko that even though this is not a practice we want to recommend, if someone wants to set term_buffer=0 then the outcome of stepsize=1 violates the principle of least surprise.

yizhang-yiz avatar Mar 24 '21 18:03 yizhang-yiz

My use case is of course to use only the first two adaptation windows on a series of converging/contracting posteriors, and only for the last one to use term_buffer!=0, as described here https://discourse.mc-stan.org/t/fitting-ode-models-best-efficient-practices/21018/60?u=funko_unko

funko-unko avatar Mar 25 '21 06:03 funko-unko

The step size is initialized to 1, which is roughly near the value for a multivariate normal target of low dimension if the metric has been adapted perfectly. More generally, however, the optimal step size for sampling purposes is not simply related to the eigenvalues of M^{-1} Covar(target) and hence the scaling in the optimal step size is not simply related the eigenvalues of M_{new}^{-1} M_{old}.

I thought more about this because it feels like a multivariate normal target with perfect adaptation should be simple enough to solve analytically. Working with a small-stepsize approximation I got fairly strong bounds for the single-step accept probability, but of course that may not be indicative of longer trajectories or more complicated target distributions. Notes attached if someone wants to check my math: stepsize-bounds.pdf

nhuurre avatar Mar 27 '21 10:03 nhuurre

@betanalpha Any further thoughts on this? I would like to make this change.

bbbales2 avatar Mar 30 '21 22:03 bbbales2

Oh, right, I'll have a look whenever my current computation finishes...

funko-unko avatar Mar 31 '21 08:03 funko-unko

To directly address #3023

I added a direct test. It seems a little weird without context (like someone coming along might not understand why the stepsize should not be 1), so I'm down to add a comment about why it is there, but it serves the purpose.

Btw don't approve or merge this until @betanalpha okays it or we just disagree and vote on it. Thanks for the review though.

bbbales2 avatar Apr 03 '21 20:04 bbbales2


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.39 3.34 1.01 1.34% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.99 -1.04% slower
eight_schools/eight_schools.stan 0.12 0.11 1.04 4.03% faster
gp_regr/gp_regr.stan 0.16 0.16 1.01 0.61% faster
irt_2pl/irt_2pl.stan 5.95 6.0 0.99 -0.74% slower
performance.compilation 91.11 89.51 1.02 1.76% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.63 8.74 0.99 -1.31% slower
pkpd/one_comp_mm_elim_abs.stan 28.63 28.54 1.0 0.32% faster
sir/sir.stan 123.0 136.59 0.9 -11.05% slower
gp_regr/gen_gp_data.stan 0.03 0.04 0.98 -1.68% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 3.12 3.06 1.02 1.96% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.4 0.4 1.0 -0.2% slower
arK/arK.stan 1.87 1.88 1.0 -0.37% slower
arma/arma.stan 0.85 0.63 1.34 25.4% faster
garch/garch.stan 0.6 0.56 1.08 7.42% faster
Mean result: 1.02476505228

Jenkins Console Log Blue Ocean Commit hash: 40ee1c23a0eb58472153d83b0256a37894a2bbdd



stan-buildbot avatar Apr 03 '21 21:04 stan-buildbot

Let me reiterate the intended behavior of the adaptation.

The init buffer gives the sampler time to find the typical set, adapting the step size along the way to facilitate the search.

The intermediate adaptation windows try to aggregate information about the global covariance structure. In these windows the inverse metric is not adapted dynamically, which is prone to unstable behavior, but the step size is. After each adaptation window the inverse metric is updated with the aggregated information and the step size is reset to a default near 1 because there is no other decent guess.

Because of this compartmentalization the intermediate windows never leave the sampler configuration in a good state. This isn’t a problem when applying another intermediate window because the step size is adapted within those windows, but it is when the final intermediate window has completed. The entire role of the terminal buffer is to learn a final step size and bring the sampler back into a good state. In other words, term_buffer = 0 is not appropriate whenever there are intermediate adaptation windows.

term_buffer = 0 was included in case the adaptation windows were turned off. In other words, if one wanted to run only an initial buffer window to learn a step size but keep the inverse metric at its initial value (either the identity or whatever is specified externally).

A warning if term_buffer = 0 but window != 0, as well as additional documentation, is entirely appropriate here. Trying to hack the adaptation to allow fragile, unintended use cases is not.

I agree with @bbbales2 & @funko-unko that even though it's not a practice we want to recommend, if someone wants to set term_buffer=0 the outcome of stepsize=1 violates the least surprise principle (https://en.wikipedia.org/wiki/Principle_of_least_astonishment).

The problem with the least surprise principle in general is that it’s subjective; surprise will always be defined by prior expectations which are not uniform across individuals.

For example here the problem isn’t in the behavior of the terminal adaptation buffer but rather in misconceptions about how the intermediate adaptation windows work.

My use case is of course to use only the first two adaptation windows on a series of converging/contracting posteriors, and only for the last one to use term_buffer!=0, as described here https://discourse.mc-stan.org/t/fitting-ode-models-best-efficient-practices/21018/60?u=funko_unko

Did you find any documentation explaining that this is a valid use case of the adaptation?

I thought more about this because it feels like a multivariate normal target with perfect adaptation should be simple enough to solve analytically. Working with small stepsize approximation I got fairly strong bounds for single-step accept probability but of course that may not be indicative of longer trajectory or more complicated target distribution. Notes attached if someone wants to check my math.

The exact result can be worked out by generalizing the derivation in A.3 of https://arxiv.org/abs/1411.6669 to include non-identity covariance and/or metric.

Is the jump from || . ||_{-} to | | correct here? L^{-1} M^{-1} isn’t positive definite and so the distribution of acceptance probabilities shouldn’t be symmetric. Indeed even in the normal case the Hamiltonian error doesn’t oscillate symmetrically around zero but rather is biased towards larger energies.

In any case as you note the limitation here is that the result holds only for one short step, but that short step recovers the Langevin limit. The optimal Langevin acceptance probability and step size are not the same for Hamiltonian Monte Carlo with nontrivial integration times.

betanalpha avatar Apr 05 '21 16:04 betanalpha

After each adaptation window the inverse metric is updated with the aggregated information and the step size is reset to a default near 1

Actually, the stepsize does not reset to one between the windows. Instead the algorithm invokes init_stepsize(), which doubles or halves the stepsize a few times, aiming for a Langevin-step accept probability in the neighbourhood of 0.8. If you have term_buffer >= 1 this is the stepsize you get. The "surprising" reset-to-one behaviour is the result of how the stepsize adapter restarts. The initial value of x_bar_ is set to zero instead of the log of the current stepsize. The initial value is essentially irrelevant because the dual averaging gives it zero weight; the first real update erases it completely, unless the window is empty and no update ever happens.
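The doubling/halving search can be sketched as a self-contained toy, modelled loosely on base_hmc::init_stepsize. The accept_prob callback stands in for "single-step accept probability at this stepsize"; it is not Stan's API, and the real routine works with log accept probabilities rather than a boolean threshold.

```cpp
#include <cassert>
#include <cmath>

// Toy sketch: pick a direction from the starting point, then keep
// doubling (or halving) the stepsize until the single-step accept
// probability crosses the 0.8 target.
template <class AcceptProb>
double toy_init_stepsize(double epsilon, AcceptProb accept_prob) {
  bool go_up = accept_prob(epsilon) > 0.8;  // true: stepsize can grow
  while (true) {
    bool high = accept_prob(epsilon) > 0.8;
    if (go_up != high)  // crossed the target: stop here
      return epsilon;
    epsilon *= go_up ? 2.0 : 0.5;
  }
}
```

With a monotone stand-in accept model, starting at epsilon = 0.01 the toy doubles up until the accept probability first drops to 0.8 or below.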

Is the jump from || . ||_{-} to | | correct here? L^{-1} M^{-1} isn’t positive definite and so the distribution of acceptance probabilities shouldn’t be symmetric. Indeed even in the normal case the Hamiltonian error doesn’t oscillate symmetrically around zero but rather is biased towards larger energies.

I'm only using the leading order epsilon^3 term for the single-step energy error. That term is linear in p and the distribution of p is certainly symmetric. The bias you note is of order epsilon^4 so it's negligible in this approximation. (The approximation may well be too crude to be interesting.)

nhuurre avatar Apr 05 '21 18:04 nhuurre

step size is reset to a default near 1 because there is no other decent guess

As far as I know the initial guess for the stepsize in the next interval is 10x the last one.

in misconceptions about how the intermediate adaptation windows work.

Nah nah I don't think it's a misconception -- I see what you're saying here (after a metric is updated, because the stepsize corresponds to the previous metric we'd want to update it).

I don't think that changing the behavior for term_buffer = 0 is going to compromise anything. Like sure the sampler is in a dubious state, but the user requested term_buffer = 0 so, yeah, that's intentional, and they're following up with more adaptation where the stepsize can change (and doing a non-zero term-buffer at the end of it all).

(The approximation may well be too crude to be interesting.)

I do like the idea of not doing the 10x, which seems pretty crude.

@funko-unko how are you currently working around this?

bbbales2 avatar Apr 06 '21 04:04 bbbales2

Oh yeah I guess -- is everyone here comfortable if we just try to do a vote on this? Like, the SGB rules say votes for major issues and this seems quite small, but I'm happy just manufacturing a decision and rolling with it rather than arguing back and forth too long.

If I'm stopping the discussion too early just say so.

Niko's timestep adaptation seems interesting re: the discussions here + the thing here, so either way this PR goes we can still keep that in mind.

bbbales2 avatar Apr 06 '21 04:04 bbbales2

As far as I know the initial guess for the stepsize in the next interval is 10x the last one.

It's not. See this code https://github.com/stan-dev/stan/blob/e064647320f4a0837904bc98103c5aac919ef914/src/stan/mcmc/hmc/nuts/adapt_diag_e_nuts.hpp#L36-L38 The call to init_stepsize() adjusts the stepsize between the intervals and set_mu does not set the initial guess but some kind of shrinkage target.

is everyone here comfortable if we just try to do a vote on this?

I don't think we're at the point where people agree to disagree. It looks more like people can't even agree what the current behaviour is.

I made this plot with init_buffer=20 term_buffer=20 window=30. Here epsilon is the current stepsize, mu is the shrinkage target stepsize and xbar is the stepsize we would have if adaptation stopped right now.

[plot "adapt-stepsize": traces of epsilon, mu, and xbar over warmup]

  • epsilon moves continuously
  • mu is constant over the whole window, fixed at 10x the first epsilon.
  • xbar resets to 1.0 between windows but then immediately jumps to meet epsilon. Upthread I proposed changing that last part--just make xbar restart at epsilon.
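The update rule behind these three traces can be condensed into a small self-contained sketch, using Stan's default tuning constants (gamma = 0.05, t0 = 10, kappa = 0.75). The struct is illustrative, not Stan's stepsize_adaptation class:

```cpp
#include <cassert>
#include <cmath>

// Condensed dual-averaging update: epsilon is the current stepsize,
// mu_ the shrinkage target, x_bar_ the stepsize reported on completion.
struct dual_avg {
  double mu_ = 0;     // shrinkage target (log scale)
  double s_bar_ = 0;  // running average of (delta - accept_stat)
  double x_bar_ = 0;  // running average of log stepsize
  double gamma_ = 0.05, t0_ = 10, kappa_ = 0.75;
  int counter_ = 0;

  void learn_stepsize(double& epsilon, double adapt_stat, double delta = 0.8) {
    ++counter_;
    double eta = 1.0 / (counter_ + t0_);
    s_bar_ = (1.0 - eta) * s_bar_ + eta * (delta - adapt_stat);
    double x = mu_ - s_bar_ * std::sqrt(double(counter_)) / gamma_;
    double x_eta = std::pow(double(counter_), -kappa_);
    x_bar_ = (1.0 - x_eta) * x_bar_ + x_eta * x;
    epsilon = std::exp(x);  // the "epsilon" trace in the plot
  }
};
```

A single learn_stepsize() call with adapt_stat exactly at delta leaves s_bar_ at zero, so epsilon lands on exp(mu_) and x_bar_ jumps straight to mu_, matching the "xbar immediately jumps to meet epsilon" behaviour in the traces.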

nhuurre avatar Apr 06 '21 07:04 nhuurre

@bbbales2

@funko-unko how are you currently working around this?

Currently I'm just using the step sizes used for the last sample as extracted from the output csv to pass to the next warmup step. (Actually currently I compute the mean across chains, but really I haven't settled on anything yet).

@betanalpha

Did you find any documentation explaining that this is a valid use case of the adaptation?

No, do you think this is an invalid use case? Is there any metric that could answer that question?

funko-unko avatar Apr 06 '21 10:04 funko-unko

set_mu does not set the initial guess but some kind of shrinkage target.

I'll have to read up on what the dual averaging thing does. I'm clueless here. Thanks for the plot.

Currently I'm just using the step sizes used for the last sample as extracted from the output csv

So this is the stepsize pre-stepsize reset, right? So you're able to get the thing you need it's just kinda inconvenient?

bbbales2 avatar Apr 06 '21 15:04 bbbales2

term_buffer = 0 was included in case the adaptation windows were turned off. In other words, if one wanted to run only an initial buffer window to learn a step size but keep the inverse metric at its initial value (either the identity or whatever is specified externally).

I double-checked and that's not supported either. Regardless of whether window is zero or not, term_buffer=0 causes the adaptive sampler to reset stepsize to 1. And window=0 also forces the metric to 0.001 times the identity matrix. This PR does not fix the metric reset issue.

nhuurre avatar Apr 06 '21 15:04 nhuurre

So this is the stepsize pre-stepsize reset, right?

I think so, yes.

So you're able to get the thing you need it's just kinda inconvenient?

I wouldn't say I get what I need, as I do not know what I need. Using a stepsize of one did appear to work. Using the chainwise last step size appeared to work slightly better. Using the mean/median across chains appeared to work slightly better still. Using the maximal step size across chains worked well sometimes, badly other times.

funko-unko avatar Apr 07 '21 09:04 funko-unko

I double-checked and that's not supported either. Regardless of whether window is zero or not, term_buffer=0 causes the adaptive sampler to reset stepsize to 1. And window=0 also forces the metric to 0.001 times the identity matrix. This PR does not fix the metric reset issue.

There are two issues here.

One is the behavior of dual averaging for vanishing window sizes (the term_buffer = 0 problem). This can be resolved in two ways, both of which require a second restart signature with an initial step size argument. The initial step size can either be used to set x_bar_ = std::log(epsilon) instead of zero, which changes the optimization behavior and will have to be tested externally, or it can be saved as a member variable with complete_adaptation updated to something like

void complete_adaptation(double& epsilon) {
  if (counter_ == 0)
    epsilon = initial_epsilon_;
  else
    epsilon = std::exp(x_bar_);
}

which does not change the optimization behavior.

The other is that when adapt_base_window_ = 0 the window termination code is called after every iteration between adapt_init_buffer and adapt_term_buffer_, which continuously restarts the step size adaptation and tries to update the inverse metric. In some sense the behavior with adapt_base_window = 0 and adapt_init_buffer + adapt_term_buffer < num_warmup is undefined, but making any iterations between adapt_init_buffer and num_warmup - adapt_term_buffer a no-op is probably the best option.

The quick fix here is changing end_adaptation_window mcmc/windowed_adaptation.hpp from

  bool end_adaptation_window() {
    return (adapt_window_counter_ == adapt_next_window_)
           && (adapt_window_counter_ != num_warmup_);
  }

to

  bool end_adaptation_window() {
    return (adapt_window_counter_ == adapt_next_window_)
           && (adapt_window_counter_ != num_warmup_)
           && (adapt_base_window_ != 0);
  }

and adding a warning to set_window_params like

"WARNING: Because base_window = 0 only a single step size adaptation will be performed across all warmup transitions."

betanalpha avatar Apr 15 '21 18:04 betanalpha

when adapt_base_window_ = 0 the window termination code is called after every iteration between adapt_init_buffer and adapt_term_buffer_ which continuously restarts the step size adaptation

Not quite. adapt_next_window_ is not the first iteration of the next window but the last iteration of the current window. When adapt_base_window_ = 0 it's always the last iteration of the init buffer. There is only one reset if init_buffer > 0 and no reset at all if init_buffer = 0. In effect, the term buffer extends to fill the void left by the missing metric adaptation windows. I do like your quick fix. It makes it clear that the behavior is intentional and also prevents the reset at the end of the init buffer. That solves the same problem as #3037 in a cleaner manner.

The initial step size can either be used to set x_bar_ = std::log(epsilon) instead of zero, which changes the optimization behavior

The initial value of x_bar_ does not change the optimization behavior. learn_stepsize() updates x_bar_ according to

x_eta = std::pow(counter_, -kappa_);
x_bar_ = (1.0 - x_eta) * x_bar_ + x_eta * x;

The first update has counter_ = 1. Regardless of the value of kappa_ the weight (1.0 - x_eta) is 0.0 and any finite initial value of x_bar_ is completely erased.
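A one-line numerical check of this claim (hypothetical free function; the real update lives inside stepsize_adaptation::learn_stepsize):

```cpp
#include <cassert>
#include <cmath>

// With counter_ = 1 we get x_eta = pow(1, -kappa) = 1 for any kappa,
// so the weight (1 - x_eta) on the old x_bar_ is exactly zero and the
// initial value is erased.
double first_xbar_update(double x_bar_init, double x, double kappa) {
  int counter = 1;  // first learn_stepsize() call after restart()
  double x_eta = std::pow(double(counter), -kappa);
  return (1.0 - x_eta) * x_bar_init + x_eta * x;
}
```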

I did notice a couple of places where the optimization behavior probably should be changed. (or at least doesn't make sense to me)

  1. The first is visible in the plot I posted upthread ([plot: adapt-stepsize]). You can see that mu_ (the green line) is higher in the init buffer than in the term buffer even though the init buffer starts with a smaller stepsize than the term buffer. What's going on? Isn't mu_ supposed to be 10 times the initial stepsize? The answer is that for the initial window mu_ is set before the first init_stepsize() but for later windows mu_ resets after init_stepsize(). That seems inconsistent. The stepsize is not very good before init_stepsize().

  2. This discussion has been concerned with the "surprising" result when term_buffer = 0. But let's take a quick look at term_buffer = 1. Of course the step size then depends on what the accept_stat__ is for that single adaptation transition, but I think that if the accept_stat__ just happens to equal adapt_delta then the "least surprising" result is that the step size does not change. It might not be exactly optimal but there's no clear reason to move it in either direction, right? The current behavior in that scenario is to set the step size to exp(mu_), i.e. 10 times the initial stepsize, which feels a bit too large. There's an easy fix: make restart() initialize s_bar_ = gamma*log(10) instead of the current s_bar_ = 0. The most obvious side effect is removing the high spike right after every metric update. [plot: stepsize-trace]

  3. The algorithm for updating the step size during adaptation is (n is the iteration number (counter_))

eta   <- 1/(t0 + n)
s_bar <- s_bar*(1-eta) + eta*(adapt_delta - accept_stat)
x     <- mu - s_bar * eta*sqrt(n)/gamma
epsilon <- exp(x)

Alternatively this can be written as

x <- mu + (x - mu)*(1-1/(t0+n))*sqrt(n/(n-1)) - (adapt_delta - accept_stat)*sqrt(n)/(gamma*(t0+n))

The last term is identical to the adaptation scheme in section 3.2 of the Hoffman and Gelman (2014) NUTS paper. The first term just regularizes the stepsize towards mu. But wait, the multiplier (1-1/(t0+n))*sqrt(n/(n-1)) is larger than 1.0 when n < t0 so for the first ten iterations it pushes the stepsize away from mu rather than pulling towards mu. That can't be right, can it?
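A quick numerical check of that multiplier (the helper name is mine; valid for n >= 2, with Stan's default t0 = 10):

```cpp
#include <cassert>
#include <cmath>

// The factor (1 - 1/(t0 + n)) * sqrt(n / (n - 1)) from the rewritten
// update above. Values above 1 push x away from mu; below 1 pull it in.
double mu_shrinkage_factor(double t0, int n) {
  return (1.0 - 1.0 / (t0 + n)) * std::sqrt(double(n) / (n - 1.0));
}
```

For t0 = 10 the factor is above 1 at small n (e.g. n = 2 and n = 5) and drops below 1 only later (e.g. n = 50), consistent with the observation that the early iterations push the stepsize away from mu.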

nhuurre avatar Apr 16 '21 18:04 nhuurre

Not quite. adapt_next_window_ is not the first iteration of the next window but the last iteration of the current window. When adapt_base_window_ = 0 it's always the last iteration of the init buffer. There is only one reset if init_buffer > 0 and no reset at all if init_buffer = 0. In effect, the term buffer extends to fill the void left by the missing metric adaptation windows.

Technically when adapt_base_window = 0 the next window is always set to the current iteration via adapt_window_counter_

adapt_next_window_ = adapt_window_counter_ + adapt_window_size_ 
                                   = adapt_window_counter_  + 0 
                                   = adapt_window_counter_ ;

but adapt_window_counter_ is incremented after compute_next_window() is called so on the next iteration

adapt_window_counter_ == adapt_next_window_
                      == old_adapt_window_counter_
                      == adapt_window_counter_ - 1

will evaluate to false and the adaptation will not trigger an update. But regardless the outcome is the same — only one reset immediately after the init_buffer is completed.

The initial step size can either be used to set x_bar_ = std::log(epsilon) instead of zero, which changes the optimization behavior

The initial value of x_bar_ does not change the optimization behavior. learn_stepsize() updates x_bar_ according to

x_eta = std::pow(counter_, -kappa_);
x_bar_ = (1.0 - x_eta) * x_bar_ + x_eta * x;

The first update has counter_ = 1. Regardless of the value of kappa_ the weight (1.0 - x_eta) is 0.0 and any finite initial value of x_bar_ is completely erased.

Sorry, yes, you are correct.

The answer is that for the initial window mu_ is set before the first init_stepsize() but for later windows mu_ resets after init_stepsize(). That seems inconsistent. The stepsize is not very good before init_stepsize().

Remember that the step size is adapted to the average acceptance statistic across the entire parameter space. The step size initialization heuristic approximates this average by considering just one step and just the single initial point. If that point is far outside of the typical set, however, the local accept statistic won’t say much about the average acceptance statistic. I believe the current asymmetry is intentional: set the initial mu based on a user-specified stepsize instead of the behavior around the initial point, which may be far outside of the typical set, and then set mu based on the step size initialization heuristic only after the init_buffer has completed and the sampler has a decent shot of reaching the typical set.

• This discussion has been concerned with the "surprising" result when term_buffer = 0. But let's take a quick look at term_buffer = 1. Of course the step size then depends on what the accept_stat__ is for that single adaptation transition but I think that if the accept_stat__ just happens to equal adapt_delta then the "least surprising" result is that the step size does not change. It might not be exactly optimal but there's no clear reason to move it in either direction, right? The current behavior in that scenario is to set the step size to exp(mu_), i.e. 10 times the initial stepsize which feels a bit too large. There's an easy fix: make restart() initialize s_bar_ = gamma*log(10) instead of the current s_bar_ = 0. The most obvious side effect is removing the high spike right after every metric update.

Any average-based adaptation mechanism is going to be a bit unstable after only one iteration.

Primal-dual averaging is set up to shrink the adapted parameter towards mu, with gamma setting the strength of that shrinkage. After one iteration where the acceptance statistic is exactly delta, defaulting to that shrinkage value is a natural response. This appears aggressive only because mu is set to be aggressive; those spikes are the adaptation aggressively trying out larger values of epsilon near mu to see if they’re any better before settling down to a smaller value.

The problem with changing gamma is that it affects the shrinkage towards mu through the entire adaptation sequence, not just after a few iterations. If someone wants to run with a very small terminal buffer then they can set gamma to a very small value so that the shrinkage to mu is much weaker.

But wait, the multiplier (1-1/(t0+n))*sqrt(n/(n-1)) is larger than 1.0 when n < t0 so for the first ten iterations it pushes the stepsize away from mu rather than pulling towards mu. That can't be right, can it?

For n = 0 we have (1 - eta) = (1 - 1 / (t0 + n) ) = 1 - 1 / t0 < 1 because t0 > 0.

For n > 0 we have (1 - eta) = 1 - 1 / (t0 + n) which is also bounded between 0 and 1 because both t0 and n are positive.

betanalpha avatar May 03 '21 18:05 betanalpha