manual longer term
This is the issue for things that would be nice to get into the manual in the future. Rather than closing this issue, we'll just tick things off as part of the regular round of "next manual" issues.
Originally from Jeffrey Arnold here: https://github.com/stan-dev/stan/issues/1081#issuecomment-63598018
Regarding the Kalman filter examples, I have examples and fully implemented Kalman filtering (of several flavors: with / without missing values, as batch / sequentially) with backward sampling in the generated quantities block here: https://github.com/jrnold/ssmodels-in-stan. I could probably write up that section.
- [ ] reparameterize Gaussian process example to use length scale with a proper prior (e.g., half-t(4))
- [ ] add horseshoe prior example using Allen's implementation: https://ariddell.org/horseshoe-prior-with-stan.html
- [ ] explore what happens with L1 priors in Stan's MLE to see if we get truly 0 values or close; I'm not sure what L-BFGS will do
- [ ] add section or new chapter about arithmetic precision
- [ ] include the following example I sent to stan-users:
For example, this model is problematic arithmetically:
```
fit <- stan(model_code = "parameters { real y; } model { y ~ normal(1,1e-20); }")
```
and produces
```
          mean  se_mean       sd      2.5%       25% 50% 75% 97.5% n_eff         Rhat
y            1      0.0        0         1         1   1   1     1  4000          Inf
lp__ -15407440 422002.9 26689805 -61629758 -15407440   0   0     0  4000 1.500106e+14
```
There's really nothing we can do about such models numerically. What you have to do from a user perspective is use the reparameterization z = y - 1, which can be handled stably because numbers close to zero can be represented accurately.
```
parameters { real z; } model { z ~ normal(0,1e-20); }
```
The problem is that there's still no way to represent y = z + 1 to better than about 1e-16 relative precision in double-precision floating-point arithmetic (adding 1 to a value as small as 1e-20 just rounds back to 1), but you can compute log(y) accurately using
```
log_y <- log1p(z);
```
- [ ] provide examples of log_mix for the manual's mixture section
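As a possible starting point, here is a minimal sketch of a two-component normal mixture written with log_mix; the data block and the priors are placeholders, not taken from the manual.
```
data {
  int<lower=0> N;
  real y[N];
}
parameters {
  real<lower=0,upper=1> lambda;   // mixing proportion for component 1
  ordered[2] mu;                  // component locations (ordered for identifiability)
  real<lower=0> sigma;            // shared scale
}
model {
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  for (n in 1:N)
    increment_log_prob(log_mix(lambda,
                               normal_log(y[n], mu[1], sigma),
                               normal_log(y[n], mu[2], sigma)));
}
```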
From Andrew on stan-dev, an example of a model for the problematic posteriors section:
Next I decided, just for fun, to fit a nonlinear model: y = a * exp(-b * x) + c + error, with N(0,sigma) errors. I simulated some data (x taking on the values between 0 and 23, setting a=.5, b=1/10, c=1, sigma=.2), and I fit it constraining all the parameters to be positive. (In retrospect I'm thinking I should've just constrained b and sigma to be positive; constraining a and c made sense in the context of my hypothetical example, but in general it's poor practice to put in hard constraints unless absolutely necessary.)
I simulated data and fit the model and all was fine. But then I repeated and the fit blew up: I was getting some estimates of b that were either really close to 0 (1e-10, etc.) or really high (1e10, etc.), I can't remember which. The point is that there can be instability at the extremes, and I realized the model needed a bit of hand-holding. As a kluge I constrained b to be between 1/50 and 1 (which made sense in the context of the example). A hard constraint was ugly but was easy enough to understand. I didn't want to confuse the class by introducing a normal prior here.
So this second example was a bit of a mess. Which suggests to me that in the manual/book we need to throw in an example to explicitly address instability and constraints. Since, ultimately, this is an inherent difficulty with modeling, it’s not anything to do with Stan in particular.
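A minimal Stan sketch of the model Andrew describes, for that discussion; the names are illustrative, all parameters are constrained positive as in his example, and a lognormal prior on b stands in for the hard 1/50 to 1 interval he used as a kluge.
```
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real<lower=0> a;
  real<lower=0> b;
  real<lower=0> c;
  real<lower=0> sigma;
}
model {
  // lognormal prior pulls b away from both unstable extremes (near 0
  // and very large) without a hard interval constraint; the center is
  // just an illustrative guess at the plausible scale
  b ~ lognormal(log(0.1), 1);
  y ~ normal(a * exp(-b * x) + c, sigma);
}
```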
- [ ] add discussion of K vs. (K-1) coefficient vector parameterizations of multi-logit
Here's what the model should look like, but there's a bug in Stan 2.6 preventing the compilation:
```
data {
  int<lower=2> K;              // num categories
  int<lower=1> D;              // num predictors (without intercept)
  int<lower=0> N;              // num observations
  vector[D] x[N];              // predictors
  int<lower=1,upper=K> y[N];   // observations
}
transformed data {
  vector[1] zero;
  zero[1] <- 0;
}
parameters {
  vector[K-1] alpha;     // intercepts
  matrix[K-1,D] beta;    // slopes
}
model {
  // priors
  alpha ~ normal(0,20);
  to_vector(beta) ~ normal(0,10);
  // likelihood
  for (n in 1:N) {
    y[n] ~ categorical_logit(append_row(zero, alpha + beta * x[n]));
  }
}
```
Here's a program that will work, but isn't optimally efficient:
```
data {
  int<lower=2> K;              // num categories
  int<lower=1> D;              // num predictors (without intercept)
  int<lower=0> N;              // num observations
  vector[D] x[N];              // predictors
  int<lower=1,upper=K> y[N];   // observations
}
transformed data {
  vector[1] zero;
  zero[1] <- 0;
}
parameters {
  vector[K-1] alpha;     // intercepts
  matrix[K-1,D] beta;    // slopes
}
model {
  // temps
  vector[K] lambda;
  lambda[1] <- 0;
  // priors
  alpha ~ normal(0,20);
  // likelihood
  for (n in 1:N) {
    for (k in 2:K)
      lambda[k] <- alpha[k-1] + beta[k-1] * x[n];
    y[n] ~ categorical_logit(lambda);
  }
}
```
- [ ] Add a chapter on using Stan for pure Monte Carlo integration
Computing pi
```
/**
 * Compute the mathematical constant pi by Monte Carlo integration.
 *
 * First, sample (x,y) uniformly from the square with vertices (-1,-1),
 * (-1,1), (1,-1), (1,1) by sampling x and y independently from the
 * interval (-1,1).
 *
 * Second, determine whether (x,y) lies within the unit circle inscribed
 * in the square by testing whether x^2 + y^2 < 1.  The proportion of
 * points inside the circle times the area of the square (which is 4)
 * estimates pi.
 *
 * The posterior mean of the generated quantity pi will compute pi
 * to within MC error.
 *
 * Run with the fixed-parameter algorithm and zero warmup iterations.
 */
model {
}
generated quantities {
  real<lower=-1,upper=1> x;
  real<lower=-1,upper=1> y;
  real<lower=0,upper=4> pi;
  x <- uniform_rng(-1,1);
  y <- uniform_rng(-1,1);
  pi <- 4 * (x^2 + y^2 < 1);
}
```
Generalization to volume of hypersphere
```
/**
 * Compute the volume of a hypersphere of a given radius in a
 * given number of dimensions by Monte Carlo integration.
 *
 * The volume will be the mean of the sampled sphere_volume values.
 */
data {
  int<lower=1> N;      // number of dimensions
  real<lower=0> r;     // radius
}
transformed data {
  real cube_volume;
  cube_volume <- (2 * r)^N;
}
model {
}
generated quantities {
  vector<lower=-r,upper=r>[N] x;
  real<lower=0,upper=cube_volume> sphere_volume;
  // sample a point uniformly in the cube [-r, r]^N
  for (n in 1:N)
    x[n] <- uniform_rng(-r, r);
  // indicator that the point falls inside the sphere of radius r
  sphere_volume <- cube_volume * (dot_self(x) < square(r));
}
```
- [ ] add example of WAIC calculation
- [ ] add example of posterior predictive checks
- [ ] add example of pure cross-validation
- [ ] discuss leave-one-out cross-validation and relate to WAIC (see Aki and Andrew's paper)
- [ ] include an example of a spline model from BDA 3 (I still don't understand the description in the book, though, so hopefully someone else can code an example)
- [ ] add Kalman filter examples (again, this is something I'm not that familiar with, but should be doable in Stan)
- [ ] review all the recommendations w.r.t. Cholesky factorization to bring them up to current best practice in terms of computational and statistical efficiency
- [ ] add a discussion of priors for degrees of freedom
From Aki on stan-users:
I recommend as an easy default option nu ~ gamma(2,0.1);
This was proposed and analysed by Juárez and Steel (2010) (Model-based clustering of non-Gaussian panel data based on skew-t distributions. Journal of Business & Economic Statistics 28, 52–66). Juárez and Steel compare this to the Jeffreys prior and report that the difference is small. Simpson et al. (2014) (arXiv:1403.4630) propose a theoretically well justified "penalised complexity (PC) prior", which they show to behave well for the degrees of freedom, too. The PC prior might be the best choice, but it requires numerical computation of the prior (which could be computed on a grid and interpolated, etc.). It would be feasible to implement it in Stan, but it would require some work. Unfortunately no one has compared the PC prior and this gamma prior directly, but based on discussion with Daniel Simpson, although the PC prior would be better, this gamma(2,0.1) prior is not a bad choice. Thus, I would use it until someone implements the PC prior for the degrees of freedom of the Student's t in Stan.
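For the manual, a minimal sketch of how this default might appear in a Student-t model; everything other than the gamma(2, 0.1) prior on the degrees of freedom is a placeholder.
```
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
  real<lower=0> nu;      // degrees of freedom
}
model {
  nu ~ gamma(2, 0.1);    // Aki's suggested default (Juárez and Steel 2010)
  sigma ~ cauchy(0, 5);
  y ~ student_t(nu, mu, sigma);
}
```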
- [ ] more general discussion of missing data, perhaps following Michael's comment included in the 2.4 manual issue
- [ ] emphasize the role of n_divergent in the manual (I'd really like to rename this, because n = 1 or n = 0, so the name seems misleading)
- [ ] add following advice from Andrew
I’d suggest an informative prior, for example if you think the elasticity should be between 0 and 1, you could use a normal prior with mean 0.5 and sd 0.5. The point is that you’d like to let the data speak but at the same time you want to control your estimate. If you do get a negative elasticity estimate, so be it (sometimes such things can happen due to unforeseen interactions in the data) but then you’ll probably also have a big standard error. And if the data really “want” to find a negative elasticity with a small s.e., you’d like to know this—as, at the very least, this will tell you that there’s something weird about the data. Better to let the data speak than to get an estimate right on the boundary which still doesn’t fit the data.
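A minimal sketch of this advice in Stan, assuming a simple log-log regression in which the slope beta is the elasticity; the variable names are illustrative.
```
data {
  int<lower=0> N;
  vector[N] log_price;
  vector[N] log_quantity;
}
parameters {
  real alpha;
  real beta;             // elasticity: deliberately not hard-constrained to [0, 1]
  real<lower=0> sigma;
}
model {
  // informative prior saying the elasticity is probably between 0 and 1,
  // without ruling out values outside that range
  beta ~ normal(0.5, 0.5);
  log_quantity ~ normal(alpha + beta * log_price, sigma);
}
```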
Originally issue #1431 (from Titus van der Malsburg):
The term local variable is used a lot in the manual but it is difficult to extract a complete definition. For example, section 25.7 says:
At the beginning of each block, local variables may be declared that are scoped over the rest of the statements in the block.
However, since all variables are declared at the beginning of a block, that would mean that all variables are local, which is clearly not the case. On the other hand, the section on variable scope explains that variables declared in one program block are visible in subsequent blocks, except for variables declared in the model block. This section, however, fails to mention that this is not the case for variables declared in nested blocks.
I think it would help if there was one place in the manual that gives the whole story. It may also make sense to use this opportunity to clarify why the term global variable isn't used (only variables in the data block are global).
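One way the manual could illustrate the distinction is with a short sketch like the following: sigma, declared at the top of transformed data, is visible in all later blocks, while temp, declared inside a nested block in the model, is local to that block.
```
transformed data {
  real sigma;        // block-level declaration: visible in all later blocks
  sigma <- 2.5;
}
parameters {
  real mu;
}
model {
  real scaled_mu;    // local to the model block (not visible elsewhere)
  {
    real temp;       // local to this nested block only
    temp <- mu / sigma;
    scaled_mu <- temp;
  }
  // temp is out of scope here; scaled_mu and sigma are still visible
  scaled_mu ~ normal(0, 1);
}
```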
- [ ] general discussion of the efficiency of restarting, so users don't feel like they're losing so much time
Everyone keeps asking about whether Stan can just "keep going" and do more sampling.
You absolutely cannot just keep going. Nor do you want to. The problem is that if you haven't reached convergence, then you want to do more warmup, and then more sampling. So you'd have to toss out the sampling part of the run anyway.
If you run 1000 warmup iterations and 1000 sampling iterations and haven't converged, then you want to go back and run 2000 warmup iterations and 2000 sampling iterations. If you could restart after the original warmup iterations, you'd save around 15% of your work. Here's an example with the default number of iterations (which is probably way too high):
Rerun from beginning, doubling sizes:
```
1000 + 1000
2000 + 2000
-----------
3000 + 3000
```
Restart warmup (warmup hasn't converged):
```
1000 + 1000
   0 + 1000 + 2000
------------------
2000 + 3000
```
The exception to this is when you have converged during warmup, but just haven't drawn enough iterations for the n_eff you need. In that case, you'd get the following.
Resample:
```
1000 + 1000
   0 + 1000
-----------
       3000
```
Restart sampling (warmup converged):
```
1000 + 1000
1000 + 2000
-----------
       5000
```
So you'd save about 40% of your total effort if you could just run more samples.
- [ ] add least-absolute-value regression --- just needs the double_exponential noise distribution
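A minimal sketch of what that example might look like; the regression setup and names are illustrative.
```
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N,K] X;
  vector[N] y;
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  // double-exponential (Laplace) errors: with the implicit flat priors,
  // the posterior mode of alpha and beta is the least-absolute-deviation fit
  y ~ double_exponential(alpha + X * beta, sigma);
}
```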
Andre Pfeuffer asked about this on stan-users:
```
if (theta < -5 || theta > 5)
  increment_log_prob(-1.0e10);
```
I wrote back
Aside from the modeling issue, that statement will, perhaps surprisingly, have absolutely no effect on sampling. The problem is that it's branching on the value of a parameter, then adding a constant to the log density. Constants added to the log density never do anything to sampling. The log density is either lp or lp - 1e10, neither of which is an expression involving theta. It got there via examining theta, but it has no way of exploiting that discontinuity at theta = 5 or theta = -5.
maverickg replied:
I don't get the argument why the if condition does not have an effect on the posterior. It indeed has no way of exploiting the discontinuity at theta = (-)5, but whether the log-posterior is lp or lp - 1e10 does involve the value of theta.
What I first said was wrong.
I was right in that it won't affect the gradient, but was wrong in that it will affect the posterior by changing the slice sampling in NUTS (or the Metropolis step in HMC). It'll have a similarly indirect effect on optimization, having to do with termination criteria.
As to why it doesn't affect the gradients: the generated C++ looks identical to the Stan code, so there's nothing connecting the log probability accumulator back to theta. It's used for the conditional, but not tied into the result.
So if you think of the expression tree, when I do this
```
increment_log_prob(-sigma);
increment_log_prob(-square((y - mu) / sigma));
```
you get an expression that looks like:
```
lp == 0 - sigma - square((y - mu) / sigma);
```
So the derivative w.r.t. sigma and y and mu is defined.
But if you
```
if (theta < -5 || theta > 5)
  increment_log_prob(-1e10);
```
then you have
```
lp == 0 - 1e10
```
or
```
lp == 0
```
and there's no connection back to the parameters. So no gradient.
- Bob
- [ ] add the advice from Ben Goodrich and Krzysztof Sakrejda on stan-users in response to a query from Jonathan Gilligan:
Jonathan:
There have been some great discussions here about how important it is to use pairs() plots to check for sampling problems (bottlenecks, correlations, etc.), but I would love advice about how to work with big multilevel models where there are a small number of top-level hyperparameters characterizing hyperpriors, but the lower levels have one or more priors (each with one or more parameters) for each of a large number of groups, so pairs() plots of all combinations of parameters would require too many panes to be useful or practical.
Ben:
I would say to always include lp__, top-level hyperparameters, and any variance / standard deviation that goes into the likelihood. If that does not point to the source of your problem, then maybe do additional pairs plots with lp__ and a batch of lower-level parameters. High correlations are not a big deal, per se, but can be problematic when combined with changing variances.
Krzysztof:
When choosing which lower-level parameters to plot, it helps to divide them into groups based on the amount of data available.
Following Ben's advice on stan-users:
For scale parameters, I do
- Jeffreys if I know the units of measurement don't matter
- rescaled Exponential if I know the mean
- rescaled Gamma, Generalized Gamma, or Amoroso if I know more than the mean (which is rare)
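The first two of these are one-liners in Stan; here is a minimal sketch, in which prior_mean is a hypothetical data value standing in for the known mean.
```
data {
  real<lower=0> prior_mean;   // hypothetical known prior mean for sigma
}
parameters {
  real<lower=0> sigma;
}
model {
  // rescaled exponential with the given mean (rate = 1 / mean)
  sigma ~ exponential(1 / prior_mean);

  // alternatively, a Jeffreys prior, p(sigma) proportional to 1 / sigma:
  // increment_log_prob(-log(sigma));
}
```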
- [ ] add this advice for scale parameters (assuming Andrew doesn't object)
- [ ] add chapter on reference priors (Jeffreys priors; ditto)
- [ ] add chapter on arithmetic precision
- [ ] add discussion of cdfs and ccdfs and rngs in general to the chapter that includes vectorization and rename it
from #1564:
- [ ] add arguments to sampling statements
From Krzysztof Sakrejda on stan-users:
... slice sampling and MH don't make sense since we have HMC, but ADVI and optimization have different enough properties that they're worth including. I just looked at the Stan manual and it doesn't really address the choice of algorithms in any clear section (maybe it's in the intro in story form), so maybe right before section 1.5, which launches into talking about sampling, there could be a little section about choice of algorithms.
- [ ] add an overview of various algorithms
Comment from @bgoodri on stan-users:
There are many ways of defining a partial correlation. Some include
- The correlation between i and j given all other variables. This is given by inverting the covariance matrix, rescaling it to a correlation matrix, and negating the off-diagonal elements
- The correlation between i and j given all variables (if any) before i, where i < j. These are given by the transformation from a correlation matrix to unconstrained parameters if you do a Fisher transform of the unconstrained parameters. You can do this in R with the unconstrain_pars function.
- The correlation between i and j given all variables (if any) between i and j where i < j. This is not available in Stan but there is code for it in the clusterGeneration R package.
- The correlation between i and j given all variables (if any) after j, where i < j. These partial correlations do not have a name as far as I know, and there is no code for them either.
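For the first definition, a minimal sketch of a user-defined Stan function (the function name is illustrative): invert the covariance matrix, rescale it to a correlation matrix, and negate the off-diagonal elements.
```
functions {
  // partial correlation of each pair (i, j) given all other variables
  matrix partial_correlations(matrix Sigma) {
    matrix[rows(Sigma), cols(Sigma)] P;
    matrix[rows(Sigma), cols(Sigma)] R;
    P <- inverse_spd(Sigma);   // precision matrix
    for (i in 1:rows(Sigma)) {
      for (j in 1:cols(Sigma))
        R[i, j] <- -P[i, j] / sqrt(P[i, i] * P[j, j]);
      R[i, i] <- 1;            // restore the unit diagonal
    }
    return R;
  }
}
```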