
manual longer term

Open bob-carpenter opened this issue 10 years ago • 47 comments

This is the issue for things that would be nice to get into the manual in the future. Rather than closing this issue, we'll just tick things off as part of the regular round of "next manual" issues.

bob-carpenter avatar Jan 22 '15 20:01 bob-carpenter

Originally from Jeffrey Arnold here: https://github.com/stan-dev/stan/issues/1081#issuecomment-63598018

Regarding the Kalman filter examples, I have examples and fully implemented Kalman filtering (of several flavors: with / without missing values, as batch / sequentially) with backward sampling in the generated quantities block here: https://github.com/jrnold/ssmodels-in-stan. I could probably write up that section.

bob-carpenter avatar Jan 22 '15 20:01 bob-carpenter

  • [ ] reparameterize Gaussian process example to use length scale with a proper prior (e.g., half-t(4)).
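For reference, a minimal sketch of what the reparameterized example might look like, with a half-t(4) prior on the length scale (written in current Stan syntax; the data names and the other priors are illustrative assumptions, not the manual's actual example):

data {
  int<lower=1> N;
  array[N] real x;
  vector[N] y;
}
parameters {
  real<lower=0> rho;      // length scale; <lower=0> plus student_t gives a half-t(4) prior
  real<lower=0> alpha;    // marginal standard deviation
  real<lower=0> sigma;    // observation noise
}
model {
  matrix[N, N] K = cov_exp_quad(x, alpha, rho)
                   + diag_matrix(rep_vector(square(sigma), N));
  rho ~ student_t(4, 0, 1);
  alpha ~ normal(0, 1);
  sigma ~ normal(0, 1);
  y ~ multi_normal(rep_vector(0, N), K);
}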

bob-carpenter avatar Jan 22 '15 20:01 bob-carpenter

  • [ ] add horseshoe prior example using Allen's implementation (a generic sketch follows this list):

https://ariddell.org/horseshoe-prior-with-stan.html

  • [ ] explore what happens with L1 priors in Stan's MLE to see if we get truly 0 values or close; I'm not sure what L-BFGS will do
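For the horseshoe item above, here is a generic non-centered sketch of the prior in a linear regression. It is only an illustration and may differ from Allen's implementation; the data names are assumed:

data {
  int<lower=0> N;
  int<lower=0> D;
  matrix[N, D] X;
  vector[N] y;
}
parameters {
  vector[D] z;
  vector<lower=0>[D] lambda;   // local shrinkage
  real<lower=0> tau;           // global shrinkage
  real<lower=0> sigma;
}
transformed parameters {
  vector[D] beta = z .* lambda * tau;
}
model {
  z ~ normal(0, 1);
  lambda ~ cauchy(0, 1);       // half-Cauchy via the lower bound
  tau ~ cauchy(0, 1);
  sigma ~ cauchy(0, 1);
  y ~ normal(X * beta, sigma);
}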

bob-carpenter avatar Jan 22 '15 20:01 bob-carpenter

  • [ ] add section or new chapter about arithmetic precision
  • [ ] include following example I sent to stan-users:

For example, this model is problematic arithmetically:

fit <- stan(model_code = "parameters { real y; } model { y ~ normal(1,1e-20); }")

and produces

          mean  se_mean       sd      2.5%       25% 50% 75% 97.5% n_eff         Rhat
y            1      0.0        0         1         1   1   1     1  4000          Inf
lp__ -15407440 422002.9 26689805 -61629758 -15407440   0   0     0  4000 1.500106e+14

There's really nothing we can do about such models numerically. What you have to do from a user perspective is use a parameterization z = y - 1, which can be handled stably because numbers close to zero can be represented accurately.

parameters { real z; } model { z ~ normal(0,1e-20); }

The problem is that there's still no way to represent y = z + 1 accurately as a double-precision floating-point value (it just rounds back to 1), but you can get close to log(y) by using

  log_y <- log1p(z);
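Putting the two pieces together, a minimal sketch of the reparameterized program (the generated quantity is there only to illustrate recovering log(y)):

parameters {
  real z;                    // z = y - 1
}
model {
  z ~ normal(0, 1e-20);
}
generated quantities {
  real log_y = log1p(z);     // log(y) computed stably from z
}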

bob-carpenter avatar Jan 22 '15 21:01 bob-carpenter

Provide examples of log_mix for the manual mixture section.
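For instance, a minimal sketch of a two-component normal mixture using log_mix (the data names and priors are illustrative):

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real<lower=0, upper=1> lambda;   // mixing proportion
  ordered[2] mu;                   // component means, ordered for identifiability
  vector<lower=0>[2] sigma;        // component scales
}
model {
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  for (n in 1:N)
    target += log_mix(lambda,
                      normal_lpdf(y[n] | mu[1], sigma[1]),
                      normal_lpdf(y[n] | mu[2], sigma[2]));
}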

bob-carpenter avatar Feb 04 '15 21:02 bob-carpenter

From Jeffrey Arnold:

Regarding the Kalman filter examples, I have examples and fully implemented Kalman filtering (of several flavors: with / without missing values, as batch / sequentially) with backward sampling in the generated quantities block here: https://github.com/jrnold/ssmodels-in-stan. I could probably write up that section.

bob-carpenter avatar Feb 04 '15 21:02 bob-carpenter

From Andrew on stan-dev, an example of a model for the problematic posteriors section:

Next I decided, just for fun, to fit a nonlinear model: y = a*exp(-b*x) + c + error, with N(0,sigma) errors. I simulated some data (x taking on the values between 0 and 23), setting a=.5, b=1/10, c=1, sigma=.2, and I fit it constraining all the parameters to be positive. (In retrospect I’m thinking I should’ve just constrained b and sigma to be positive; constraining a and c made sense in the context of my hypothetical example, but in general it’s poor practice to put in hard constraints unless absolutely necessary.)

I simulated data and fit the model and all was fine. But then I repeated the exercise and the fit blew up: I was getting estimates of b that were either really close to 0 (1e-10, etc.) or really high (1e10, etc.), I can’t remember which. The point is that there can be instability at the extremes, and I realized the model needed a bit of hand-holding. As a kluge I constrained b to be between 1/50 and 1 (which made sense in the context of the example). A hard constraint was ugly but was easy enough to understand. I didn’t want to confuse the class by introducing a normal prior here.

So this second example was a bit of a mess. Which suggests to me that in the manual/book we need to throw in an example to explicitly address instability and constraints. Since, ultimately, this is an inherent difficulty with modeling, it’s not anything to do with Stan in particular.
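A minimal Stan sketch of the model described above (this is not Andrew's actual code; the kluge interval on b is written as the hard constraint he describes, and the noise prior is an assumption):

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real<lower=0> a;
  real<lower=0.02, upper=1> b;   // kluge: hard interval constraint on b
  real<lower=0> c;
  real<lower=0> sigma;
}
model {
  y ~ normal(a * exp(-b * x) + c, sigma);
}

The alternative discussed above would be to drop the hard interval on b and use a weakly informative prior instead.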

bob-carpenter avatar Feb 09 '15 17:02 bob-carpenter

  • [ ] add discussion of K vs. (K-1) coefficient vector parameterizations of multi-logit

Here's what the model should look like, but there's a bug in Stan 2.6 preventing the compilation:

data {
  int<lower=2> K;             // num categories
  int<lower=1> D;             // num predictors (without intercept)
  int<lower=0> N;             // num observations
  vector[D] x[N];              // predictors
  int<lower=1,upper=K> y[N];  // observations
}
transformed data {
  vector[1] zero;
  zero[1] <- 0;
}
parameters {
  vector[K-1] alpha;          // intercepts
  matrix[K-1,D] beta;         // slopes
}
model {
  // priors
  alpha ~ normal(0,20);
  to_vector(beta) ~ normal(0,10);
  // likelihood
  for (n in 1:N) {
    y[n] ~ categorical_logit(append_row(zero, alpha + beta * x[n]));
  }  
}

Here's a program that will work, but isn't optimally efficient:

data {
  int<lower=2> K;             // num categories
  int<lower=1> D;             // num predictors (without intercept)
  int<lower=0> N;             // num observations
  vector[D] x[N];             // predictors
  int<lower=1,upper=K> y[N];  // observations
}
transformed data {
  vector[1] zero;
  zero[1] <- 0;
}
parameters {
  vector[K-1] alpha;          // intercepts
  matrix[K-1,D] beta;         // slopes
}
model {
  // temps
  vector[K] lambda;
  lambda[1] <- 0;

  // priors
  alpha ~ normal(0,20);
  to_vector(beta) ~ normal(0,10);
  // likelihood
  for (n in 1:N) {
    for (k in 2:K)
      lambda[k] <- alpha[k-1] + beta[k-1] * x[n];
    y[n] ~ categorical_logit(lambda);
  }
}

bob-carpenter avatar Feb 14 '15 07:02 bob-carpenter

  • [ ] Add a chapter on using Stan for pure Monte Carlo integration
Computing pi
/**
 * Compute the mathematical constant pi by Monte Carlo integration.
 *
 * First, sample (x,y) uniformly from a square with vertices (-1,-1),
 * (-1,1), (1,-1), (1,1) by sampling x and y independently from the
 * interval (-1,1).  
 * 
 * Second, determine if (x,y) lies within the circle inscribed in
 * the square, by testing whether (x^2 + y^2 < 1). Then multiply the
 * proportion of points inside the circle by the area of the square,
 * which is 4.
 *
 * The posterior mean of the generated quantity pi will compute pi
 * to within MC error.
 *
 * Run with fixed parameter algorithm and zero warmup iterations.
 */
model {
}
generated quantities {
  real<lower=-1,upper=1> x;
  real<lower=-1,upper=1> y;
  real<lower=0,upper=4> pi;
  x <- uniform_rng(-1,1);
  y <- uniform_rng(-1,1);
  pi <- 4 * (x^2 + y^2 < 1);
}
Generalization to volume of hypersphere
/**
 * Compute the volume of a hypersphere of a given radius in a
 * given number of dimensions by Monte Carlo integration.
 *
 * The volume will be the mean of the sampled sphere_volume values.
 */
data {
  int<lower=1> N;
  real<lower=0> r;
}
transformed data {
  real cube_volume;
  cube_volume <- (2 * r)^N;
}
model {
}
generated quantities {
  vector<lower=-r, upper=r>[N] x;
  real<lower=0,upper=cube_volume> sphere_volume;

  for (n in 1:N)
    x[n] <- uniform_rng(-r, r);

  sphere_volume <- cube_volume * (dot_self(x) < square(r));
}

bob-carpenter avatar Feb 14 '15 15:02 bob-carpenter

  • [ ] add example of WAIC calculation
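The usual pattern is to save the pointwise log-likelihood in generated quantities; WAIC (or LOO) is then computed from the log_lik draws outside the model. A minimal sketch for a simple linear regression (the names are illustrative):

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
  vector[N] log_lik;
  for (n in 1:N)
    log_lik[n] = normal_lpdf(y[n] | alpha + beta * x[n], sigma);
}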

bob-carpenter avatar Feb 17 '15 03:02 bob-carpenter

  • [ ] add example of posterior predictive checks
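A minimal sketch, again using a simple linear regression (names illustrative): draw replicated data y_rep in generated quantities and compare it to the observed y.

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
  vector[N] y_rep;                 // replicated data for posterior predictive checks
  for (n in 1:N)
    y_rep[n] = normal_rng(alpha + beta * x[n], sigma);
}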

bob-carpenter avatar Feb 17 '15 03:02 bob-carpenter

  • [ ] add example of pure cross-validation
  • [ ] discuss leave-one-out cross-validation and relate to WAIC (see Aki and Andrew's paper)

bob-carpenter avatar Feb 17 '15 03:02 bob-carpenter

  • [ ] include an example of a spline model from BDA 3 (I still don't understand the description in the book, though, so hopefully someone else can code an example)
  • [ ] add Kalman filter examples (again, this is something I'm not that familiar with, but should be doable in Stan)

bob-carpenter avatar Mar 06 '15 06:03 bob-carpenter

  • [ ] review all the recommendations w.r.t. Cholesky factorization to bring them up to current best practice in terms of computational and statistical efficiency
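For reference, the pattern that is now usually recommended is an LKJ prior on a Cholesky factor of the correlation matrix combined with multi_normal_cholesky; a sketch under assumed names and prior scales:

data {
  int<lower=1> N;
  int<lower=1> K;
  array[N] vector[K] y;
}
parameters {
  vector[K] mu;
  cholesky_factor_corr[K] L_Omega;   // Cholesky factor of the correlation matrix
  vector<lower=0>[K] tau;            // scales
}
model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 2.5);
  L_Omega ~ lkj_corr_cholesky(2);
  y ~ multi_normal_cholesky(mu, diag_pre_multiply(tau, L_Omega));
}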

bob-carpenter avatar Mar 06 '15 06:03 bob-carpenter

  • [ ] add a discussion of priors for degrees of freedom

From Aki on stan-users:

I recommend as an easy default option nu ~ gamma(2,0.1);

This was proposed and analysed by Juárez and Steel (2010) (Model-based clustering of non-Gaussian panel data based on skew-t distributions. Journal of Business & Economic Statistics 28, 52–66.). Juárez and Steel compare this to the Jeffreys prior and report that the difference is small. Simpson et al. (2014) (arXiv:1403.4630) propose a theoretically well justified "penalised complexity (PC) prior", which they show to have good behavior for the degrees of freedom, too. The PC prior might be the best choice, but it requires numerical computation of the prior (which could be computed on a grid and interpolated, etc.). It would be feasible to implement it in Stan, but it would require some work. Unfortunately no one has compared the PC prior and this gamma prior directly, but based on the discussion with Daniel Simpson, although the PC prior would be better, this gamma(2,0.1) prior is not a bad choice. Thus, I would use it until someone implements the PC prior for the degrees of freedom of the Student's t in Stan.
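A minimal sketch of this default in a Student-t observation model (the other priors are illustrative assumptions):

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
  real<lower=1> nu;            // degrees of freedom
}
model {
  nu ~ gamma(2, 0.1);          // Juárez and Steel (2010) default
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  y ~ student_t(nu, mu, sigma);
}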

bob-carpenter avatar Mar 06 '15 09:03 bob-carpenter

  • [ ] more general discussion of missing data, perhaps following Michael's comment included in the 2.4 manual issue

bob-carpenter avatar Mar 06 '15 09:03 bob-carpenter

  • [ ] emphasize the role of n_divergent in the manual (I'd really like to rename this, because n=1 or n=0, so the name seems misleading).

bob-carpenter avatar Mar 07 '15 13:03 bob-carpenter

  • [ ] add following advice from Andrew

I’d suggest an informative prior, for example if you think the elasticity should be between 0 and 1, you could use a normal prior with mean 0.5 and sd 0.5. The point is that you’d like to let the data speak but at the same time you want to control your estimate. If you do get a negative elasticity estimate, so be it (sometimes such things can happen due to unforeseen interactions in the data) but then you’ll probably also have a big standard error. And if the data really “want” to find a negative elasticity with a small s.e., you’d like to know this—as, at the very least, this will tell you that there’s something weird about the data. Better to let the data speak than to get an estimate right on the boundary which still doesn’t fit the data.
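In Stan terms, this advice amounts to leaving the parameter unconstrained and expressing the information through the prior rather than through hard bounds. A minimal sketch (the likelihood is omitted and the parameter name is illustrative):

parameters {
  real elasticity;                 // no hard <lower=0, upper=1> constraint
}
model {
  elasticity ~ normal(0.5, 0.5);   // informative prior acts as a soft constraint
  // ... likelihood involving elasticity would go here
}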

bob-carpenter avatar Apr 24 '15 04:04 bob-carpenter

Originally issue #1431 (from Titus van der Malsburg):

The term local variable is used a lot in the manual but it is difficult to extract a complete definition. For example, section 25.7 says:

At the beginning of each block, local variables may be declared that are scoped over the rest of the statements in the block.

However, since all variables are defined at the beginning of a block, that would mean that all variables are local, which is clearly not the case. On the other hand, the section variable scope explains that variables declared in one program block are visible in subsequent blocks except for variables declared in the model block. This section, however, fails to mention that this is not the case for variables declared in nested blocks.

I think it would help if there was one place in the manual that gives the whole story. It may also make sense to use this opportunity to clarify why the term global variable isn't used (only variables in the data block are global).

bob-carpenter avatar Apr 29 '15 05:04 bob-carpenter

General discussion of efficiency of restart so users don't feel like they're losing so much time.

Everyone keeps asking about whether Stan can just "keep going" and do more sampling.

You absolutely cannot just keep going. Nor do you want to. The problem is that if you haven't reached convergence, then you want to do more warmup, and then more sampling. So you'd have to toss out the sampling part of the run anyway.

If you run 1000 warmup iterations and 1000 sampling iterations and haven't converged, then you want to go back and run 2000 warmup iterations and 2000 sampling iterations. If you were to restart, after the original warmup iterations, you'd save around 15% of your work. Here's an example with the default number of iterations (which is probably way too high):

Rerun from beginning, doubling sizes

   1000 + 1000
   2000 + 2000
   -----------
   3000 + 3000

Restart warmup (warmup hasn't converged):

   1000     + 1000
   0 + 1000 + 2000
   -----------
   2000 + 3000

The exception to this is when you have converged during warmup, but just haven't drawn enough iterations for the n_eff you need. In that case, you'd get the following.

Resample (if you could just continue the existing chain):

   1000 + 1000
      0 + 1000
   -----------
   3000

Restart sampling (warmup converged)

   1000 + 1000
   1000 + 2000
   -----------
   5000

So you'd save about 40% of your total effort if you could just run more samples.

bob-carpenter avatar Jun 16 '15 20:06 bob-carpenter

  • [ ] Add least-absolute-value regression --- just needs a double-exponential (Laplace) noise distribution
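A minimal sketch of what such an example might look like for a simple linear regression (names are illustrative); with flat priors, maximizing this likelihood in the regression coefficients is equivalent to least-absolute-deviation estimation:

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ double_exponential(alpha + beta * x, sigma);
}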

bob-carpenter avatar Jun 26 '15 17:06 bob-carpenter

Andre Pfeuffer asked about this on stan-users:

if (theta < -5 || theta > 5)
  increment_log_prob(-1.0e10);

I wrote back

Aside from the modeling issue, that statement will, perhaps surprisingly, have absolutely no effect on sampling. The problem is that it's branching on the value of a parameter, then adding a constant to the log density. Constants added to the log density never do anything to sampling. The log density is either lp or lp - 1e10, neither of which is an expression involving theta. It got there via examining theta, but it has no way of exploiting that discontinuity at theta = 5 or theta = -5.

bob-carpenter avatar Jul 06 '15 16:07 bob-carpenter

I don't get the argument why the if condition does not have an effect on the posterior. It indeed has no way of exploiting the discontinuity at theta = (-)5, but whether the log-posterior is lp or lp - 1e10 does involve the value of theta.

Andre Pfeuffer asked about this on stan-users: ``` if (theta < -5 || theta > 5) increment_log_prob(-1.0e10); ```

I wrote back

Aside from the modeling issue, that statement will, perhaps surprisingly, have absolutely no effect on sampling. The problem is that it's branching on the value of a parameter, then adding a constant to the log density. Constants added to the log density never do anything to sampling. The log density is either lp or lp - 1e10, neither of which is an expression involving theta. It got there via examining theta, but it has no way of exploiting that discontinuity at theta = 5 or theta = -5.

maverickg avatar Jul 06 '15 18:07 maverickg

What I first said was wrong.

I was right in that it won't affect the gradient, but was wrong in that it will affect the posterior by changing the slice sampling in NUTS (or the Metropolis step in HMC). It'll have a similarly indirect effect on optimization having to do with termination criteria.

As to why it doesn't affect the gradients: the generated C++ looks just like the Stan code, so there's nothing connecting the log probability accumulator back to theta. Theta is used for the conditional, but not tied into the result.

So if you think of the expression tree, when I do this

increment_log_prob(-sigma);
increment_log_prob(-square((y - mu) / sigma));

you get an expression that looks like:

lp == 0 - sigma - square((y - mu) / sigma);

So the derivative w.r.t. sigma and y and mu is defined.

But if you

if (theta < -5 || theta > 5) increment_log_prob(-1e10);

then you have

lp == 0 - 1e10

or

lp == 0

and there's no connection back to the parameters. So no gradient.

  • Bob


bob-carpenter avatar Jul 06 '15 18:07 bob-carpenter

Add the advice from Ben Goodrich and Krzysztof Sakrejda on stan-users in response to a query from Jonathan Gilligan:

Jonathan:

There have been some great discussions here about how important it is to use pairs() plots to check for sampling problems (bottlenecks, correlations, etc.), but I would love advice about how to work with big multilevel models where there are a small number of top-level hyperparameters characterizing hyperpriors, but the lower levels have one or more priors (each with one or more parameters) for each of a large number of groups, so pairs() plots of all combinations of parameters would require too many panes to be useful or practical.

Ben:

I would say to always include lp__, top-level hyperparameters, and any variance / standard deviation that goes into the likelihood. If that does not point to the source of your problem, then maybe do additional pairs plots with lp__ and a batch of lower-level parameters. High correlations are not a big deal, per se, but can be problematic when combined with changing variances.

Krzysztof

When choosing which lower-level parameters to plot it helps to divide them into groups based on the amount of data available.

bob-carpenter avatar Jul 22 '15 18:07 bob-carpenter

Following Ben's advice on stan-users:

For scale parameters, I do

  • Jeffreys if I know the units of measurement don't matter
  • rescaled Exponential if I know the mean (see the sketch after this list)
  • rescaled Gamma, Generalized Gamma, or Amoroso if I know more than the mean (which is rare)
  • [ ] add this advice for scale parameters (assuming Andrew doesn't object)
  • [ ] add chapter on reference priors (Jeffreys priors; ditto)
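For instance, the rescaled exponential option could look like this sketch, where m is a hypothetical known prior mean for the scale:

data {
  real<lower=0> m;               // assumed prior mean of the scale
}
parameters {
  real<lower=0> sigma;
}
model {
  sigma ~ exponential(1 / m);    // exponential rescaled so E[sigma] = m
}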

bob-carpenter avatar Jul 23 '15 20:07 bob-carpenter

  • [ ] add chapter on arithmetic precision
  • [ ] add discussion of cdfs and ccdfs and rngs in general to the chapter that includes vectorization and rename it

bob-carpenter avatar Aug 11 '15 17:08 bob-carpenter

from #1564:

  • [ ] add arguments to sampling statements

bob-carpenter avatar Aug 14 '15 21:08 bob-carpenter

From Krzysztof Sakrejda on stan-users:

... slice sampling and MH don't make sense since we have HMC but ADVI and optimization have different enough properties that they're worth including. I just looked at the Stan manual and it doesn't really address the choice of algorithms in any clear section (maybe it's in the intro in story form) so maybe right before section 1.5 which launches into talking about sampling there could be a little section about choice of algorithms.
  • [ ] add an overview of various algorithms

bob-carpenter avatar Dec 28 '15 20:12 bob-carpenter

Comment from @bgoodri on stan-users:

There are many ways of defining a partial correlation. Some include

  • The correlation between i and j given all other variables. This is given by inverting the covariance matrix, rescaling it to a correlation matrix, and negating the off-diagonal elements (see the sketch after this list)
  • The correlation between i and j given all variables (if any) before i where i < j. These are given by the transformation from a correlation matrix to unconstrained parameters if you do a Fisher transform of the unconstrained parameters. You can do this in R with the unconstrained_pars function.
  • The correlation between i and j given all variables (if any) between i and j where i < j. This is not available in Stan but there is code for it in the clusterGeneration R package.
  • The correlation between i and j given all variables (if any) after j where i < j. These partial correlations do not have a name as far as I know, and there is no code for them either.
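A sketch of the first definition as a Stan function (not from the thread; it just spells out the invert-rescale-negate recipe):

functions {
  matrix partial_correlations(matrix Sigma) {
    int K = rows(Sigma);
    matrix[K, K] P = inverse_spd(Sigma);   // precision matrix
    matrix[K, K] R;
    for (i in 1:K) {
      for (j in 1:K) {
        if (i == j)
          R[i, j] = 1;
        else
          R[i, j] = -P[i, j] / sqrt(P[i, i] * P[j, j]);
      }
    }
    return R;
  }
}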

bob-carpenter avatar Mar 07 '16 20:03 bob-carpenter