Possibly inefficient Stan code generation for group-level coefficients for factors with many levels
Hi there, I attempted to use brms to run a simple model as follows:
fit1.brms <- brm(formula = lval ~ -1 + gridfac + (gridfac | office),
                 data = data.brms,
                 family = gaussian(),
                 warmup = 1000,
                 iter = 2000,
                 chains = 4,
                 cores = 4)
Here gridfac is a factor variable with 204 levels, office is a factor variable with 63 levels, and there are about 800,000 observations. OK, I know it's a very large correlation structure, but ...
The Stan code generated by brms refers to 204 data vectors Z_1_n (n = 1 to 204), presumably model-matrix columns containing indicator variables for the gridfac levels, and the model block includes the line
mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + ... + r_1_204[J_1[n]] * Z_1_204[n];
This model runs extremely slowly and ultimately gives out-of-memory errors.
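To make the pattern concrete, here is a sketch of what the generated code looks like, reduced to 2 gridfac levels so it fits here (the real code has 204 Z_1_n vectors and r_1_n effects; names follow the brms conventions above, but this is a sketch, not the verbatim output):

data {
  int<lower=1> N;                  // number of observations
  vector[N] Y;                     // response variable
  int<lower=1> N_1;                // number of offices (grouping levels)
  int<lower=1, upper=N_1> J_1[N];  // office index per observation
  vector[N] Z_1_1;                 // indicator column for gridfac level 1
  vector[N] Z_1_2;                 // indicator column for gridfac level 2
}
parameters {
  vector[2] b;                     // population-level effects
  real<lower=0> sigma;             // residual SD
  vector<lower=0>[2] sd_1;         // group-level standard deviations
  matrix[2, N_1] z_1;              // standardized group-level effects
  cholesky_factor_corr[2] L_1;     // cholesky factor of correlation matrix
}
transformed parameters {
  matrix[N_1, 2] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
  vector[N_1] r_1_1 = r_1[, 1];    // office effects for gridfac level 1
  vector[N_1] r_1_2 = r_1[, 2];    // office effects for gridfac level 2
}
model {
  vector[N] mu = Z_1_1 * b[1] + Z_1_2 * b[2];
  for (n in 1:N) {
    // dense accumulation: every Z_1_k[n] enters the autodiff graph,
    // even though all but one of them is zero for any given n
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
  }
  target += student_t_lpdf(sigma | 3, 0, 10)
            - 1 * student_t_lccdf(0 | 3, 0, 10);
  target += student_t_lpdf(sd_1 | 3, 0, 10)
            - 2 * student_t_lccdf(0 | 3, 0, 10);
  target += normal_lpdf(to_vector(z_1) | 0, 1);
  target += lkj_corr_cholesky_lpdf(L_1 | 1);
  target += normal_lpdf(Y | mu, sigma);
}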
A theoretically identical model coded by hand uses indexing rather than a model matrix, and is as follows:
data {
  int<lower=0> N;                            // number of data points
  int<lower=0> N_offices;                    // number of offices
  int<lower=0> N_gridfac;                    // number of gridfacs
  vector[N] Y;                               // response variable
  int<lower=1, upper=N_offices> office[N];   // office index
  int<lower=1, upper=N_gridfac> gridfac[N];  // gridfac index
}
parameters {
  vector[N_gridfac] b;                  // population-level effects
  real<lower=0> sigma;                  // residual SD
  vector<lower=0>[N_gridfac] sd_1;      // group-level standard deviations
  matrix[N_gridfac, N_offices] z_1;     // standardized group-level effects
  cholesky_factor_corr[N_gridfac] L_1;  // cholesky factor of correlation matrix
}
transformed parameters {
  // actual group-level effects
  matrix[N_gridfac, N_offices] r_1 = diag_pre_multiply(sd_1, L_1) * z_1;
}
model {
  // initialize linear predictor term
  vector[N] mu = b[gridfac];
  for (n in 1:N) {
    // add more terms to the linear predictor
    mu[n] += r_1[gridfac[n], office[n]];
  }
  // priors including all constants
  target += student_t_lpdf(sigma | 3, 0, 10)
            - 1 * student_t_lccdf(0 | 3, 0, 10);
  target += student_t_lpdf(sd_1 | 3, 0, 10)
            - N_gridfac * student_t_lccdf(0 | 3, 0, 10);
  target += normal_lpdf(to_vector(z_1) | 0, 1);
  target += lkj_corr_cholesky_lpdf(L_1 | 1);
  // likelihood including all constants
  target += normal_lpdf(Y | mu, sigma);
  // put very wide priors on b to speed things up
  b ~ normal(0, 10);
}
With the exception of the prior on b, this is the same model as the auto-generated code, but it runs much faster and with minimal memory use. However, I did remove the generated quantities block, which might improve speed and memory use somewhat.
I would have thought that, while the memory requirement for the extra Z_1_n variables is large, it would not crash my machine (64 GB of RAM). My guess is that the big sum mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + ... + r_1_204[J_1[n]] * Z_1_204[n] generates a large amount of unnecessary work for Stan's autodifferentiation engine (most terms are identically zero) -- but this is just a guess.
Is this an opportunity to re-engineer the code generator's handling of factors, or is this kind of thing unavoidable for other reasons of generality? (Or are my speed increases phantoms caused by the prior and dropping generated quantities? I can do some more tests if that's a possibility.)
You are right that for this particular example the brms code is inefficient. However, the brms code needs to be general enough to work for all possible varying effects, that is, also for continuous predictors and for multiple predictors. The solution I use in brms is, to my current knowledge, the most efficient general solution. If you have a proposal for a more efficient but still general solution, I would be very happy to hear it!
It may be theoretically possible to detect such special cases and adjust the Stan code accordingly, but this adds to the already high maintenance burden of brms and would take time to implement reliably, which I currently don't have. Perhaps with the implementation of efficient sparse matrices in Stan we will find a way to make this more efficient while staying general. Let's see how that goes. I will leave this issue open for now in case someone has ideas or wants to jump in and implement this special case.
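To sketch the direction: Stan already has a built-in compressed-row sparse product, csr_matrix_times_vector(), so in principle the group-level term could be expressed roughly as below. This is only an illustration with made-up data names, it drops the correlation structure entirely, and whether the current sparse implementation is actually efficient under autodiff is exactly what remains to be seen:

data {
  int<lower=1> N;               // number of observations
  int<lower=1> K;               // flattened gridfac-by-office coefficients
  vector[N] Y;                  // response variable
  int<lower=0> NZ;              // number of nonzeros in the sparse design matrix
  vector[NZ] w;                 // CSR values (here: ones)
  int<lower=1, upper=K> v[NZ];  // CSR column indices
  int<lower=1> u[N + 1];        // CSR row-start pointers
}
parameters {
  vector[K] r;                  // flattened group-level effects
  real<lower=0> sigma;          // residual SD
}
model {
  // one sparse matrix-vector product replaces the 204-term dense sum
  vector[N] mu = csr_matrix_times_vector(N, K, w, v, u, r);
  r ~ normal(0, 1);             // placeholder prior; no correlations here
  sigma ~ student_t(3, 0, 10);
  Y ~ normal(mu, sigma);
}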
Side note: Be aware that brms uses dummy coding by default for gridfac, so you should write lval ~ -1 + gridfac + (-1 + gridfac | office) for exact equivalence of the code.
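That is, the call becomes (reusing your original arguments):

fit1.brms <- brm(formula = lval ~ -1 + gridfac + (-1 + gridfac | office),
                 data = data.brms,
                 family = gaussian(),
                 warmup = 1000,
                 iter = 2000,
                 chains = 4,
                 cores = 4)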
Thanks Paul, and thanks for your work on this very cool package. If I have time, I will take a look at the code and see whether a simple approach via indexing is something I might be able to implement. Also, I see that you opened another issue regarding the big sum and multiple indexing; is it your sense that the inefficiency in this case is caused by the memory footprint of the extra vectors, or by the autodifferentiation of the big sum (presumably not addressed by using a dot product)?
It is indeed because of the many unnecessary + and * operations and the associated autodiff computations, so it is nothing that can be fixed by dot products or similar, only by incorporating specific knowledge about the data into the model structure.
Closing this issue for now to reduce the load of the issue tracker. If someone has time and wants to work on this, feel free to write here and I am happy to reopen.