rstanarm
rstanarm copied to clipboard
Do weights work in `stan_glm()`?
Summary:
I'm trying to use the weights
argument to stan_glm()
and stan_glmer()
but it does not appear to affect the fit.
The same data in lm()
or glm()
works as desired.
Reproducible Steps:
Load the attached data weights_test.txt:
df <- read_csv("weights_test.txt")
head(df)
x | y |
---|---|
1 | 0.0 |
1 | 0.0 |
0 | 387.0 |
0 | 309.6 |
0 | 0.0 |
0 | 154.8 |
Then fit the model to all the data:
> stan_glm(y ~ x, data=df, iter=500)
Median MAD_SD
(Intercept) 95.7 14.0
x 38.5 20.0
Now create a weighted dataset:
df_weighted <- df %>%
group_by(x,y) %>%
summarize(n=n())
head(df_weighted)
x | y | n |
---|---|---|
0 | 0.00 | 435 |
0 | 25.80 | 3 |
0 | 30.96 | 2 |
0 | 51.59 | 4 |
0 | 77.39 | 5 |
0 | 103.19 | 16 |
Then fit the weighted data:
>stan_glm(y ~ x, data=df_weighted, weights=n, iter=500)
Median MAD_SD
(Intercept) 779.2 167.0
x -72.7 228.0
This is very similar to what I get with df_weighted
but not specifying any weights
. On the other hand using lm()
or glm()
, the weighted fit is the same as the original fit.
RStanARM Version:
2.19.2
R Version:
3.6.1
Operating System:
Debian Linux (Stretch)
I think that is what is supposed to happen.
Huh, can you say more? Maybe I wasn't clear. The weighted dataset only contains the unique observations x,y
and an additional n
indicating the number of such observations in the original dataset.
So I think weighting unique observations should lead to the same likelihood as the original data (perhaps the priors are different, not sure if weighting is taken into account in the default priors) and so the same results? I'm getting really different results. With lm()
the fit to weighted unique observations and fit to original data are identical.
Another hint to me that it's not working is that removing the weights from
stan_glm(y ~ x, data=df_weighted, weights=n, iter=500)
to
stan_glm(y ~ x, data=df_weighted, iter=500)
gives essentially equivalent results
The Stan code distinguishes between the has_weights=1
and has_weights=0
cases, as in
https://github.com/stan-dev/rstanarm/blob/master/src/stan_files/continuous.stan#L203
So, I would expect a dataset that has been collapsed to the unique values but has a count for the number of times that row occurred, to give the same results as the uncollapsed dataset.
That's my expectation as well but it's not the result I'm getting
Above df
is uncollapsed, df_weighted
is collapsed
@bgoodri when you get a chance can you take a closer look? I'm pretty sure that the behavior you are saying should happen is not happening. Thanks
I will but I'm not sure when. Both rstan and Columbia are in a state of flux ATM.
On Mon, Mar 9, 2020 at 2:46 PM Eric Potash [email protected] wrote:
@bgoodri https://github.com/bgoodri when you get a chance can you take a closer look? I'm pretty sure that the behavior you are saying should happen is not happening. Thanks
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/stan-dev/rstanarm/issues/422?email_source=notifications&email_token=AAZ2XKSRH4Q75QHRRKUTI5DRGVBRLA5CNFSM4K66AJ62YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEOISWDA#issuecomment-596716300, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZ2XKVZMMZLTNQNYLJTBQTRGVBRLANCNFSM4K66AJ6Q .
Got it thanks for the update. Hope the flux gets stabilized
Btw I added a fake group to the data and can confirm that weights in stan_glmer()
have the desired result, i.e. uncollapsed and collapsed weighted fits agree, collapsed unweighted fit is v different
I too have found that weights did not change my estimates in stan_glm
though the weights in glm
changed my estimates substantially. I tried looking at the Stan file written by stan_glm
and here are the relevant lines:
54-55:
// declares has_weights, weights, has_offset, offset
#include /data/weights_offset.stan
I'm not sure how to access weights_offset.stan.
246-255:
else { // weighted log-likelihoods
vector[N] summands;
if (family == 1) summands = pw_gauss(y, eta, aux, link);
else if (family == 2) summands = pw_gamma(y, eta, aux, link);
else if (family == 3) summands = pw_inv_gaussian(y, eta, aux, link, log_y, sqrt_y);
else if (family == 4 && link_phi == 0) summands = pw_beta(y, eta, aux, link);
else if (family == 4 && link_phi > 0) summands = pw_beta_z(y, eta, eta_z, link, link_phi);
target += dot_product(weights, summands);
}
}
So if weights are specified the there is a dot product with the summands, but it isn't clear where that is assigned to and the weights aren't assigned again. All of the unweighted log-likelihoods (lines 203-246, not shown here) have a form of target += distribution_lpdf(y | ...)
. Maybe this is missing that bit?
I think the weights still aren't correct, at least for the Gaussian case. Ignoring the link and calling $w$ the weights vector, the weighted likelihood adds to target
:
summands = -0.5 * log(6.283185307179586232 * sigma) - 0.5 * square((y - eta) / sigma);
dot_product(weights, summands);
or:
$$w^\top \left(-\frac{1}{2}\log2\pi\sigma - \frac{1}{2}\left(\frac{y - \eta}{\sigma}\right)^2\right)$$
If $Y_i \sim N(\eta_i, \sigma / \sqrt{w_i})$, we would expect:
$$\sum_{i=1}^N \left(-\frac{1}{2}\log2\pi\frac{\sigma^2}{w_i} -\frac{1}{2}w_i\left(\frac{y_i - \eta_i}{\sigma}\right)^2\right)$$
or vectorized:
$$-\frac{N}{2}\log2\pi\sigma^2 + \frac{1}{2}\sum_{i=1}^N\log w_i - \frac{1}{2}w^\top\left(\frac{y - \eta}{\sigma}\right)^2$$
Edit:
Or, if you just want to multiply the log-likelihood, put $\sigma^2$ under the square root.