
Do weights work in `stan_glm()`?


Summary:

I'm trying to use the weights argument to stan_glm() and stan_glmer(), but it does not appear to affect the fit.

The same weighted data passed to lm() or glm() gives the expected results.

Reproducible Steps:

Load the attached data weights_test.txt:

library(readr)

df <- read_csv("weights_test.txt")
head(df)
x     y
1   0.0
1   0.0
0 387.0
0 309.6
0   0.0
0 154.8

Then fit the model to all the data:

> stan_glm(y ~ x, data=df, iter=500)
            Median MAD_SD
(Intercept) 95.7   14.0  
x           38.5   20.0  

Now create a weighted dataset:

library(dplyr)

df_weighted <- df %>%
  group_by(x, y) %>%
  summarize(n = n())
head(df_weighted)
x      y   n
0   0.00 435
0  25.80   3
0  30.96   2
0  51.59   4
0  77.39   5
0 103.19  16

Then fit the weighted data:

> stan_glm(y ~ x, data=df_weighted, weights=n, iter=500)
            Median MAD_SD
(Intercept) 779.2  167.0 
x           -72.7  228.0 

This is very similar to what I get from df_weighted without specifying any weights at all. With lm() or glm(), on the other hand, the weighted fit is identical to the original fit.
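
For reference, a minimal sketch of that lm() check, using df and df_weighted as defined above (fit_full and fit_wtd are just illustrative names):

# With lm(), frequency weights on the collapsed data reproduce the
# coefficients of the fit to the full, uncollapsed data.
fit_full <- lm(y ~ x, data = df)
fit_wtd  <- lm(y ~ x, data = df_weighted, weights = n)

coef(fit_full)
coef(fit_wtd)
all.equal(coef(fit_full), coef(fit_wtd))  # TRUE, up to floating point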

RStanARM Version:

2.19.2

R Version:

3.6.1

Operating System:

Debian Linux (Stretch)

potash avatar Feb 29 '20 20:02 potash

I think that is what is supposed to happen.

bgoodri avatar Mar 03 '20 19:03 bgoodri

Huh, can you say more? Maybe I wasn't clear: the weighted dataset contains only the unique (x, y) observations, plus a column n giving the number of times each one occurred in the original dataset.

So weighting the unique observations should produce the same likelihood as the original data, and therefore the same results (up to the priors; I'm not sure whether weighting is taken into account in the default priors). Instead, I'm getting very different results. With lm(), the fit to the weighted unique observations and the fit to the original data are identical.
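
To spell out the argument: if the collapsed data consist of the unique rows $(x_u, y_u)$ with counts $n_u$, then for any parameters $\theta$

$$\sum_{i=1}^{N} \log p(y_i \mid x_i, \theta) \;=\; \sum_{u} n_u \log p(y_u \mid x_u, \theta),$$

so with the same prior the two posteriors should coincide.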

potash avatar Mar 03 '20 21:03 potash

Another hint to me that it's not working is that changing

stan_glm(y ~ x, data=df_weighted, weights=n, iter=500)

to

stan_glm(y ~ x, data=df_weighted, iter=500)

gives essentially the same results.

potash avatar Mar 03 '20 21:03 potash

The Stan code distinguishes between the has_weights=1 and has_weights=0 cases, as in

https://github.com/stan-dev/rstanarm/blob/master/src/stan_files/continuous.stan#L203

So I would expect a dataset that has been collapsed to the unique values, with a count of the number of times each row occurred, to give the same results as the uncollapsed dataset.

bgoodri avatar Mar 04 '20 00:03 bgoodri

That's my expectation as well, but it's not the result I'm getting.

potash avatar Mar 04 '20 00:03 potash

Above, df is the uncollapsed dataset and df_weighted is the collapsed one.

potash avatar Mar 04 '20 00:03 potash

@bgoodri, when you get a chance, can you take a closer look? I'm pretty sure that the behavior you say should happen is not happening. Thanks!

potash avatar Mar 09 '20 18:03 potash

I will, but I'm not sure when. Both rstan and Columbia are in a state of flux at the moment.


bgoodri avatar Mar 09 '20 18:03 bgoodri

Got it, thanks for the update. Hope the flux stabilizes.

potash avatar Mar 09 '20 19:03 potash

By the way, I added a fake group to the data and can confirm that weights in stan_glmer() have the desired effect: the uncollapsed and collapsed weighted fits agree, while the collapsed unweighted fit is very different.
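
Roughly what the check looks like (a sketch, not the exact code I ran; the grouping variable g and its two arbitrary levels exist only to route the model through stan_glmer()):

library(rstanarm)
library(dplyr)

# Add an arbitrary grouping variable so stan_glmer() can be used.
set.seed(42)
df$g <- factor(sample(1:2, nrow(df), replace = TRUE))

# Collapse within the fake groups so the weights respect them.
df_weighted <- df %>%
  group_by(g, x, y) %>%
  summarize(n = n(), .groups = "drop")

fit_full <- stan_glmer(y ~ x + (1 | g), data = df, iter = 500)
fit_wtd  <- stan_glmer(y ~ x + (1 | g), data = df_weighted,
                       weights = n, iter = 500)

# The posterior summaries agree up to MCMC error, unlike with stan_glm().
print(fit_full)
print(fit_wtd)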

potash avatar Mar 09 '20 21:03 potash

I too have found that weights did not change my estimates in stan_glm(), though the same weights in glm() changed my estimates substantially. I tried looking at the Stan file written by stan_glm(); here are the relevant lines. Lines 54-55:

// declares has_weights, weights, has_offset, offset
#include /data/weights_offset.stan

I'm not sure how to access weights_offset.stan.

Lines 246-255:

    else { // weighted log-likelihoods
      vector[N] summands;
      if (family == 1) summands = pw_gauss(y, eta, aux, link);
      else if (family == 2) summands = pw_gamma(y, eta, aux, link);
      else if (family == 3) summands = pw_inv_gaussian(y, eta, aux, link, log_y, sqrt_y);
      else if (family == 4 && link_phi == 0) summands = pw_beta(y, eta, aux, link);
      else if (family == 4 && link_phi > 0) summands = pw_beta_z(y, eta, eta_z, link, link_phi);
      target += dot_product(weights, summands);
    }
  }

So if weights are specified, the dot product of the weights with the summands is added to target, and the weights aren't used anywhere else. All of the unweighted log-likelihoods (lines 203-246, not shown here) have the form target += distribution_lpdf(y | ...). Maybe the weighted branch is missing part of what those compute?
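
To make the two branches concrete, here is a rough R transcription with toy values; pw_gauss here is my stand-in for the pointwise Gaussian log-density in the Stan file (its exact form is quoted in the next comment):

# Toy values standing in for one iteration's state.
y     <- c(0, 25.8, 30.96)
eta   <- c(10, 20, 30)
w     <- c(435, 3, 2)
sigma <- 50

# Unweighted branch: target += normal_lpdf(y | eta, sigma)
target_unweighted <- sum(dnorm(y, mean = eta, sd = sigma, log = TRUE))

# Weighted branch: target += dot_product(weights, summands)
pw_gauss <- function(y, eta, sigma) {
  -0.5 * log(2 * pi * sigma) - 0.5 * ((y - eta) / sigma)^2
}
target_weighted <- sum(w * pw_gauss(y, eta, sigma))

# If summands matched dnorm(..., log = TRUE) exactly, the weighted branch
# would equal the log-likelihood of each row replicated w[i] times:
sum(w * dnorm(y, mean = eta, sd = sigma, log = TRUE))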

salauer avatar Apr 11 '22 18:04 salauer

I think the weights still aren't correct, at least for the Gaussian case. Ignoring the link and calling $w$ the weights vector, the weighted likelihood adds to target:

summands = -0.5 * log(6.283185307179586232 * sigma) - 0.5 * square((y - eta) / sigma);
dot_product(weights, summands);

or:

$$w^\top \left(-\frac{1}{2}\log2\pi\sigma - \frac{1}{2}\left(\frac{y - \eta}{\sigma}\right)^2\right)$$

If $Y_i \sim N(\eta_i, \sigma / \sqrt{w_i})$, we would expect:

$$\sum_{i=1}^N \left(-\frac{1}{2}\log2\pi\frac{\sigma^2}{w_i} -\frac{1}{2}w_i\left(\frac{y_i - \eta_i}{\sigma}\right)^2\right)$$

or vectorized:

$$-\frac{N}{2}\log2\pi\sigma^2 + \frac{1}{2}\sum_{i=1}^N\log w_i - \frac{1}{2}w^\top\left(\frac{y - \eta}{\sigma}\right)^2$$

Edit:

Or, if you just want to multiply the pointwise log-likelihood by the weights, put $\sigma^2$ (rather than $\sigma$) under the square root in the normalizing constant $1/\sqrt{2\pi\sigma^2}$, i.e. use $-\frac{1}{2}\log 2\pi\sigma^2$ in the summands.
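
A quick numeric check of that discrepancy, with toy values:

y     <- c(0, 25.8, 30.96, 51.59)
eta   <- rep(20, 4)
w     <- c(435, 3, 2, 4)
sigma <- 50

# Summand as in the Stan code: log(2*pi*sigma), not log(2*pi*sigma^2).
stan_summands <- -0.5 * log(2 * pi * sigma) - 0.5 * ((y - eta) / sigma)^2
sum(w * stan_summands)

# Weight times the correct Gaussian log-density (sigma^2 inside the log):
sum(w * dnorm(y, mean = eta, sd = sigma, log = TRUE))

# The difference is 0.5 * sum(w) * log(sigma), which depends on sigma and
# therefore distorts the posterior for sigma rather than dropping out:
sum(w * stan_summands) - sum(w * dnorm(y, eta, sigma, log = TRUE))
0.5 * sum(w) * log(sigma)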

vdorie avatar Jun 07 '22 19:06 vdorie