DHARMa icon indicating copy to clipboard operation
DHARMa copied to clipboard

Handling of models fitted with weights

Open florianhartig opened this issue 7 years ago • 5 comments

A long-standing problem that I want to solve with the next release is the handling of the weights. The problem is the following:

Depending on the family, weights are either included in the simulations, or not. In some cases, they can't be included, because weights on the likelihood in a Poisson don't have a corresponding data-generating model. In some cases (like lmer), it seems to me that they were accidentally forgotten?

As DHARMa relies on the fact that the simulate() function simulates from the assumed likelihood, it is a quite essential information if weights are included in simulate or not (and ideally, they would be, as far as possible). Another problem is that models don't consistently warn if weights are not included in the simulation.

I think the current situation is the following

  • lm includes weights in simulations
  • glm includes weights for gaussian and binomial, but not for poisson
  • lme4 does not include weights in lmer, does not warn. Does not include in glmer, except binomial, but does warn. glmmTMB - weights are multiplied with the log likelihood, except for the binomial. Simulations are only reliable for binomial? glmmTMB does not give a warning in the simulation https://github.com/florianhartig/DHARMa/issues/146 spaMM - weights are / can only be included for distribution with variable dispersion, and control the dispersion -> simulations should be reliable?

Example

library(DHARMa)
library(lme4)
library(glmmTMB)
library(spamMM)

testData = createData(sampleSize = 200, overdispersion = 0, family = gaussian())

# lm weights are considered in simulate()
fit <- lm(observedResponse ~ Environment1 , weights = Environment1 +1, 
            data = testData)
simulateResiduals(fittedModel = fit, plot = T)

# glm gaussian still the same
fit <- glm(observedResponse ~ Environment1 , weights = Environment1 +1, 
          data = testData)
simulateResiduals(fittedModel = fit, plot = T)

# glm behaves nice, throws a warning that simulate ignores weights for poisson
fit <- glm(ceiling(exp(observedResponse)) ~ Environment1 , weights = Environment1 +1, 
           data = testData, family = "poisson")
simulateResiduals(fittedModel = fit, plot = T)

# lmer behaves bad, doesn't include weights even for Gaussian, but doesn't warn either
fit <- lmer(observedResponse ~ Environment1 + (1|group), weights = Environment1 +1, 
            data = testData)
simulateResiduals(fittedModel = fit, plot = T)

# glmer warns
fit <- glmer(ceiling(exp(observedResponse)) ~ Environment1 + (1|group), weights = Environment1 +1, 
            data = testData, family = "poisson")
simulateResiduals(fittedModel = fit, plot = T)

# glmmTMB does not warn

fit <- glmmTMB(ceiling(exp(observedResponse)) ~ Environment1 + (1|group), weights = Environment1 +1, 
             data = testData, family = "poisson")
simulateResiduals(fittedModel = fit, plot = T)

# spaMM does not warn, but seems to be simulating with correct (heteroskedastic) variance. How exactly 
fit <- HLfit(ceiling(exp(observedResponse)) ~ Environment1 + (1|group), prior.weights = Environment1 +1, 
               data = testData, family=poisson())
summary(fit)
simulateResiduals(fittedModel = fit, plot = T)

florianhartig avatar Nov 17 '16 16:11 florianhartig

Hello - thanks for creating this package,

I have been reviewing this issue of the weights in the package, I don't know if it is possible to think about the issue of weights, taking into account that in some datasets the weights represent repetitions of the same observation, that is, frequency_weights. Would it be possible in this case to multiply the residuals by the weights and thus deal with this problem?

naveranoc avatar Sep 01 '22 04:09 naveranoc

Hello Naveranoc,

which GLM structure / package are you talking about? If this is about binomial k/n, with weight = n, this should be handled correctly by DHARMa.

In general, as you point out, the problem is that the weight argument is overloaded with var_weights and freq_weights in many packages.

Best, F

florianhartig avatar Sep 01 '22 07:09 florianhartig

Hello @florianhartig,

I am referring to cases, for example, in packages like glmmtmb or pscl (especially in the cases of hurdle or zero inflated models), where weights can greatly simplify estimates and reduce the computation time of Dharma simulations. For example, in a real problem I have almost 10 million records, but by checking the observations that are the same (using the frequency_weights) I can reduce the observations to almost 100,000 records.

naveranoc avatar Sep 01 '22 14:09 naveranoc

Hi Naveranoc,

hmm ... OK, the way I understand it, weights are acting purely on the likelihood in glmmTMB, so if you have 2x the same observation, you could provide it as one observation, set weights = 2 in glmmTMB that should result in the same fit.

I suspect that if you test a correctly specified model with DHARMa for this situation, residuals should still be fine. The only issue is that there will be a different number of observations behind each residual. I'm not sure if this could create any spurious correlations in the residuals; however, it would certainly cost power.

Would you maybe have a simple example that I could use as a test case?

florianhartig avatar Sep 08 '22 15:09 florianhartig

Hi @florianhartig

Thinking about what you propose, I have created the following simulated example and I have generated the following simulated example. `library(glmmTMB) library(tidyverse)

set.seed(123L)

size_data = 10000

count_intercept = 3 zero_intercept = 2 betas <- tibble(category = c("a","b","c","d"), beta_conteo = c(1,2,3,4), beta_zero = c(0.1,0.2,0.3,0.4) ) %>% mutate(mu_cont = exp(if_else(category == "a", count_intercept+beta_conteo, beta_conteo)), p0 = exp(if_else(category == "a", zero_intercept+beta_zero, beta_zero))/(1+exp(if_else(category == "a", zero_intercept, beta_zero))) ) individual_data <- tibble(category = sample(size = size_data,x = c("a","b","c","d"),replace = T)) %>% left_join(betas, by = "category") %>% mutate(y = VGAM::rzipois(n = size_data,lambda = mu_cont,pstr0 = p0))

grouped_data <- individual_data %>% group_by(category,y) %>% summarise(freq_weights = n(),.groups = "drop")

glmm_zip_individual <- glmmTMB(formula = y ~ category, family = poisson(link = "log"), zilink = stats::make.link("logit"), ziformula = ~category,
data = individual_data)

glmm_zip_grouped <- glmmTMB(formula = y ~ category, family = poisson(link = "log"), zilink = stats::make.link("logit"), ziformula = ~category, weights = freq_weights,
data = grouped_data)

fixef(glmm_zip_individual) fixef(glmm_zip_grouped) `

Thinking about this, I have thought that those observations that are repeated and that have an equal weight, the problem would be that although they have the same prediction, if I ran a simulation on that record, I would obtain n different residual values. They could be somehow close to the one observed with the prediction. Could it work in an empirical function that also uses the weights for the corresponding calculation?

thank you very much for your help.

naveranoc avatar Sep 09 '22 20:09 naveranoc