
Are qgam p-values correct for small sample sizes?

Open florianhartig opened this issue 4 years ago • 15 comments

I'm posting here a question Ben Bolker sent me via email regarding my DHARMa presentation at ISEC 2020:

it feels like the qgam tests are asymptotic, so might be systematically off at small sample sizes?

## FIXME: try with changing sample/group sizes
## FIXME: speed up by using refit() (i.e., fit the model once and refit()
##     rather than fitting every time)? Do we need to use simulate()
##     to get realizations from the same set of predictor vars each time?
## FIXME:
library(DHARMa)  # createData, simulateResiduals, recalculateResiduals, testQuantiles
library(lme4)    # glmer

testfun <- function(ss = 200, family = poisson(), ...) {
    testData <- createData(sampleSize = ss, family = family, ...)
    fitted <- glmer(observedResponse ~ Environment1 + (1|group),
                    family = "poisson", data = testData)
    res <- simulateResiduals(fitted)
    res2 <- recalculateResiduals(res, group = testData$group)
    return(testQuantiles(res2, plot = FALSE)$p.value)
}

set.seed(101)
options(warn = 2)  # promote warnings to errors so problem fits abort
pvals <- plyr::raply(250, testfun(), .progress = "text")

png("phist.png")
par(las = 1, mfrow = c(1, 2))  # both histograms on one page
hist(pvals, breaks = 100)
abline(h = 250/100, col = 2)  # expected count per bin if p-values were uniform
hist(log10(pvals), xlim = c(-100, 0), breaks = 100)
dev.off()

florianhartig avatar Jun 22 '20 17:06 florianhartig

Hi Ben, I will forward this to @mfasiolo (developer of qgam), but I don't think qgam is particularly sensitive to small sample sizes.

Matteo, for context: in my presentation the aggregated residuals were flagged as significant although H0 was true, and I said that this was a random event. I played around with it a few times; sometimes it was significant, sometimes it wasn't, but I'm not sure the rate is exactly the 5% it should be.

There is, however, always the alternative explanation that something is going on with the random effects: there is shrinkage in the estimation of the REs, and by grouping the residuals according to the RE groups, we might be picking up on this effect.

I would usually have preferred to group by some other grouping variable, but this was what I had in my data.

florianhartig avatar Jun 22 '20 17:06 florianhartig

Hi Florian, do the p-values used here:

testQuantiles(res2,plot=FALSE)$p.value

come from qgam? If so they are based on the same asymptotic Gaussian approximation used for standard GAMs. Of course, asymptotic approximations work better with big samples, but it's difficult to say more without performing a thorough simulation study.
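
A minimal sketch of what such a check could look like, applying qgam directly to null data at a small sample size (this is illustrative only, not from the thread; it assumes the gam-style summary()$s.table of a qgam fit):

library(qgam)

set.seed(1)
pv <- replicate(200, {
  # H0 is true: no relationship between y and x
  d <- data.frame(x = runif(25), y = rnorm(25))
  fit <- qgam(y ~ s(x), qu = 0.5, data = d)
  # asymptotic p-value of the smooth term, as in summary.gam
  summary(fit)$s.table[1, "p-value"]
})
mean(pv < 0.05)  # should be close to 0.05 if the approximation holds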

mfasiolo avatar Jun 23 '20 10:06 mfasiolo

Yes, they do. I calculate the qgam p-values for the 3 quantiles and then correct for multiple testing, see here: https://github.com/florianhartig/DHARMa/blob/5faa7a6d47f9b18f9854d5aba54b4abb603eacd4/DHARMa/R/tests.R#L121
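
A minimal sketch of that scheme (not DHARMa's actual implementation, see the linked tests.R for that; the helper name and the adjustment method are illustrative):

library(qgam)

# hypothetical helper: pred = model predictions, res = DHARMa scaled residuals
quantilePvals <- function(pred, res, quantiles = c(0.25, 0.5, 0.75)) {
  d <- data.frame(pred = pred, res = res)
  pvals <- vapply(quantiles, function(q) {
    fit <- qgam(res ~ s(pred), qu = q, data = d)
    # asymptotic p-value for the smooth, i.e. for deviation from a flat line
    summary(fit)$s.table[1, "p-value"]
  }, numeric(1))
  p.adjust(pvals, method = "holm")  # correct for testing three quantiles
}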

This approach has worked excellently for me so far, btw., and I have been very happy with it. OK, but I guess it boils down to running some simulations and checking when the approximation breaks down.

As said, in my simulations so far I didn't see an indication that qgam would have elevated Type I error rates. Thanks for your thoughts!

florianhartig avatar Jun 23 '20 11:06 florianhartig

@bbolker provided some code to exemplify the (possible?) problem, see below.

library(glmmTMB)
library(DHARMa) ## version 0.3.3.0
library(bbmle)  ## for AICctab

## data from Dobson and Barnett

dd <- structure(list(year = c(1984L, 1984L, 1984L, 1984L, 1985L, 1985L, 
1985L, 1985L, 1986L, 1986L, 1986L, 1986L, 1987L, 1987L, 1987L, 
1987L, 1988L, 1988L, 1988L, 1988L), quarter = c(1L, 2L, 3L, 4L, 
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L
), cases = c(1L, 6L, 16L, 23L, 27L, 39L, 31L, 30L, 43L, 51L, 
63L, 70L, 88L, 97L, 91L, 104L, 110L, 113L, 149L, 159L), date = c(1984, 
1984.25, 1984.5, 1984.75, 1985, 1985.25, 1985.5, 1985.75, 1986, 
1986.25, 1986.5, 1986.75, 1987, 1987.25, 1987.5, 1987.75, 1988, 
1988.25, 1988.5, 1988.75)), class = "data.frame", row.names = c(NA, 
-20L))

dd <- transform(dd,
                time=year-min(year)+(quarter-1)/4)

mpois <- glm(cases~poly(time,2), family=poisson, data=dd)
sum(residuals(mpois,"pearson")^2)/df.residual(mpois)  ## 1.65 - questionable

mnb2_mass <- MASS::glm.nb(cases~poly(time,2), data=dd)
## ridiculous theta estimate (implies -> poisson)

mnb2_g <- glmmTMB(cases~poly(time,2), data=dd, family=nbinom2)
## also crazy overdisp -> Poisson

mnb1_g <- update(mnb2_g, family=nbinom1)

AICctab(mpois,mnb2_mass,mnb2_g,mnb1_g,logLik=TRUE)  ## in fact, mpois is still best

## standard diagnostics look not-too-bad
op <- par(mfrow=c(2,2))
plot(mpois)
par(op)

## crazy resid-vs-predict problems
plot(simulateResiduals(mpois))
plot(simulateResiduals(mnb1_g))

[image: DHARMa residual plots for mpois and mnb1_g]

florianhartig avatar Mar 10 '21 17:03 florianhartig

OK, so if I fit the same data with a gam, the residuals look fine. Note that simulations from family = nb are not yet implemented in mgcv, so we can only test the Poisson model.

I tend to think DHARMa / qgam is right: there is probably a slight deviation of the functional response from a 2nd-order polynomial, and because qgam realizes that a straight line does not fit, but there are very few data points, it gets this crazy shape.

library(mgcv)

mgam <- gam(cases~s(time), family=poisson, data=dd)
plot(mgam)
res <- simulateResiduals(mgam, plot = T)

# can't test this with DHARMa directly because nb simulations are not
# implemented in mgcv yet (but see the workaround sketch below)
mgam <- gam(cases~s(time), family=nb, data=dd)
plot(mgam)
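
Since nb simulations are missing in mgcv, one possible workaround (a sketch, not tested in the thread) is to draw from the fitted negative binomial by hand and pass the simulations to DHARMa via createDHARMa; extracting theta via family$getTheta(TRUE) assumes mgcv's nb family interface:

mgam_nb <- gam(cases ~ s(time), family = nb, data = dd)
mu    <- predict(mgam_nb, type = "response")
theta <- mgam_nb$family$getTheta(TRUE)  # estimated theta, response scale

# simulate 250 nb datasets from the fitted model
sims <- replicate(250, rnbinom(length(mu), mu = mu, size = theta))

res_nb <- createDHARMa(simulatedResponse = sims,
                       observedResponse = dd$cases,
                       fittedPredictedResponse = mu,
                       integerResponse = TRUE)
plot(res_nb)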

[image: gam fit and DHARMa residual plot for mgam]

florianhartig avatar Mar 10 '21 17:03 florianhartig

p.s.: as we fit against time, it's probably good to test for temporal autocorrelation as well, but this test was negative:

testTemporalAutocorrelation(res)

[image: testTemporalAutocorrelation output for res]

florianhartig avatar Mar 10 '21 17:03 florianhartig

I'm less worried about the p-values than the apparent undersmoothing. Maybe it would help to pass method="REML" in argGam ... ??

bbolker avatar Mar 10 '21 18:03 bbolker

Hi Ben, do you mean in qgam (for the residuals) or in the gam?

florianhartig avatar Mar 10 '21 18:03 florianhartig

In qgam for the residuals of the quadratic-Poisson GLM fit.

bbolker avatar Mar 10 '21 18:03 bbolker

OK, I have made some further simulations (see below). Regardless of my previous comments (i.e., that I don't think it's unreasonable that a gam fits better here), it is true that the simulation for a similar case below shows around 15% type I error, so I tend to agree that qgam may be overly sensitive for small n.

I think I have to investigate this more systematically (and with more time), including the REML option. @mfasiolo - do you have any further ideas here?

returnStatistics <- function(control = 0){
  testData = createData(sampleSize = 20, family = poisson(), overdispersion = control, randomEffectVariance = 0)
  fittedModel <- glm(observedResponse ~ Environment1, data = testData, family = poisson())
  res <- simulateResiduals(fittedModel = fittedModel, n = 250)
  out <- c(testQuantiles(res, plot = FALSE)$p.value, testUniformity(res, plot = FALSE)$p.value)
  return(out)
}

# testing a single return
returnStatistics()

# running benchmark
out = runBenchmarks(returnStatistics, nRep = 500)
plot(out)

florianhartig avatar Mar 10 '21 18:03 florianhartig

I'm not sure this really belongs here, but following my previous example: (1) argGam=list(method="REML") doesn't appear to do anything (perhaps it is already being used?); (2) ranking vs. not ranking the predictor makes a big difference; (3) setting k=7 vs. k=8 makes a big difference (k=8 and k=10 are basically equivalent).

[image: qgam.png — quantile regression fits with different k and with/without ranking]

s1 <- simulateResiduals(mpois)
s2 <- simulateResiduals(mnb1_g)

## reconstructed setup (assumed, not in the original snippet): fpr = fitted
## predictor, sr = DHARMa scaled residuals, rfpr = rank-transformed predictor
fpr  <- s1$fittedPredictedResponse
sr   <- s1$scaledResiduals
rfpr <- rank(fpr, ties.method = "average")/length(fpr)
qd   <- data.frame(fpr, rfpr, sr)

png("qgam.png")
par(mfrow=c(2,2),las=1)
plotResiduals(s1)
plotResiduals(s1, rank=TRUE)
plot(fpr,sr)
q5 <- qgam::qgam(sr ~ s(fpr, k = 10), qu = 0.5, data=qd)
lines(fpr,predict(q5))
plot(rfpr,sr)
q5s <- qgam::qgam(sr ~ s(rfpr, k = 10), qu = 0.5, data=qd)
q5sa <- qgam::qgam(sr ~ s(rfpr, k = 7), qu = 0.5, data=qd)
lines(rfpr,predict(q5s))
lines(rfpr,predict(q5sa),col=4)
text(0.8,0.5,"k=7",col=4)
text(0.6,0.7,"k=10")
dev.off()

bbolker avatar Mar 10 '21 20:03 bbolker

Hello everyone,

Looking at the default DHARMa plot, I noticed that it looks a little bit like a cosine function. So I've looked into the original model, and the raw residuals seem to fit really well to a cosine with period 2. This poses a bit of a problem, because it is on the linear scale, not the log scale of the original model, but including the cosine in the model turns out to be quite significant:

mpois <- glm(cases~poly(time,2), family=poisson, data=dd)
sum(residuals(mpois,"pearson")^2)/df.residual(mpois)  ## 1.65 - questionable

plot(dd[, c("time", "cases")], main = "model")
lines(dd$time, predict(mpois, type = "response"))

plot(dd$time, dd$cases -  predict(mpois, type = "response"), main = "raw residuals")

# data for fitting on the raw residuals
y <- dd$cases -  predict(mpois, type = "response")
t <- dd$time
# eyeballed period of the cosine
per <- 2
reslm <- lm(y ~ cos(2*pi/per*t))
summary(reslm)
plot(t, y, main = "cosine with period 2 through raw residuals")
lines(t, predict(reslm))

mpois_cos <- glm(cases~poly(time,2) + cos(2*pi/per*time), family=poisson, data=dd)
# cosine is significant
summary(mpois_cos)
# still wild residuals, but no longer significant
simulateResiduals(mpois_cos, plot = TRUE)
# excellent dispersion
sum(residuals(mpois_cos,"pearson")^2)/df.residual(mpois_cos)


# identity link, since the cosine "exists" on the raw residual scale, not on the log scale
mpois_cos_id <- glm(cases~poly(time,2) + cos(2*pi/per*time), family=poisson(link = "identity"), data=dd)
# still everything significant
summary(mpois_cos_id)
# nice residuals, but underdispersed, see below
simulateResiduals(mpois_cos_id, plot = TRUE)
# substantially underdispersed 0.33
sum(residuals(mpois_cos_id,"pearson")^2)/df.residual(mpois_cos_id)

# no idea what this tells me
AICctab(mpois, mpois_cos, mpois_cos_id, logLik=TRUE)

I don't know what to make of this. Maybe the autocorrelation test should have picked up on it; it does have a p-value < 0.1, so ... Of course, Florian's simulation does show that testQuantiles produces too many significant p-values, but this seems like a bad example if there really is a trend in the residuals.

Istalan avatar Mar 10 '21 22:03 Istalan

To return to the temporal autocorrelation: the test actually does pick up on the trend if it is provided with the underlying time variable.

s1 <- simulateResiduals(mpois)
testTemporalAutocorrelation(s1, time = dd$time)

[image: temporal autocorrelation test with time = dd$time]

Istalan avatar Mar 11 '21 09:03 Istalan

Whoops, @Istalan, thanks for that, a classic error from me (classic in the sense that I realised from user questions that people forget to specify the time all the time).

OK, after this experience, I think I should remove the current behaviour of those functions (which only issue a message and use a random time ordering when no time is provided) and throw a real error instead.
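
A hypothetical sketch of that change (the check and the message are illustrative, not actual DHARMa code):

# inside testTemporalAutocorrelation (hypothetical sketch)
if (is.null(time)) {
  stop("testTemporalAutocorrelation requires an explicit time argument; ",
       "testing against a random time ordering is not meaningful.")
}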

@bbolker - but then this might also be a nice story for your course, right? Weird residuals explained by periodic signal?

florianhartig avatar Mar 11 '21 09:03 florianhartig

@bbolker, some additional, more fundamental thoughts: regardless of the specific (possible) type I error and smoothing settings in qgam, I have been contemplating for quite some time how much "visual" feedback I should give users about the residual patterns. In principle, we could also just provide the residuals and let people do the smoothing themselves (in their heads). It would be an interesting experiment, though, to ask your students in the course to interpret different options for the residual plots:

a) residuals ~ pred scatter plot without any further guidance

b) residuals ~ pred with loess smoother (DHARMa option quantreg = F)

c) residuals ~ pred with quantreg, but no significance (this was the case in previous DHARMa versions, would have to do this by hand now)

d) residuals ~ pred with quantreg with significance (DHARMa default)

And see how their interpretations differ. As I said, my experience has been that options a-c do not work very well, because for small n you get more scatter and wiggliness in the gam, and people have a hard time factoring in that for small n this scatter / these wiggles are often not significantly different from random. This is basically to say: humans (in my experience) also seem to have strongly inflated type I error rates for small n in this type of task.

Thus, I have found the move to d) an overall improvement for the type I error rate. This is of course only a gut feeling.

For your course, if you are unhappy with the qgam results, maybe a pragmatic solution is to set quantreg = F?
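
For example, with the model from above (plotResiduals accepts the quantreg switch):

res <- simulateResiduals(mpois)
# loess smoother instead of the qgam-based quantile regression + test
plotResiduals(res, quantreg = FALSE)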

florianhartig avatar Mar 11 '21 09:03 florianhartig