gratia icon indicating copy to clipboard operation
gratia copied to clipboard

Pointwise intervals using gratia vs predict.gam discrepancy

Open dshelldhillon opened this issue 4 years ago • 3 comments

It seems the confint() method in {gratia} isn't producing the correct predictions in a multiple time series GAM with group level smoothers for each factor and a random intercept for each group.

When simulating the data, I've simulated one series to have the same relationship with the response, but a higher average response as compared to the other one.

The predictions from confint() seem not to be taking in the random intercepts into account.

library(tidyverse)
library(mgcv)
#> Loading required package: nlme
#> 
#> Attaching package: 'nlme'
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse
#> This is mgcv 1.8-27. For overview type 'help("mgcv-package")'.
library(gratia)
library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test



###  Create two time series with the same data generating process, but spiked with noise to create artificial differences  

x1 <- arima.sim(n = 1e3, list(ar = 0.4), rand.gen = function(n, ...) rnorm(n, 0,1)) 
x2 <- arima.sim(n = 1e3, list(ar = 0.4), rand.gen = function(n, ...) rnorm(n, 0,1))


## spike in noise to create a differnece in the two time series   

f1 <- x1 + 0.5*x1^2 + 0.8*x1^3 + rnorm(1e3,12,2)

f1 <- cbind(f1, x1) %>%
        data.frame() %>%
          mutate(sample = rep("x1",nrow(.))) %>%
            rename(y = f1, time = x1)

f2 <- x2 + 0.5*x2^2 + 0.8*x2^3 + rnorm(1e3,0,2)

f2 <- cbind(f2, x2) %>%
        data.frame() %>%
          mutate(sample = rep("x2",nrow(.))) %>%
            rename(y = f2, time = x2)

## Create the dataframe  
df <- bind_rows(f1, f2) %>%
  mutate(sample = as_factor(sample))


### Plot the raw data  

  df %>%
  ggplot(., aes(time, y, group = sample)) + geom_line(aes(color = sample))



### Fit the model  

sim_gam <- mgcv::gam(y ~ s(time, by = sample) + s(sample, bs = "re", k = 2), data = df,
                     method = "REML")


### Enter gratia - calculate pointwise confidence intervals  

grid <- expand.grid(time = seq(-3,3, by = 0.01), sample = levels(df$sample))

confint(sim_gam,parm = "time", newdata = grid, unconditional = T, type = "confidence") %>%
  ggplot(.,aes(time, est, group = sample)) + geom_line(aes(color = sample)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.5) + labs(title = "Pointwise from {Gratia}")



### Below is the code to calculate the same intervals manually   

pred_list <- predict.gam(sim_gam, newdata = grid, unconditional = T, se.fit = T)

preds <- pred_list$fit %>%
  tibble() %>% 
  janitor::clean_names() %>%
  rename(fit = x)

cis <- pred_list$se.fit %>%
  tibble() %>%
  janitor::clean_names() %>%
  rename(se = x)

bind_cols(preds, cis, grid) %>%
  mutate(lower = fit - 2*se, upper = fit + 2*se) %>%
  ggplot(.,aes(time, fit, group = sample)) + geom_line(aes(color = sample)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.5) + labs(title = "Pointwise by predict.gam")



sessionInfo()
#> R version 3.5.3 (2019-03-11)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 16299)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] janitor_1.2.0   gratia_0.2-11   mgcv_1.8-27     nlme_3.1-137   
#>  [5] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.2    
#>  [9] readr_1.3.1     tidyr_1.0.0     tibble_2.1.3    ggplot2_3.2.1  
#> [13] tidyverse_1.2.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_0.2.5 xfun_0.7         splines_3.5.3    haven_2.1.0     
#>  [5] lattice_0.20-38  snakecase_0.11.0 colorspace_1.4-1 vctrs_0.2.0     
#>  [9] generics_0.0.2   htmltools_0.3.6  yaml_2.2.0       rlang_0.4.0     
#> [13] pillar_1.4.2     glue_1.3.1       withr_2.1.2      modelr_0.1.4    
#> [17] readxl_1.3.1     lifecycle_0.1.0  munsell_0.5.0    gtable_0.3.0    
#> [21] cellranger_1.1.0 rvest_0.3.4      mvnfast_0.2.5    evaluate_0.14   
#> [25] labeling_0.3     knitr_1.23       highr_0.8        broom_0.5.2     
#> [29] Rcpp_1.0.2       scales_1.0.0     backports_1.1.4  jsonlite_1.6    
#> [33] hms_0.4.2        digest_0.6.21    stringi_1.4.3    cowplot_1.0.0   
#> [37] grid_3.5.3       cli_1.1.0        tools_3.5.3      magrittr_1.5    
#> [41] lazyeval_0.2.2   crayon_1.3.4     pkgconfig_2.0.3  zeallot_0.1.0   
#> [45] ellipsis_0.3.0   Matrix_1.2-15    xml2_1.2.0       lubridate_1.7.4 
#> [49] assertthat_0.2.1 rmarkdown_1.13   httr_1.4.0       R6_2.4.0        
#> [53] compiler_3.5.3

Created on 2019-10-04 by the reprex package (v0.3.0)

dshelldhillon avatar Oct 04 '19 18:10 dshelldhillon

This is probably some conflict with selecting terms (should be smooths); does it work if you don't specify parm?

I'll try to take a look later.

gavinsimpson avatar Oct 04 '19 19:10 gavinsimpson

Thank you! No - I tried leaving out parm but it throws an error.

Error in add_s(parm) : argument "parm" is missing, with no default

I tested it using the 0.2 -8 version as well, but the same thing.

Appreciate all your effort on this wonderful library!

dshelldhillon avatar Oct 04 '19 19:10 dshelldhillon

Having had another think about this, gratia:::confint.gam is doing the right thing. confint methods are traditionally for parameters of a model, not predictions from the model. As factor by smooths are centred, the confidence interval on the smooths (in your example, if I follow how you created it) should be essentially the same; they'd be different if the underlying polynomials were different.

What you want is a confidence interval on the combinations of the smooth plus the parametric terms involved in the by factor smooth. That's a reasonable request but not something for confint() given the precedence of what this function has traditionally been used for. I'll have a think about how to best provide what you want; from v 0.3.0 onward gratia hasfitted_samples() and predicted_samples() which could be used to do (almost) what you want, but they only know how to work with the full model specification not selected parts of it. More generally what you want is something similar to fitted_samples() but with an exclude option sensu predict.gam()'s exclude argument which can set some terms to zero (or perhaps terms to select only the ones you want...)

gavinsimpson avatar Jan 23 '20 04:01 gavinsimpson

I think this is now quite doable using fitted_values():

ds <- data_slice(sim_gam, time = seq(-3,3, by = 0.01), sample = evenly(sample))
fv <- fitted_values(sim_gam, data = ds)

fv |>
  ggplot(aes(x = time, y = .fitted, group = sample)) +
    geom_ribbon(aes(ymin = .lower_ci, ymax = .upper_ci, fill = sample), alpha = 0.2) +
    geom_line(aes(colour = sample))

gratia-issue-54

As this isn't something I would ever expect confint() to do as that's not it's intended use, I'm going to close this.

gavinsimpson avatar Mar 13 '24 21:03 gavinsimpson