brms
Cross-validation of CAR models cannot be done with stratified folds
Hi there Paul -- first off, thanks so much for all your work! I appreciate the improvements to the cross-validation options, but I came across an issue :confused:
I am trying to implement "block" cross-validation of a CAR model by sequentially holding out multiple regions in each fold of the CV. However, this gives the error message: "Error: Cannot handle new locations in CAR models." I cannot work out from the traceback what is going on.
Here is a minimal example:
library(brms)
adjmat <- rbind('a'=c(0,1,0,0),
region <- sample(letters[1:4], 100, replace = TRUE)
x <- sample(1:10, 100, replace = TRUE)
y <- 2 * x + 5 + rnorm(100, 0, 2) + as.numeric(factor(region))
dat <- data.frame(region, x, y)
fit <- brm(formula = y ~ x, data = dat, family = 'gaussian', autocor = cor_car(adjmat, formula = ~ 1|region),
chains = 2, iter = 500, warmup = 250)
# Random CV (does not throw error)
cv_random <- kfold(fit, K = 4, Ksub = as.array(1), chains = 2, iter = 500, warmup = 250, save_fits = TRUE, group = NULL, seed = 111)
# Blocked CV (throws error)
cv_blocked <- kfold(fit, K = 4, Ksub = as.array(1), chains = 2, iter = 500, warmup = 250, save_fits = TRUE, group = 'region', seed = 111)
# Also throws error if specifying stratified folds
cv_blocked <- kfold(fit, K = 4, Ksub = as.array(1), folds = 'stratified', chains = 2, iter = 500, warmup = 250, save_fits = TRUE, group = 'region', seed = 111)
The problem is that CAR structures cannot handle new locations (regions, in your case) that were absent when the model was fitted but appear in the data being predicted. Stratified folds should work if you have enough observations per region so that each region is present in every fold. I will take a closer look at your example to see what's wrong.
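To make the "each region is present in every fold" condition concrete, here is a minimal base R sketch that checks it. The fold assignment below is a hypothetical hand-rolled stratification, not the one `kfold` builds internally, and `fold_id` and `new_regions` are illustrative names.

```r
# Simulated region labels mirroring the example above.
set.seed(111)
region <- sample(letters[1:4], 100, replace = TRUE)

# Hypothetical stratified assignment: deal fold ids 1..K within each region,
# so every region is spread across the folds.
K <- 4
fold_id <- integer(length(region))
for (r in unique(region)) {
  idx <- which(region == r)
  fold_id[idx] <- sample(rep_len(seq_len(K), length(idx)))
}

# For each fold, list regions that occur in the held-out rows
# but not in the rows used for refitting.
new_regions <- lapply(seq_len(K), function(k) {
  setdiff(unique(region[fold_id == k]), unique(region[fold_id != k]))
})

# TRUE means no fold would confront the refitted model with an unseen region.
all(lengths(new_regions) == 0)
```

If any element of `new_regions` is non-empty, that fold contains a location the refitted model has never seen, which is the situation the error message describes.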
Thanks for getting back to me! I had previously constructed the holdout datasets so that each region is present in every fold, but a reviewer argued that this does not correctly address spatial dependence and gives artificially low RMSE. So I wanted to address the comment by holding out entire regions one by one. I guess what you are saying is that the CAR structure inherently cannot support that?
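For contrast, a small sketch of what holding out entire regions does to the training data. This is plain base R with hypothetical names (`fold_id`, `train_counts`, `unseen`); it only illustrates the fold structure and does not call brms.

```r
set.seed(111)
region <- sample(letters[1:4], 100, replace = TRUE)

# One fold per region: fold k holds out every observation from the k-th region.
fold_id <- as.integer(factor(region))
K <- max(fold_id)

# Rows: folds; columns: regions; entries: number of training observations.
train_counts <- t(sapply(seq_len(K), function(k) {
  table(factor(region[fold_id != k], levels = sort(unique(region))))
}))
rownames(train_counts) <- paste0("fold", seq_len(K))

# Regions with zero training observations in each fold.
unseen <- apply(train_counts, 1, function(n) names(n)[n == 0])
unseen
```

Every fold has exactly one region with no training observations, so each refit is asked to predict for a location outside its CAR structure.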
Not in the current implementation. I don't know enough about CAR models to tell you whether it is possible in general. If you can point me to some ideas for how to do this, I am happy to take a look and see if we can make it possible eventually.
Thanks so much! The paper cited by the reviewer was this one by Roberts et al.: I will try to track down some more ideas.
An additional resource which describes prediction in CAR models is a paper by Ver Hoef et al. that came out last year in Ecological Monographs: The appendix of that paper has the algorithm used for prediction. There is also a GitHub repo that goes with the paper: I think their methods can be used to generate fitted values for the regions that are missing from the holdout datasets.
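As a rough illustration of why prediction for a held-out region is possible in principle: under a proper CAR prior, the effect of one region given its neighbours has a known conditional distribution, with mean `alpha * sum_j(w_ij * phi_j) / d_i` and variance `1 / (tau * d_i)`, where `d_i` is the number of neighbours. The sketch below is not the algorithm from the Ver Hoef et al. appendix and not something brms provides. The function `car_conditional`, the adjacency rows for b, c and d, and the values of `phi`, `alpha` and `tau` are all made up for illustration.

```r
# Hypothetical symmetric adjacency matrix for four regions in a chain a-b-c-d.
# Only the first row matches the original post; the rest is assumed.
W <- rbind(a = c(0, 1, 0, 0),
           b = c(1, 0, 1, 0),
           c = c(0, 1, 0, 1),
           d = c(0, 0, 1, 0))
colnames(W) <- rownames(W)

# Conditional mean and sd of one region's spatial effect given the others,
# under a proper CAR prior with dependence alpha and precision tau.
car_conditional <- function(W, phi, target, alpha = 1, tau = 1) {
  w <- W[target, ]
  d <- sum(w)                      # number of neighbours of the target
  stopifnot(d > 0)                 # an isolated region has no neighbours to borrow from
  others <- setdiff(names(w), target)
  list(mean = alpha * sum(w[others] * phi[others]) / d,
       sd   = 1 / sqrt(tau * d))
}

# Made-up spatial effects for the regions that were in the training data.
phi <- c(a = 0.5, c = -0.3, d = 1.2)
pred_b <- car_conditional(W, phi, target = "b", alpha = 0.9, tau = 4)
pred_b
```

In a Bayesian fit one would presumably apply this to each posterior draw of the neighbouring effects, `alpha` and `tau`, rather than to point values.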
Thanks! I will take a look, but I don't know when I will be able to implement any satisfying solution.