
Issues with generated datasets of ordinal data for a cumulative probit model when running compute_SBC()

Open aftmnelbier11 opened this issue 11 months ago • 5 comments

I am encountering errors regarding the priors when running compute_SBC() with generated datasets for a cumulative probit model. A bit of info about the data structure: I have an ordinal outcome variable with 7 levels. There are 61 responses per dataset.

It seems the error stems from some generated datasets not containing all response levels. Importantly, I have set the intercept priors individually, so when one level isn't included in the generated dataset, the error is thrown.

Is there a way to specify that all levels are generated in each simulation? Alternatively, would it make sense to increase the number of generated datasets and drop the simulations that throw this error?

Thanks in advance!

The error (which I get for multiple simulations and sometimes for multiple intercepts):

Simulation 3 resulted in error when fitting.

Error: The following priors do not correspond to any model parameter: 
Intercept_6 ~ normal(1.07, 1)
Function 'get_prior' might be helpful to you.

The code:

## set contrasts (sum-coding)
contrasts(df$group) = contr.sum(2)

## develop formula
bf.form = brms::bf(rating ~ group + cont1 + cont2 + 
                     group:cont1  + group:cont2)
code = "INT"

get_prior(bf.form, data = df, family = cumulative('probit'))

## thresholds implied by equal probabilities across the 7 levels;
## these qnorm() values are used as the intercept prior means below
tibble(rating = 1:7) %>% 
  mutate(proportion = 1/7) %>% 
  mutate(cumulative_proportion = cumsum(proportion)) %>% 
  mutate(right_hand_threshold = qnorm(cumulative_proportion))

priors = c(
  prior(normal(-1.07, 1), class = Intercept, coef = 1),
  prior(normal(-0.57, 1), class = Intercept, coef = 2),
  prior(normal(-0.18, 1), class = Intercept, coef = 3),
  prior(normal(0.18, 1), class = Intercept, coef = 4),
  prior(normal(0.57, 1), class = Intercept, coef = 5),
  prior(normal(1.07, 1), class = Intercept, coef = 6),
  prior(normal(0, 1), class = b) 
)

##simulate data
generator <- SBC_generator_brms(bf.form, data = df, 
                                family = cumulative('probit'), init = 0.1, prior = priors,
                                thin = 50, warmup = 10000, refresh = 2000,
                                ## generating the log density is useful, but a bit
                                ## computationally expensive, so it is turned off here
                                generate_lp = FALSE)

datasets <- generate_datasets(generator, 100)

backend <- SBC_backend_brms_from_generator(generator, chains = 4, thin = 1,
                                           warmup = 1500, iter = 6000,               
                                           init = 0.1) 

results <- compute_SBC(datasets, backend,
                       cache_mode = "results", 
                       cache_location = file.path(cache_dir, code))

aftmnelbier11 avatar Mar 13 '24 14:03 aftmnelbier11

Hi, thanks for reporting. You cannot force the simulator to cover all of the values every time. My impression is that you would like a prior on the parameters that makes it very likely that all levels appear in the data. This is hard to do by setting priors on individual parameters, but there is a simple workaround: you can drop the datasets that don't have all levels and not fit them at all. This is valid, as discussed at https://hyunjimoon.github.io/SBC/articles/rejection_sampling.html

So you would generate a few more datasets than you actually need and then do something like (code not actually tested):

has_all_levels <- function(x) {
  # TRUE when all 7 response levels occur in the generated dataset
  return(length(unique(x$rating)) == 7)
}
good_datasets <- sapply(datasets$generated, FUN = has_all_levels)
datasets_filtered <- datasets[good_datasets]

Does that make sense?

martinmodrak avatar Mar 13 '24 14:03 martinmodrak

Yes, this makes sense! To confirm I understood correctly, you would reject the datasets without all levels before running compute_SBC() then?

Thank you again.

aftmnelbier11 avatar Mar 13 '24 17:03 aftmnelbier11

Exactly, reject before computing. Also, technically you probably only need to reject the datasets where the extreme levels are missing. If some of the middle values do not occur, I'd expect the model to sample OK. But if datasets where the middle values do not occur are unrealistic, it is sensible to focus SBC on the more realistic part of the prior space.
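
Untested, but a minimal sketch of that narrower check (assuming rating is coded as the integers 1 through 7) could be:

has_extreme_levels <- function(x) {
  # per the reasoning above, only the lowest and highest levels
  # need to be present for the model to sample OK
  return(all(c(1, 7) %in% x$rating))
}
good_datasets <- sapply(datasets$generated, FUN = has_extreme_levels)
datasets_filtered <- datasets[good_datasets]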

Hope everything else works out for you.

martinmodrak avatar Mar 13 '24 19:03 martinmodrak

Perhaps another way to make this work is to use the "zero weight" trick described here. In this case you would add one row for each level to every dataset after generation, which would make sure brms includes all the parameters in the model without influencing the results.
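
A rough, untested sketch of that trick, using the weights() addition term in brms (the helper name and the w column are made up here):

pad_with_all_levels <- function(x) {
  x$w <- 1                    # real observations keep full weight
  padding <- x[rep(1, 7), ]   # duplicate any existing row 7 times
  padding$rating <- 1:7       # one padding row per response level
  padding$w <- 0              # zero weight: no influence on the fit
  rbind(x, padding)
}

## the formula then needs the weights term, e.g.
## bf.form <- brms::bf(rating | weights(w) ~ group + cont1 + cont2 +
##                       group:cont1 + group:cont2)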

lunafazio avatar Mar 13 '24 22:03 lunafazio

This is also a good suggestion that I will look into. Many thanks!

Regarding the rejection sampling, this code worked for me:

check_levels = function(x) {
  if (length(unique(x$rating)) == 7){
    return(x)
  }
  return(NULL)  # marks datasets that are missing a level
}

# use function (lapply is safer than sapply here, since it always returns a list)
datasets$generated <- lapply(datasets$generated, FUN = check_levels)

# get sim_ids for the simulations with all levels
good.sims = which(!sapply(datasets$generated, is.null))

# remove simulations that do not have all levels from $generated and $variables
datasets$generated = datasets$generated[good.sims]
datasets$variables = datasets$variables[good.sims, ]
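
For completeness, the filtered datasets would then go into compute_SBC() as in the original call:

results <- compute_SBC(datasets, backend,
                       cache_mode = "results", 
                       cache_location = file.path(cache_dir, code))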

aftmnelbier11 avatar Mar 14 '24 11:03 aftmnelbier11