GPBoost
GPBoost copied to clipboard
Get all pairwise comparisons from model
Hi @fabsig,
Thank you for the great package! I'm really new to this and I am trying to use it with my single-cell data as I have a mixture of independent and longitudinal groups as shown here:
data <- data.frame(
celltype_a = c(
1,1,1,1,
1,1,0,1,
0,1,0,0
),
group_id = c(
"A", "A", "A", "A",
"Pre", "Pre", "Pre", "Pre",
"Post", "Post", "Post", "Post"
),
individual_id = c(
"A1", "A2", "A3", "A4",
"P1", "P2", "P3", "P4",
"P1", "P2", "P3", "P4"
),
pool_id = c(
rep(c("batch1", "batch2", "batch3", "batch4"))
)
)
> data
celltype_a group_id individual_id pool_id
1 1 A A1 batch1
2 1 A A2 batch2
3 1 A A3 batch3
4 1 A A4 batch4
5 1 Pre P1 batch1
6 1 Pre P2 batch2
7 0 Pre P3 batch3
8 1 Pre P4 batch4
9 0 Post P1 batch1
10 1 Post P2 batch2
11 0 Post P3 batch3
12 0 Post P4 batch4
So I want to see whether there are significant differences across group_id levels in terms of proportion of one cell type and then overall change along A -> Pre - > Post trajectory. In group_id, Pre and Post are repeated measures, while A is independent.
This is the output that I got and I wanted to confirm that is it not possible to get the comparison between Pre and Post? With lme4, I am getting the information for pairwise comparisons with the emmeans package.
fixed_effects_matrix <- model.matrix(celltype_a ~ group_id + pool_id, data = data)
mod_gpb <- fitGPModel(likelihood = "bernoulli_logit",
X = fixed_effects_matrix,
group_data = data[["individual_id"]],
y = data[["celltype_a"]], params = list(std_dev = TRUE))
summary(mod_gpb)
> summary(mod_gpb)
=====================================================
Model summary:
Log-lik AIC BIC
0.00 14.00 17.39
Nb. observations: 12
Nb. groups: 8 (Group_1)
-----------------------------------------------------
Covariance parameters (random effects):
Param.
Group_1 3e-04
-----------------------------------------------------
Linear regression coefficients (fixed effects):
Param. Std. dev. z value P(>|z|)
(Intercept) 47.2072 4413.918 0.0107 0.9915
group_idPost -63.4535 4801.050 -0.0132 0.9895
group_idPre -31.5929 3809.104 -0.0083 0.9934
pool_idbatch2 32.9021 4970.412 0.0066 0.9947
pool_idbatch3 -31.3698 3454.073 -0.0091 0.9928
pool_idbatch4 0.1634 2844.962 0.0001 1.0000
=====================================================
Thank you so much!
Best, Liezel
Thank you for your interest in GPBoost! I think this is currently not possible without some additional manual work on the user side.
- Can you elaborate a little on what you want to test exactly? H0: beta(group_idPost) = beta(group_idPre)?
- Could you provide code for how this can be done with
emmeans? - Do you know whether this can also be done with
lmerTest?
I am asking these questions as I might implement such features in GPBoost in the future.
Hi @fabsig,
Apologies for the late reply. Let me actually share a better sample data and the wrapper functions I wrote based on MASC, lme4 and emmeans. So my goal here is to see which cell types vary across 3 groups of individuals (healthy, before_treatment and after_treatment). I'd like to know the trends along the trajectory healthy -> before_treatment -> after_treatment (before and after treatment are from the same patients with the condition) and also do pairwise comparisons between groups. MASC is actually a tool for this but I modified to work with >2 groups and also I had to use emmeans to get all pairwise comparisons in addition to the comparison of the groups vs. a reference group (healthy for example). Here are the two key functions: 1) masc_modified() which is a modified version of [MASC (https://github.com/immunogenomics/masc/blob/master/R/masc.R. The function fits a model for each cell type. 2) pairwise_contrasts() which is a wrapper for emmeans() to extract pairwise comparisons from the glmm produced by masc_modified().
library(foreach)
library(doParallel)
library(tidyverse)
library(lme4)
library(emmeans)
masc_modified <- function (
dataset,
cluster,
contrast,
random_effects = NULL,
fixed_effects = NULL,
verbose = FALSE,
save_models = FALSE,
save_model_dir = ".",
n_cores = 1,
p.adjust_method = "BH"
) {
# Generate design matrix from cluster assignments
cluster <- as.character(cluster)
designmat <- model.matrix(~ cluster + 0, data.frame(cluster = cluster))
dataset <- dataset[, c(contrast, random_effects, fixed_effects)]
dataset <- cbind(designmat, dataset)
# Create model formulas
if (!is.null(fixed_effects) && !is.null(random_effects)) {
model_rhs <- paste0(
c(paste0(fixed_effects, collapse = " + "),
paste0("(1|", random_effects, ")", collapse = " + ")),
collapse = " + "
)
if (verbose == TRUE) {
message(paste("Using null model:", "cluster ~", model_rhs))
}
} else if (!is.null(fixed_effects) && is.null(random_effects)) {
model_rhs <- paste0(fixed_effects, collapse = " + ")
if (verbose == TRUE) {
message(paste("Using null model:", "cluster ~", model_rhs))
# For now, do not allow models without mixed effects terms
stop("No random effects specified")
}
} else if (is.null(fixed_effects) && !is.null(random_effects)) {
model_rhs <- paste0("(1|", random_effects, ")", collapse = " + ")
if (verbose == TRUE) {
message(paste("Using null model:", "cluster ~", model_rhs))
}
} else {
model_rhs <- "1" # only includes intercept
if (verbose == TRUE) {
message(paste("Using null model:", "cluster ~", model_rhs))
stop("No random or fixed effects specified")
}
}
##### Parallel start
if(n_cores > 1){
cl <- makeCluster(n_cores)
registerDoParallel(cl)
`%op%` <- `%dopar%`
print(paste0("Running with ", n_cores, " cores."), quote = FALSE)
} else {
`%op%` <- `%do%`
}
cluster_lvls <- grep("^cluster", colnames(dataset), value = TRUE)
to_export <- c("dataset", "cluster_lvls", "model_rhs", "verbose")
cluster_models <- foreach(
i = seq_along(cluster_lvls),
#.combine = "c",
.inorder = FALSE,
.packages = "tidyverse",
.export=to_export,
.noexport=setdiff(ls(), to_export)
) %op% {
# Run nested mixed-effects models for each cluster
test_cluster <- cluster_lvls[[i]]
if (verbose == TRUE) {
message(paste("Creating logistic mixed models for", test_cluster))
}
null_fm <- as.formula(paste0(
c(paste0(test_cluster, " ~ 1 + "), model_rhs), collapse = ""
))
# In R, intercept is automatically added unless explicitly removed with 0 or -1
full_fm <- as.formula(paste0(
c(paste0(test_cluster, " ~ ", contrast, " + "), model_rhs), collapse = ""
))
# Run null and full mixed-effects models
null_model <- lme4::glmer(
formula = null_fm,
data = dataset,
family = binomial,
nAGQ = 1,
verbose = 0,
control = glmerControl(optimizer = "bobyqa")
)
full_model <- lme4::glmer(
formula = full_fm,
data = dataset,
family = binomial,
nAGQ = 1,
verbose = 0,
control = glmerControl(optimizer = "bobyqa")
)
# LRT test to assess whether contrast variable improves model fit
model_lrt <- anova(null_model, full_model)
# Calculate Wald confidence intervals for fixed effect coefficients
fixed_effect_names <- names(fixef(full_model))
ci_mx <- confint.merMod(full_model, method = "Wald", parm = fixed_effect_names)
# Collect other info including odds ratio
model_info <- ci_mx
coeffs <- fixef(full_model)
model_info <- cbind(coef = coeffs[rownames(model_info)], model_info)
model_info <- exp(model_info)
colnames(model_info) <- c("OR", "OR.95pct.ci.lower", "OR.95pct.ci.upper")
# Tidy data
model_df <- model_info %>%
as.data.frame() %>%
tibble::rownames_to_column("var") %>%
pivot_longer(-var, names_to = "statistic") %>%
unite("statistic", var, statistic, sep = ".", remove = FALSE)
#pivot_wider(names_from = name, values_from = value)
# Pairwise contrasts with emmeans()
contrasts_df <- pairwise_contrasts(
full_model,
contrast = contrast,
adjust_contrast = "tukey",
save_emmeans_output = TRUE,
save_dir = save_model_dir
)
model_df <- bind_rows(model_df, contrasts_df) %>%
select(colnames(contrasts_df))
# Save model objects to list
models <- list()
models$cluster_name <- test_cluster
models$null_model <- null_model
models$full_model <- full_model
models$model_lrt <- model_lrt
models$model_df <- model_df
return(models)
}
if(n_cores > 1){
stopCluster(cl)
}
##### Parallel end
# Organize results into output dataframe
names(cluster_models) <- sapply(cluster_models, USE.NAMES = FALSE, function(x) x$cluster_name)
output <- lapply(cluster_models, function(x) x$model_df %>% as.data.frame())
# Reformatting to original MASC output
# output <- data.frame(
# cluster = names(cluster_models),
# size = colSums(designmat[ ,names(cluster_models), drop = FALSE])
# )
# output$model.pvalue <- sapply(cluster_models, function(x) x$model_lrt[["Pr(>Chisq)"]][2])
# output$model.padjust <- p.adjust(output$model.pvalue, method = p.adjust_method)
# output <- cbind.data.frame(
# output,
# do.call("rbind", lapply(cluster_models, function(x) x$model_info))
# )
# Return MASC results and save models if specified
if (save_models) {
saveRDS(cluster_models, file.path(save_model_dir, "masc_modified.modelobj.rds"))
return(output)
} else {
return(output)
}
}
pairwise_contrasts <- function (
model,
contrast,
adjust_contrast,
save_emmeans_output = TRUE,
save_dir = ".",
...
) {
emm_emmeans <- emmeans(model, as.formula(paste("~ ", contrast)), adjust = "none")
emm_contrasts <- contrast(emm_emmeans, "revpairwise", adjust = adjust_contrast) %>%
# Add confidence intervals and test for significance of estimates
summary(level = 0.95, infer = c(TRUE, TRUE), type = "response")
emm_emmeans <- emm_emmeans %>%
summary(level = 0.95, infer = c(TRUE, TRUE), type = "response")
emm <- list(emm_emmeans, emm_contrasts)
if (save_emmeans_output) {
saveRDS(emm, file.path(save_dir, "emmeans_output.rds"))
}
# Store information into a data frame
lst <- lapply(emm, function(x) {
tibble(x) %>%
pivot_longer(-1, names_to = "statistic") %>%
rename(contrast = 1)
})
df <- do.call("rbind", lst) %>% as.data.frame() %>%
mutate(
statistic = recode(
statistic,
odds.ratio = "OR",
p.value = "OR.pvalue",
asymp.UCL = "OR.95pct.ci.upper",
asymp.LCL = "OR.95pct.ci.lower"
)
) %>%
separate(
contrast, c("var", "contrast"), sep = " / ", fill = "right", remove = FALSE
)
return(df)
}
This is the code for running with the sample data I shared:
data <- read_csv("sample_data.csv")
data <- data %>%
mutate(
grouping_id = individual_id,
pool_id = factor(pool_id),
group_id = factor(group_id, ordered = TRUE),
)
masc_out <- masc_modified(
dataset = data,
cluster = data$labels,
contrast = "group_id",
random_effects = "individual_id",
fixed_effects = "pool_id",
verbose = FALSE,
save_models = TRUE,
save_model_dir = ".",
n_cores = 1,
p.adjust_method = "BH"
)
About lmerTest, from my understanding I can't use it for binomial glmm, also I don't think it gives pairwise comparisons but it's in my to do list actually to test it out for certain use cases just needing LMM.
Let me know if you need more information, thank you!!!
Best, Liezel
Thank you very much for the clarifying information!
I can confirm that this is currently not possible with GPBoost out-of-the-box. It would, however, be a nice add-on for GPBoost whether by (i) creating GPBoost internal functions that do the same thing or (ii) adding support for GPBoost to the emmeans package.
To be honest, I do not plan to work on this in the near future, but contributions are very welcome.