GPBoost icon indicating copy to clipboard operation
GPBoost copied to clipboard

Get all pairwise comparisons from model

Open liezeltamon opened this issue 6 months ago • 2 comments

Hi @fabsig,

Thank you for the great package! I'm really new to this and I am trying to use it with my single-cell data as I have a mixture of independent and longitudinal groups as shown here:

data <- data.frame(
  celltype_a = c(
    1,1,1,1,
    1,1,0,1,
    0,1,0,0
  ),
  group_id = c(
    "A", "A", "A", "A",
    "Pre", "Pre", "Pre", "Pre",
    "Post", "Post", "Post", "Post"
  ),
  individual_id = c(
    "A1", "A2", "A3", "A4",
    "P1", "P2", "P3", "P4",
    "P1", "P2", "P3", "P4"
  ),
  pool_id = c(
    rep(c("batch1", "batch2", "batch3", "batch4"))
  )
)

> data
   celltype_a group_id individual_id pool_id
1           1        A            A1  batch1
2           1        A            A2  batch2
3           1        A            A3  batch3
4           1        A            A4  batch4
5           1      Pre            P1  batch1
6           1      Pre            P2  batch2
7           0      Pre            P3  batch3
8           1      Pre            P4  batch4
9           0     Post            P1  batch1
10          1     Post            P2  batch2
11          0     Post            P3  batch3
12          0     Post            P4  batch4

So I want to see whether there are significant differences across group_id levels in terms of proportion of one cell type and then overall change along A -> Pre - > Post trajectory. In group_id, Pre and Post are repeated measures, while A is independent.

This is the output that I got and I wanted to confirm that is it not possible to get the comparison between Pre and Post? With lme4, I am getting the information for pairwise comparisons with the emmeans package.

fixed_effects_matrix <- model.matrix(celltype_a ~ group_id + pool_id, data = data)
mod_gpb <- fitGPModel(likelihood = "bernoulli_logit",
                      X = fixed_effects_matrix, 
                      group_data = data[["individual_id"]], 
                      y = data[["celltype_a"]], params = list(std_dev = TRUE))
summary(mod_gpb)

> summary(mod_gpb)
=====================================================
Model summary:
Log-lik     AIC     BIC 
   0.00   14.00   17.39 
Nb. observations: 12
Nb. groups: 8 (Group_1)
-----------------------------------------------------
Covariance parameters (random effects):
        Param.
Group_1  3e-04
-----------------------------------------------------
Linear regression coefficients (fixed effects):
                Param. Std. dev. z value P(>|z|)
(Intercept)    47.2072  4413.918  0.0107  0.9915
group_idPost  -63.4535  4801.050 -0.0132  0.9895
group_idPre   -31.5929  3809.104 -0.0083  0.9934
pool_idbatch2  32.9021  4970.412  0.0066  0.9947
pool_idbatch3 -31.3698  3454.073 -0.0091  0.9928
pool_idbatch4   0.1634  2844.962  0.0001  1.0000
=====================================================

Thank you so much!

Best, Liezel

liezeltamon avatar May 14 '25 16:05 liezeltamon

Thank you for your interest in GPBoost! I think this is currently not possible without some additional manual work on the user side.

  • Can you elaborate a little on what you want to test exactly? H0: beta(group_idPost) = beta(group_idPre)?
  • Could you provide code for how this can be done with emmeans?
  • Do you know whether this can also be done with lmerTest?

I am asking these questions as I might implement such features in GPBoost in the future.

fabsig avatar May 15 '25 06:05 fabsig

Hi @fabsig,

Apologies for the late reply. Let me actually share a better sample data and the wrapper functions I wrote based on MASC, lme4 and emmeans. So my goal here is to see which cell types vary across 3 groups of individuals (healthy, before_treatment and after_treatment). I'd like to know the trends along the trajectory healthy -> before_treatment -> after_treatment (before and after treatment are from the same patients with the condition) and also do pairwise comparisons between groups. MASC is actually a tool for this but I modified to work with >2 groups and also I had to use emmeans to get all pairwise comparisons in addition to the comparison of the groups vs. a reference group (healthy for example). Here are the two key functions: 1) masc_modified() which is a modified version of [MASC (https://github.com/immunogenomics/masc/blob/master/R/masc.R. The function fits a model for each cell type. 2) pairwise_contrasts() which is a wrapper for emmeans() to extract pairwise comparisons from the glmm produced by masc_modified().

library(foreach)
library(doParallel)
library(tidyverse)
library(lme4)
library(emmeans)

masc_modified <- function (
    dataset,
    cluster,
    contrast,
    random_effects = NULL,
    fixed_effects = NULL,
    verbose = FALSE,
    save_models = FALSE,
    save_model_dir = ".",
    n_cores = 1,
    p.adjust_method = "BH"
) {
  
  # Generate design matrix from cluster assignments
  
  cluster <- as.character(cluster)
  designmat <- model.matrix(~ cluster + 0, data.frame(cluster = cluster))
  dataset <- dataset[, c(contrast, random_effects, fixed_effects)]
  dataset <- cbind(designmat, dataset)
  
  # Create model formulas
  if (!is.null(fixed_effects) && !is.null(random_effects)) {
    model_rhs <- paste0(
      c(paste0(fixed_effects, collapse = " + "),
        paste0("(1|", random_effects, ")", collapse = " + ")),
      collapse = " + "
    )
    if (verbose == TRUE) {
      message(paste("Using null model:", "cluster ~", model_rhs))
    }
  } else if (!is.null(fixed_effects) && is.null(random_effects)) {
    model_rhs <- paste0(fixed_effects, collapse = " + ")
    if (verbose == TRUE) {
      message(paste("Using null model:", "cluster ~", model_rhs))
      # For now, do not allow models without mixed effects terms
      stop("No random effects specified")
    }
  } else if (is.null(fixed_effects) && !is.null(random_effects)) {
    model_rhs <- paste0("(1|", random_effects, ")", collapse = " + ")
    if (verbose == TRUE) {
      message(paste("Using null model:", "cluster ~", model_rhs))
    }
  } else {
    model_rhs <- "1" # only includes intercept
    if (verbose == TRUE) {
      message(paste("Using null model:", "cluster ~", model_rhs))
      stop("No random or fixed effects specified")
    }
  }
  
  ##### Parallel start
  
  if(n_cores > 1){
    cl <- makeCluster(n_cores)
    registerDoParallel(cl)
    `%op%` <- `%dopar%`
    print(paste0("Running with ", n_cores, " cores."), quote = FALSE)
  } else {
    `%op%` <- `%do%`
  }
  
  cluster_lvls <- grep("^cluster", colnames(dataset), value = TRUE)
  to_export <- c("dataset", "cluster_lvls", "model_rhs", "verbose")
  cluster_models <- foreach(
    i = seq_along(cluster_lvls),
    #.combine = "c",
    .inorder = FALSE,
    .packages = "tidyverse",
    .export=to_export,
    .noexport=setdiff(ls(), to_export)
  ) %op% {
    
    # Run nested mixed-effects models for each cluster

    test_cluster <- cluster_lvls[[i]]
    if (verbose == TRUE) {
      message(paste("Creating logistic mixed models for", test_cluster))
    }
    null_fm <- as.formula(paste0(
      c(paste0(test_cluster, " ~ 1 + "), model_rhs), collapse = ""
    ))
    # In R, intercept is automatically added unless explicitly removed with 0 or -1
    full_fm <- as.formula(paste0(
      c(paste0(test_cluster, " ~ ", contrast, " + "), model_rhs), collapse = ""
    ))
    # Run null and full mixed-effects models
    null_model <- lme4::glmer(
      formula = null_fm,
      data = dataset,
      family = binomial,
      nAGQ = 1,
      verbose = 0,
      control = glmerControl(optimizer = "bobyqa")
    )
    full_model <- lme4::glmer(
      formula = full_fm,
      data = dataset,
      family = binomial,
      nAGQ = 1,
      verbose = 0,
      control = glmerControl(optimizer = "bobyqa")
    )
    # LRT test to assess whether contrast variable improves model fit
    model_lrt <- anova(null_model, full_model)
    # Calculate Wald confidence intervals for fixed effect coefficients
    fixed_effect_names <- names(fixef(full_model))
    ci_mx <- confint.merMod(full_model, method = "Wald", parm = fixed_effect_names)
    # Collect other info including odds ratio
    model_info <- ci_mx 
    coeffs <- fixef(full_model)
    model_info <- cbind(coef = coeffs[rownames(model_info)], model_info)
    model_info <- exp(model_info)
    colnames(model_info) <- c("OR", "OR.95pct.ci.lower", "OR.95pct.ci.upper")
    
    # Tidy data
    model_df <- model_info %>%
      as.data.frame() %>% 
      tibble::rownames_to_column("var") %>%
      pivot_longer(-var, names_to = "statistic") %>%
      unite("statistic", var, statistic, sep = ".", remove = FALSE)
      #pivot_wider(names_from = name, values_from = value)
    
    # Pairwise contrasts with emmeans()
    contrasts_df <- pairwise_contrasts(
      full_model,
      contrast = contrast,
      adjust_contrast = "tukey",
      save_emmeans_output = TRUE,
      save_dir = save_model_dir
    )
    
    model_df <- bind_rows(model_df, contrasts_df) %>% 
      select(colnames(contrasts_df))
    
    # Save model objects to list
    models <- list()
    models$cluster_name <- test_cluster
    models$null_model <- null_model
    models$full_model <- full_model
    models$model_lrt <- model_lrt
    models$model_df <- model_df
    
    return(models)
    
  }
  
  if(n_cores > 1){
    stopCluster(cl)
  }
  
  ##### Parallel end
 
  # Organize results into output dataframe
  
  names(cluster_models) <- sapply(cluster_models, USE.NAMES = FALSE, function(x) x$cluster_name)
  output <- lapply(cluster_models, function(x) x$model_df %>%  as.data.frame())
  
  # Reformatting to original MASC output
  # output <- data.frame(
  #   cluster = names(cluster_models),
  #   size = colSums(designmat[ ,names(cluster_models), drop = FALSE])
  # )
  # output$model.pvalue <- sapply(cluster_models, function(x) x$model_lrt[["Pr(>Chisq)"]][2])
  # output$model.padjust <- p.adjust(output$model.pvalue, method = p.adjust_method)
  # output <- cbind.data.frame(
  #   output,
  #   do.call("rbind", lapply(cluster_models, function(x) x$model_info))
  # )

  # Return MASC results and save models if specified
  if (save_models) {
    saveRDS(cluster_models, file.path(save_model_dir, "masc_modified.modelobj.rds"))
    return(output)
  } else {
    return(output)
  }
   
}

pairwise_contrasts <- function (
    model,
    contrast,
    adjust_contrast,
    save_emmeans_output = TRUE,
    save_dir = ".",
    ...
) {
  
  emm_emmeans <- emmeans(model, as.formula(paste("~ ", contrast)), adjust = "none")
  emm_contrasts <- contrast(emm_emmeans, "revpairwise", adjust = adjust_contrast) %>% 
    # Add confidence intervals and test for significance of estimates
    summary(level = 0.95, infer = c(TRUE, TRUE), type = "response")
  
  emm_emmeans <- emm_emmeans %>% 
    summary(level = 0.95, infer = c(TRUE, TRUE), type = "response")
  emm <- list(emm_emmeans, emm_contrasts)
  
  if (save_emmeans_output) {
    saveRDS(emm, file.path(save_dir, "emmeans_output.rds"))
  }
  
  # Store information into a data frame
  lst <- lapply(emm, function(x) {
    tibble(x) %>% 
      pivot_longer(-1, names_to = "statistic") %>% 
      rename(contrast = 1)
  })
  df <- do.call("rbind", lst) %>% as.data.frame() %>% 
    mutate(
      statistic = recode(
        statistic,
        odds.ratio = "OR",
        p.value = "OR.pvalue",
        asymp.UCL = "OR.95pct.ci.upper",
        asymp.LCL = "OR.95pct.ci.lower"
      )
    ) %>% 
    separate(
      contrast, c("var", "contrast"), sep = " / ", fill = "right", remove = FALSE
    )
  
  return(df)
  
}

This is the code for running with the sample data I shared:

data <- read_csv("sample_data.csv")
data <- data %>%
  mutate(
    grouping_id = individual_id,
    pool_id = factor(pool_id),
    group_id = factor(group_id, ordered = TRUE),
  )

masc_out <- masc_modified(
 dataset = data,
 cluster = data$labels,
 contrast = "group_id",
 random_effects = "individual_id",
 fixed_effects = "pool_id",
 verbose = FALSE,
 save_models = TRUE,
 save_model_dir = ".",
 n_cores = 1,
 p.adjust_method = "BH"
)

About lmerTest, from my understanding I can't use it for binomial glmm, also I don't think it gives pairwise comparisons but it's in my to do list actually to test it out for certain use cases just needing LMM.

Let me know if you need more information, thank you!!!

Best, Liezel

liezeltamon avatar May 19 '25 13:05 liezeltamon

Thank you very much for the clarifying information!

I can confirm that this is currently not possible with GPBoost out-of-the-box. It would, however, be a nice add-on for GPBoost whether by (i) creating GPBoost internal functions that do the same thing or (ii) adding support for GPBoost to the emmeans package.

To be honest, I do not plan to work on this in the near future, but contributions are very welcome.

fabsig avatar Jul 10 '25 13:07 fabsig