glmGamPoi icon indicating copy to clipboard operation
glmGamPoi copied to clipboard

pseudobulk_by apparently dropping columns

Open malcook opened this issue 3 years ago • 1 comments

Hi, I am successfully using your package with scRNA-Seq data, but finding a possible bug when trying out the pseudobulk_by option. Below I show it working fine without pseudobulk_by and then failing when I add it in:

s2<-as.SingleCellExperiment(s)
design<- model.matrix(~ 0 + experimentGroup,s2@colData)
glmGamPoi.fit <- glm_gp(s2
                       ,design = design
                       ,size_factors = "deconvolution"
                       ,on_disk = FALSE
                        )

This worked fine. My grouping factor and design matrix looks like this:

 unique(s2@colData$experimentGroup)
 [1] LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180 LL_N_300 LL_N_600
 Levels: LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180 LL_N_300 LL_N_600

 head(design)
                        LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180 LL_N_300 LL_N_600
 homeo_AAACCTGAGAAGCCCA        1        0        0        0        0        0        0
 homeo_AAACCTGAGGAGCGTT        1        0        0        0        0        0        0
 homeo_AAACCTGCAACTGCGC        1        0        0        0        0        0        0
 homeo_AAACCTGCAAGTAGTA        1        0        0        0        0        0        0
 homeo_AAACCTGCATATGAGA        1        0        0        0        0        0        0
 homeo_AAACCTGGTGCAACTT        1        0        0        0        0        0        0

Testing for DGE with the following contrast works fine:

contrast=c('LL_N_000-LL_H_000', 'LL_N_030-LL_H_000', 'LL_N_060-LL_H_000', 'LL_N_180-LL_H_000', 'LL_N_300-LL_H_000','LL_N_600-LL_H_000')
contrast<-limma::makeContrasts(levels=l,contrasts=contrast)

glmGamPoi.test <- test_de(glmGamPoi.fit
                        , contrast = contrast
                         ##,pseudobulk_by='experimentGroup'
                          ) 

... the problem comes when I try to test using pseudobulk_by, as

glmGamPoi.bulk.test <- test_de(glmGamPoi.fit
                        , contrast = contrast
                         ,pseudobulk_by='experimentGroup'
                          ) %>% setDT

Error in handle_design_parameter(design, data, col_data, reference_level) :
  The model_matrix has more columns (7) than the there are samples in the data matrix (7 columns).
Too few replicates / too many coefficients to fit model.
The head of the design matrix:
         LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180 LL_N_300 LL_N_600
LL_H_000        1        0        0        0        0        0        0
LL_N_000        0        1        0        0        0        0        0
LL_N_030        0        0        1        0        0        0        0
The head of the data:
      LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180
rpl24    21728    14048    13748    19649    14109
cep97       76       18       21        9       55
  eed      270       86       96       88      257

However by my calculations the aggregated data columns should d be there and are somehow erroneously being dropped, , viz::

 library(DelayedArray)
 grs<-colsum(s2@assays@data$counts,s2@colData$experimentGroup)
 head(grs)
         LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180 LL_N_300 LL_N_600
 rpl24      21728    14048    13748    19649    14109    11293    11562
 cep97         76       18       21        9       55       32       35
 eed          270       86       96       88      257      176      101
 hikeshi      459       98      120      137      458      380      144
 tmem39a      296      134       68       82      132       63       96
 ildr1a       746      471      620      460      435      261      342

I could potentially share the s2 object via ftp if it would help you sleuth this.

Or perhaps you recognize some error on my part?

malcook avatar May 23 '21 05:05 malcook

I think the problem is that you use the experimentGroup to form the pseudobulk, but also want to use it as the design argument.

The manual way of forming the pseudo-bulk would look like:

splitter <- split(seq_len(ncol(s2)), s2$experimentGroup)
pseudobulk_mat <- do.call(cbind, lapply(splitter, function(idx){
    matrixStats::rowSums2(assay(s2), cols = idx)
}))
pseudo_design_mat <- do.call(rbind, lapply(splitter, function(idx){
    matrixStats::colMeans2(glmGamPoi.fit$model_matrix, rows = idx)
 }))

If you look at pseudobulk_mat, you will see that it has 7 columns. At the same time, you want to fit a model which also has 7 different parameters, which fails with the error message: Too few replicates / too many coefficients to fit model..

Best, Constantin

const-ae avatar May 26 '21 14:05 const-ae