glmGamPoi icon indicating copy to clipboard operation
glmGamPoi copied to clipboard

Too many DEGs?

Open KunFang93 opened this issue 2 years ago • 1 comments

Hi,

I am a fresh user of glmGamPoi and thanks for providing this awesome tool! I tried to identify DEGs in scRNA and successfully run the following code:

load('/Volumes/kfsd/GSE200903_new_combined_cluster_FibroSamples_2022-01-14.RData')
combine_cluster <- as.SingleCellExperiment(combined_cluster)

set.seed(1)
colData(combine_cluster)
# Take highly expressed genes and proper cells:
combine_cluster_subset <- combine_cluster[rowSums(counts(combine_cluster)) > 100, 
                                          which( ! is.na(combine_cluster$CellType) & combine_cluster$CellType 
                                                 %in% c("fibroblasts"))]

# add a stim2 based on stim
colData(combine_cluster_subset)$stim2 <- colsplit(colData(combine_cluster_subset)$stim, 
                                                  split = "_", names = c('cell', 'trt','time'))$trt
# Convert counts to dense matrix
counts(combine_cluster_subset) <- as.matrix(counts(combine_cluster_subset))
# Remove empty levels because glm_gp() will complain otherwise
colData(combine_cluster_subset)$stim2 <- as.factor(colData(combine_cluster_subset)$stim2)
colData(combine_cluster_subset)$stim2 <- relevel(colData(combine_cluster_subset)$stim2, ref = "normal")

fit <- glm_gp(combine_cluster_subset, design = ~ stim2 ,
              reference_level = "normal")
summary(fit)
colnames(fit$Beta)

# The contrast argument specifies what we want to compare
# We test the expression difference of stimulated and control T-cells
#
# There is no sample label in the colData, so we create it on the fly
# from `stim` and `ind` columns in colData(fit$data).
de_res <- test_de(fit, contrast = `stim2PDAC`, 
                  pseudobulk_by = paste0(stim2, "-", rownames(colData(combine_cluster_subset)))) 


# The large `lfc` values come from groups were nearly all counts are 0
# Setting them to Inf makes the plots look nicer
de_res$lfc <- ifelse(abs(de_res$lfc) > 20, sign(de_res$lfc) * Inf, de_res$lfc)

# Most different genes
de_res<-de_res[order(de_res$pval), ]

However, I identified 11752 DEGs out of 14537 genes and I found many genes has df1=1.

head(de_res)
             name pval adj_pval f_statistic df1      df2        lfc
35           Rpl7    0        0    3244.666   1 20005.31 -0.5195890
44           Pi15    0        0    2180.603   1 20005.31 -1.6835958
124         Il1r1    0        0    3883.918   1 20005.31  2.2249821
132          Fhl2    0        0    5068.944   1 20005.31  4.8176423
158           Gls    0        0    1635.089   1 20005.31  0.9511179
164 1700019D03Rik    0        0    3380.809   1 20005.31 -2.7193859

I wondered how could I address this problem? Thanks in advance!

Best, Kun

KunFang93 avatar Jun 21 '22 21:06 KunFang93

Hi KunFang,

thank you for reaching out and for the detailed description of your problem. From a brief look over your code, it seems fine, but that is difficult to say for certain without access to the input data. If you are concerned that the fitting is wrong, you could fit the model with DESeq2 or edgeR and compare if the results are similar.

Best, Constantin

const-ae avatar Jun 22 '22 06:06 const-ae