DESeq2 icon indicating copy to clipboard operation
DESeq2 copied to clipboard

DESeqDataSet() fails when the count matrix contains a value higher than .Machine$integer.max

Open frederikziebell opened this issue 1 year ago • 5 comments

Hi Mike,

when the count matrix contains a value larger than .Machine$integer.max, DESeqDataSet() fails because NA values are introduced when converting the matrix to integer mode. Seeing such a high count is not an unrealistc scenario (anymore), since I found the error by analyzing a recount3 data set:

library("recount3")
library("DESeq2")
library("tidyverse")

rse <- available_projects() %>%
  filter(project == "SRP074739") %>% 
  create_rse()

dds <- DESeqDataSet(rse, design = ~1)
# Warning: NAs introduced by coercion to integer rangeError in validObject(.Object) : 
#  invalid class “DESeqDataSet” object: NA values are not allowed in the count matrix

and just to show that it is related to the integer-mode conversion, we can run

sum(assay(rse) > .Machine$integer.max)
# 10

x <- c(.Machine$integer.max, .Machine$integer.max + 1)
mode(x) <- "integer"
x
# Warning: NAs introduced by coercion to integer range
# [1] 2147483647    NA

The "trouble-causing" counts are exceptionally high for this data set

assay(rse) %>% 
  as.vector()  %>%  
  qplot(bins = 100) + 
  scale_x_log10(breaks = 10^(0:10)) +
  scale_y_log10() +
  labs(x = "count", y = "# of observations") +
  geom_vline(xintercept = .Machine$integer.max, linetype = "dashed")
image

but the issue could be present for other data as well.

frederikziebell avatar Mar 08 '23 09:03 frederikziebell

Counts are modeled as integers in the C++ fitDisp() function, but that's the key place that I see in the codebase.

@const-ae would glmGamPoi help or do you also model as integers?

I'm wondering if genes with samples with counts above a billion could be handled as a special case also, since it is going to be quite rare. E.g. you could just run glm.nb on these or something, using the size factor offsets from DESeq2 and the fitted dispersion or its own estimated dispersion.

mikelove avatar Mar 08 '23 14:03 mikelove

My C++ is templated to either use int or double depending on the type of the input matrix (e.g., https://github.com/const-ae/glmGamPoi/blob/master/src/beta_estimation.cpp#L431). But I have never tested it with gigantic counts, but it could be interesting to see what happens.

const-ae avatar Mar 08 '23 15:03 const-ae

Ok, so I could potentially allow override of the int conversion and pass doubles to glmGamPoi for all estimation. That's cleaner than the alternative

mikelove avatar Mar 08 '23 15:03 mikelove

A quick experiment seems to confirm that glmGamPoi can handle counts which are larger than .Machine$integer.max:

# Integers with values larger than 2^31-1 = 2,147,483,647 = 2.1 * 10^9 cannot be represented by ints
mat <- matrix(rpois(n = 5 * 100, lambda = 3 * 10^9), nrow = 5, ncol = 100)
any(mat > .Machine$integer.max)
#> [1] TRUE
rand <- rnorm(100)
fit <- glmGamPoi::glm_gp(mat, design = ~ rand)
sprintf("%.2e", exp(fit$Beta[1,1]))
#> [1] "3.00e+09"

Created on 2023-03-08 with reprex v2.0.2

However, I don't want to give any guarantees that the convergence is optimal under such circumstances.

const-ae avatar Mar 08 '23 15:03 const-ae

Just a heads up, I plan to work on this side case, e.g. override integer checks so that one can use glmGamPoi within DESeq(), but it may not happen before release. But the 2014 fitType won't be updateable to handle these counts.

mikelove avatar Mar 22 '23 19:03 mikelove

Just curious if this issue has been addressed in the latest version?

itamburi avatar Jul 18 '24 18:07 itamburi

Not yet. But you can directly use glmGamPoi bypassing DESeq2 for the mean time. Or edgeR or limma.

I have to see if the proposal is possible without doing a lot of rewriting and we just don't have many datasets coming up with a gene with a count of billions for a single sample.

mikelove avatar Jul 18 '24 20:07 mikelove

You should now be able to avoid integer conversion using DESeqDataSet() with skipIntegerMode=TRUE, followed by fitType="glmGamPoi"

 > mat <- matrix(rnbinom(n = 5 * 100, mu = 3 * 10^9, size=1000), nrow = 5, ncol = 100)
 > se <- SummarizedExperiment(list(counts = mat), colData=DataFrame(rand))
 > dds <- DESeqDataSet(se, ~rand, skipIntegerMode = TRUE)
!> dds <- DESeq(dds, test="LRT", fitType="glmGamPoi", reduced=~1)
 estimating size factors
 estimating dispersions
 gene-wise dispersion estimates
 using 'glmGamPoi' as fitType. If used in published research, please cite:
     Ahlmann-Eltze, C., Huber, W. (2020) glmGamPoi: Fitting Gamma-Poisson
     Generalized Linear Models on Single Cell Count Data. Bioinformatics.
     https://doi.org/10.1093/bioinformatics/btaa1009
 mean-dispersion relationship
 final dispersion estimates
 fitting model and testing
 Fit reduced model
 Calculate quasi likelihood ratio
 Prepare results

mikelove avatar Jul 19 '24 20:07 mikelove

@itamburi or @frederikziebell could you give devel branch a try?

mikelove avatar Jul 29 '24 21:07 mikelove