DESeq2
DESeq2 copied to clipboard
DESeqDataSet() fails when the count matrix contains a value higher than .Machine$integer.max
Hi Mike,
when the count matrix contains a value larger than .Machine$integer.max
, DESeqDataSet()
fails because NA values are introduced when converting the matrix to integer mode. Seeing such a high count is not an unrealistc scenario (anymore), since I found the error by analyzing a recount3
data set:
library("recount3")
library("DESeq2")
library("tidyverse")
rse <- available_projects() %>%
filter(project == "SRP074739") %>%
create_rse()
dds <- DESeqDataSet(rse, design = ~1)
# Warning: NAs introduced by coercion to integer rangeError in validObject(.Object) :
# invalid class “DESeqDataSet” object: NA values are not allowed in the count matrix
and just to show that it is related to the integer-mode conversion, we can run
sum(assay(rse) > .Machine$integer.max)
# 10
x <- c(.Machine$integer.max, .Machine$integer.max + 1)
mode(x) <- "integer"
x
# Warning: NAs introduced by coercion to integer range
# [1] 2147483647 NA
The "trouble-causing" counts are exceptionally high for this data set
assay(rse) %>%
as.vector() %>%
qplot(bins = 100) +
scale_x_log10(breaks = 10^(0:10)) +
scale_y_log10() +
labs(x = "count", y = "# of observations") +
geom_vline(xintercept = .Machine$integer.max, linetype = "dashed")
data:image/s3,"s3://crabby-images/cd632/cd632b465409cedd89134db66eb3279ea728307e" alt="image"
but the issue could be present for other data as well.
Counts are modeled as integers in the C++ fitDisp() function, but that's the key place that I see in the codebase.
@const-ae would glmGamPoi help or do you also model as integers?
I'm wondering if genes with samples with counts above a billion could be handled as a special case also, since it is going to be quite rare. E.g. you could just run glm.nb
on these or something, using the size factor offsets from DESeq2 and the fitted dispersion or its own estimated dispersion.
My C++ is templated to either use int
or double
depending on the type of the input matrix (e.g., https://github.com/const-ae/glmGamPoi/blob/master/src/beta_estimation.cpp#L431). But I have never tested it with gigantic counts, but it could be interesting to see what happens.
Ok, so I could potentially allow override of the int conversion and pass doubles to glmGamPoi for all estimation. That's cleaner than the alternative
A quick experiment seems to confirm that glmGamPoi
can handle counts which are larger than .Machine$integer.max
:
# Integers with values larger than 2^31-1 = 2,147,483,647 = 2.1 * 10^9 cannot be represented by ints
mat <- matrix(rpois(n = 5 * 100, lambda = 3 * 10^9), nrow = 5, ncol = 100)
any(mat > .Machine$integer.max)
#> [1] TRUE
rand <- rnorm(100)
fit <- glmGamPoi::glm_gp(mat, design = ~ rand)
sprintf("%.2e", exp(fit$Beta[1,1]))
#> [1] "3.00e+09"
Created on 2023-03-08 with reprex v2.0.2
However, I don't want to give any guarantees that the convergence is optimal under such circumstances.
Just a heads up, I plan to work on this side case, e.g. override integer checks so that one can use glmGamPoi within DESeq(), but it may not happen before release. But the 2014 fitType won't be updateable to handle these counts.
Just curious if this issue has been addressed in the latest version?
Not yet. But you can directly use glmGamPoi bypassing DESeq2 for the mean time. Or edgeR or limma.
I have to see if the proposal is possible without doing a lot of rewriting and we just don't have many datasets coming up with a gene with a count of billions for a single sample.
You should now be able to avoid integer conversion using DESeqDataSet()
with skipIntegerMode=TRUE
, followed by fitType="glmGamPoi"
> mat <- matrix(rnbinom(n = 5 * 100, mu = 3 * 10^9, size=1000), nrow = 5, ncol = 100)
> se <- SummarizedExperiment(list(counts = mat), colData=DataFrame(rand))
> dds <- DESeqDataSet(se, ~rand, skipIntegerMode = TRUE)
!> dds <- DESeq(dds, test="LRT", fitType="glmGamPoi", reduced=~1)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
using 'glmGamPoi' as fitType. If used in published research, please cite:
Ahlmann-Eltze, C., Huber, W. (2020) glmGamPoi: Fitting Gamma-Poisson
Generalized Linear Models on Single Cell Count Data. Bioinformatics.
https://doi.org/10.1093/bioinformatics/btaa1009
mean-dispersion relationship
final dispersion estimates
fitting model and testing
Fit reduced model
Calculate quasi likelihood ratio
Prepare results
@itamburi or @frederikziebell could you give devel branch a try?