diffxpy icon indicating copy to clipboard operation
diffxpy copied to clipboard

Design matrix is not full rank (on example that works in edgeR)

Open grst opened this issue 5 years ago • 2 comments

Soneson et al, 2018, NMeth recommend to add the fraction of detected genes as a covariate to the linear model for DE testing:

~ group + n_genes

(instead of using a ZINB model (#131))

When I try add n_genes to my diffxpy model, it throws a ValueError:

> test = de.test.wald(
    data=counts.values.T,
    formula_loc="~ 1 + group + n_genes",
    factor_loc_totest=["group"],
    coef_to_test="group",
    sample_description=obs,
    batch_size=100,
    training_strategy="DEFAULT",
    dtype="float64",
    gene_names=counts.index
  )
ValueError: constrained design matrix is not full rank: 80 81

The same model runs fine with edgeR:

d <- DGEList(counts)
d <- suppressWarnings(edgeR::calcNormFactors(d))
design <- model.matrix(~ 1 + group + n_genes, data=obs)
d <- estimateDisp(d, design)
fit <- glmFit(d,design)
lrt <- glmLRT(fit, coef = 2)
pval <- lrt$table$PValue

Is that a bug? Or am I missing something? Thanks for your help!

Full repex

  • A reproducible example (Rmd notebook + sample data) is available at https://github.com/grst/benchmark-single-cell-de-analysis/tree/master/diffxpy_test. The example uses diffxpy v0.7.1 from PyPI.
  • An executed version of the notebook is here.
  • To run the notebook in my conda-environment, use nextflow run diffxpy_test.nf.

grst avatar Nov 11 '19 13:11 grst

Hi @grst, I assume n_genes is numeric? We enforce that numeric covariates are independently verified to be desired numeric and not by accident converted from one-hot categorical. This is done by adding the argument as_numeric=['n_genes'], documented here: https://diffxpy.readthedocs.io/en/stable/api/diffxpy.api.test.wald.html#diffxpy.api.test.wald. Thanks for the detailed issue!

davidsebfischer avatar Nov 11 '19 15:11 davidsebfischer

Thanks, that worked (with version 0.6.13)!

Completed at: 12-Nov-2019 13:50:06
Duration    : 5m 6s
CPU hours   : 0.1
Succeeded   : 1

With version 0.7.1 it runs for several hours and then fails with a LinAlgError: Singular matrix:

 Detailed traceback:
    File "<string>", line 11, in <module>
    File "/home/sturm/projects/2019/zinbwaveZinger/work/conda/diffxpy_test-bf5ddff8c39a3368b22c8ee2c4730f1c/lib/python3.7/site-packages/diffxpy/testing/tests.py", line 648, in wald
      **kwargs,
    File "/home/sturm/projects/2019/zinbwaveZinger/work/conda/diffxpy_test-bf5ddff8c39a3368b22c8ee2c4730f1c/lib/python3.7/site-packages/diffxpy/testing/tests.py", line 169, in _fit
      estim.train_sequence(training_strategy=training_strategy)
    File "/home/sturm/projects/2019/zinbwaveZinger/work/conda/diffxpy_test-bf5ddff8c39a3368b22c8ee2c4730f1c/lib/python3.7/site-packages/batchglm/train/numpy/base_glm/estimator.py", line 52, in train_sequence
      self.train(**training_strategy)
    File "/home/sturm/projects/2019/zinbwaveZinger/work/conda/diffxpy_test-bf5ddff8c39a3368b22c8ee2c4730f1c/lib/python3.7/site-packages/batchglm/train/numpy/base_glm/estimator.py", line 79, in train
      self.model.a_var = sel

grst avatar Nov 12 '19 15:11 grst