diffxpy icon indicating copy to clipboard operation
diffxpy copied to clipboard

Exaggerated estimates and low standard deviation

Open Hrovatin opened this issue 3 years ago • 8 comments

Sometimes I get unexpectedly high estimates with very small standard deviation, see below. These estimates are not in accordance to expected lFC, but are significant.

(Image: fitted log2fc is equal to coef mle, colour scale is viridis - yellow is large, violet is small) image Significant estimates with very high absolute lFC all have sd at single number, may be some sort of machine/statistical method precision? Is this the same for you @cdedonno ?

# All significant genes with very high abs lFC as shown on the image above have sd at same number.

summaries['16 d'].query('qval< 0.1 & abs(log2fc) > 10').coef_sd.unique()
array([2.22275875e-162])

Hrovatin avatar Mar 01 '21 20:03 Hrovatin

Hi @Hrovatin! Yeah, as we discussed last friday, this seems to be indeed a very similar issue to the one I presented. You also have these stack of genes sitting at very high (or very low) lFC values. Have you checked if these genes are lowly expressed?

cdedonno avatar Mar 02 '21 10:03 cdedonno

If you want to look into whether this is related to inversion of the fisher information matrix for the wald test, compare -model._hessian (https://github.com/theislab/batchglm/blob/31b905b99b6baa7c94b82550d6a74f00d81966ea/batchglm/train/numpy/base_glm/estimator.py#L544) and model._fisher_inv (https://github.com/theislab/batchglm/blob/31b905b99b6baa7c94b82550d6a74f00d81966ea/batchglm/train/numpy/base_glm/estimator.py#L548)

davidsebfischer avatar Mar 02 '21 10:03 davidsebfischer

There could also be numerical differences between the single coefficient wald test https://github.com/theislab/diffxpy/blob/c94e60a51a818363c57b7f112417dcb9bb96fe37/diffxpy/stats/stats.py#L181 and the generalisation to more parameters https://github.com/theislab/diffxpy/blob/c94e60a51a818363c57b7f112417dcb9bb96fe37/diffxpy/stats/stats.py#L221.

davidsebfischer avatar Mar 02 '21 10:03 davidsebfischer

Regarding weird lfc being at mainly unexpressed genes (based on n_cells expressing a gene): This is not the case always for me. The above example: image Another example, but here I may have too many coefficients so the model is not fit well. image Another more realistic example, before and after filtering out genes with sd at 2e-162 image image

N cells for genes with qval<0.05 and abs(lfc)>20: gene
Gm15462        33
Sypl2          38
Adgrg6         40
Gm12195        40
Cdc34b        123
Igf2bp1       309
AC163633.2     35
dtype: int64
Mean expr in gene-expr cells for genes with qval<0.05 and abs(lfc)>20: gene
Gm15462       1.058740
Sypl2         0.947558
Adgrg6        1.208749
Gm12195       1.105990
Cdc34b        0.919376
Igf2bp1       1.024707
AC163633.2    0.956821
dtype: float64

As you can see I have left a couple significant genes with high abs(lfc) that could not be filtered out based on mean expression in cells expressing the gene or based on n_cells as such high threshold would remove majority of genes (now I use threshold of gene being expressed in at least 30 cells and I have ~15000 genes left).

EDIT: Never mind the dotplot that was here before - I plotted it without taking into account the confounders.

Hrovatin avatar Mar 02 '21 20:03 Hrovatin

@davidsebfischer How should I compare the Hessian and Fisher info matrices?

Hrovatin avatar Mar 03 '21 09:03 Hrovatin

I d just collect a couple of genes that look weird and than look for signs of potential numeric instability in the matrices! Small values, large values etc.

davidsebfischer avatar Mar 03 '21 09:03 davidsebfischer

Now I managed to get to a point where sd filtering (with what seems to be min possible sd value (across different tests)) produces ok results (I also improved design and increased n_cells gene filtering (now approx ~1% of population should express the gene)). So from my perspective we can close the issue, although sd filtering could be added to diffxpy if this is a common issue and min sd stays the same across platforms.

Hrovatin avatar Mar 03 '21 18:03 Hrovatin

This occurs for me when I have only categorical covariates but usually not if I have continuous covariates.

Hrovatin avatar Apr 27 '21 06:04 Hrovatin