[BUG] Dispersion estimation fails after 0.4.8
**Describe the bug**
This bug came up in the process of generating this notebook for our recent preprint. As part of the workflow, we fit a simple model with two design factors. One design factor has two levels; the other has seventy.
In 0.4.8, the procedure takes about four minutes, the vast majority of it spent on fitting dispersions. The mean/dispersion curve looks OK.
In 0.5.1, the dispersion fitting fails.
**To Reproduce**
Notebook here. Test data here.
New version:
anndata==0.11.4
matplotlib==3.10.3
numpy==2.2.6
pandas==2.2.3
pydeseq2==0.5.1
scanpy==1.11.1
scipy==1.15.3
tqdm==4.67.1
python==3.13.2
Old version:
tqdm==4.67.1
scipy==1.11.4
scanpy==1.9.6
pydeseq2==0.4.8
pandas==2.2.3
numpy==1.26.2
matplotlib==3.8.2
anndata==0.10.3
python==3.9.7
**Expected behavior**
- Most immediately, the dispersion estimation should probably not fail, given that it works in an older version.
- If the dispersion fitting procedure changed at some point between 0.4.8 and 0.5.1, it would be good if this were documented, but the only changelog entries I see relate to vst (unrelated) and refitting (also unrelated).
- I do not expect the dispersion computation to scale so unfavorably with the number of levels. The example shown here takes a few minutes for `dds_f`, which has 70 levels, but over an hour for `dds_m`, which has a little over 200 levels. This is about a 20x increase in runtime for a 2.8x increase in the size of the design matrix.
**Screenshots**
See above.
**Desktop (please complete the following information):**
- OS: Linux-5.10.93-87.444.amzn2.x86_64-x86_64-with-glibc2.26
Hi @gennadyFauna, thanks for reporting this behaviour.
After some investigation, it seems like this issue appears starting from version 0.4.10.
Even though the dispersion trend curve is different (it is flat because the parametric fitting procedure failed), I think the issue comes from the genewise ("estimated" on the plot) dispersions: in the lower plot, we can see that many genewise dispersions are set to the lower bound 1e-8, whereas this is not the case in the upper plot.
I haven't been able to pinpoint why this happens so far :/
@BorisMuzellec, thank you for looking into it. On further investigation:
- The procedure fails even on the much smaller dataset `ad = ad[ad.obs.dataset.isin(['litvinukova'])]`, i.e., restricted to only the seven donors in the `litvinukova` study.
- However, in this smaller dataset, it does succeed when using the simpler design `design='~xist'` rather than `design='~xist+donor_id'`. So perhaps something about the handling of the design matrix.
@BorisMuzellec, this is correct. MoM estimates, size factors, and `mu_hat` are identical, but genewise dispersions are different.
I have narrowed this down to two possible causes for discrepancies.
1. The dispersion estimation was non-invariant with respect to the column order.
Calling `fit_alpha_mle` (in the old version) with the old design matrix gives values that differ from those obtained when calling it with the new design matrix.
This seems a little odd since the design matrices are identical up to swaps of columns 3, 4, and 5.
2. The dispersion estimation nearly always converges.
Calling `fit_alpha_mle` (in the new version) actually gives identical results with both design matrices. But as far as I can tell, estimation converges for all but a couple of genes. In 0.4.8, convergence only happened for 24% of the genes.
So my working hypothesis is: `fit_alpha_mle` was slow and inconsistent; something changed to make it fast and consistent, but at the expense of no longer rejecting genes that could not be effectively fit. This populates the dispersion estimates with poor fits: https://github.com/owkin/PyDESeq2/blob/8503e91e3ac35f4d766d87143e21de629bfe302d/pydeseq2/dds.py#L784 i.e., failing quietly. This led to downstream failures to fit the dispersion curves, because they are overwhelmed by poor-quality fits.
This doesn't seem to be a PyDESeq2 issue; or rather, I cannot see any difference between versions of `fit_alpha_mle`. The same 99.99% convergence also occurs for both design matrices when using BFGS.
Maybe a SciPy 1.11.4 vs 1.15.3 difference? This seems like an odd change, and it is very difficult to test because it requires swapping out SciPy. But I cannot think of any other explanation, though this does not really clarify how a standard algorithm, given the same information, began to come up with different answers.
For the sake of being exhaustive: per https://docs.scipy.org/doc/scipy/release/1.15.0-notes.html#scipy-optimize-improvements, L-BFGS-B was ported from Fortran 77 to C in version 1.15.0.
I did a deeper dive into the code to try to isolate the issue. At least part of the inconsistency is that the `AnnData.X` matrix I initially pass into PyDESeq2 is float32. In the newer version, the float32 is cast to int, but in the older version it remains float32. If the data are cast to int in the older version, the genewise dispersion estimates match the newer version. But the number of estimates set to 1e-8 remains concerning.
In the float32 version (everything else held equal), the low-dispersion region is unstable, but the search incidentally terminates at the true higher-dispersion optimum a fair fraction of the time. I was not able to reproduce this instability on Colab.
In the int version, the low-dispersion region is stable but the gradient is too small to escape the initial guess.
Starting at the (naive) domain midpoint seems to produce better results in some cases, but not generally.
Somewhat surprisingly, L-BFGS-B tends to produce higher loss values than grid search. But I am not sure how indicative this is.
Working hypothesis: if the loss function is poorly behaved (e.g., due to obscure precision effects), the SciPy L-BFGS-B mostly gives up and PyDESeq2 reverts to grid search (which is reliable). But if the loss function is well-behaved, it risks getting trapped in a "local minimum" around 1e-8 (quotes because slightly increasing the dispersion value decreases the loss; the gradient is just small). So this may well just be a challenging optimization problem with no clear solution.
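To make the flat landscape concrete, here is a minimal probe along these lines (synthetic data, not the notebook's; it assumes `nb_nll` from `pydeseq2.utils` with the signature `nb_nll(counts, mu, alpha)` used elsewhere in this thread):

```python
import numpy as np
from pydeseq2.utils import nb_nll  # assumed signature: nb_nll(counts, mu, alpha)

rng = np.random.default_rng(0)
mu = np.full(70, 5.0)                    # one gene's fitted means (synthetic)
counts = rng.poisson(mu).astype(float)   # near-Poisson counts, i.e. tiny true dispersion

# Evaluate the NLL over a log-spaced grid spanning the min_disp..max_disp domain.
alphas = np.logspace(-8, 2, 200)
nll = np.array([nb_nll(counts, mu, a) for a in alphas])

# A long flat tail toward min_disp with only a shallow optimum elsewhere is
# exactly the landscape in which L-BFGS-B can stall near the 1e-8 bound.
print(alphas[np.argmin(nll)], nll.min())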
Mind sharing the notebook you used to make these graphs? I'm curious whether there is a SciPy optimization function that would work better than `minimize()`.
@nickodell sure, here you go: https://github.com/Fauna-Bio/GG_2025/blob/pydeseq2_bug/pydeseq2/5_differential_expression_L-BFGS-B.ipynb
The h5ad files (same directory) are obtained by applying `ad_f = ad_f[ad_f.obs.dataset.isin(['litvinukova'])].copy()` after file import in the other notebook, running the respective pipelines, and stripping out anything in the resulting dds objects that cannot be saved.
It seems like there are some numerical stability issues in `nb_nll` and `dnb_nll` for very small values of alpha.
```python
import numpy as np
from scipy.special import polygamma

def dnb_nll(counts: np.ndarray, mu: np.ndarray, alpha: float) -> float:
    alpha_neg1 = 1 / alpha
    ll_part = (
        alpha_neg1**2
        * (
            polygamma(0, alpha_neg1)
            - polygamma(0, counts + alpha_neg1)
            + np.log(1 + mu * alpha)
            + (counts - mu) / (mu + alpha_neg1)
        ).sum()
    )
    return -ll_part
```
Specifically, it seems like `polygamma(0, alpha_neg1) - polygamma(0, counts + alpha_neg1)` suffers a catastrophic cancellation that significantly reduces precision around alpha=1e-12. (I am not sure whether the author of this package would consider this a bug, or would say that one should set `min_disp` higher to prevent the function from being evaluated at this point.)
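A quick standalone sketch of the cancellation, using the first-order Taylor expansion of the digamma function as the reference value:

```python
import numpy as np
from scipy.special import polygamma

alpha = 1e-12
alpha_neg1 = 1.0 / alpha
counts = np.array([1.0, 10.0, 100.0])

# Naive evaluation: subtracts two numbers of magnitude ~27.6 whose true
# difference is only ~counts * alpha, so most significant digits cancel.
naive = polygamma(0, alpha_neg1) - polygamma(0, counts + alpha_neg1)

# First-order Taylor reference: psi(x + c) - psi(x) ≈ c * psi'(x),
# accurate here because counts << alpha_neg1.
reference = -counts * polygamma(1, alpha_neg1)

print(naive)      # noisy, few (if any) correct digits
print(reference)  # ~ -counts * 1e-12
```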
I have two ideas here (see the sketch after this list):
- For large `alpha_neg1`, the expression `polygamma(0, alpha_neg1) - polygamma(0, counts + alpha_neg1)` is nearly equivalent to `-counts * polygamma(1, alpha_neg1)`. It is a good approximation for large `alpha_neg1`, but breaks down if the derivative of polygamma changes significantly over the range [alpha_neg1, counts + alpha_neg1].
- The expression `np.log(1 + mu * alpha)` can be rewritten as `np.log1p(mu * alpha)`. This can improve accuracy, because the expression `1 + mu * alpha` can round the result if `1` is much bigger than `mu * alpha`. However, I did not observe that this actually improved stability.
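Combining both ideas, a hypothetical stabilized variant might look like this (the switch-over threshold is a guess and would need testing; this is a sketch, not a proposed patch):

```python
import numpy as np
from scipy.special import polygamma

def dnb_nll_stable(counts: np.ndarray, mu: np.ndarray, alpha: float) -> float:
    # Hypothetical variant of dnb_nll, not the PyDESeq2 implementation.
    alpha_neg1 = 1 / alpha
    if counts.max() * alpha < 1e-8:
        # Tiny alpha: replace the cancellation-prone digamma difference with
        # its first-order Taylor expansion, psi(x) - psi(x + c) ≈ -c * psi'(x).
        digamma_diff = -counts * polygamma(1, alpha_neg1)
    else:
        digamma_diff = polygamma(0, alpha_neg1) - polygamma(0, counts + alpha_neg1)
    ll_part = alpha_neg1**2 * (
        digamma_diff
        + np.log1p(mu * alpha)  # log1p avoids rounding when mu * alpha << 1
        + (counts - mu) / (mu + alpha_neg1)
    ).sum()
    return -ll_part
```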
I worked through the same exercise with `nb_nll`, and I don't have any bright ideas to improve it.
Hi @gennadyFauna and @nickodell,
Thank you so much for digging into this, and sorry for my late reply (I'm just back from vacation). I'll need some time to fully process the discussion, but here are a few remarks and questions off the top of my head:
- The loss function is indeed not so well-behaved, in the sense that it is not convex and can have some very flat regions, as @gennadyFauna's plots show. This makes me think that changes in scipy's L-BFGS-B implementation could have strong impacts.
- Related to the above point: if it works, changing the initial guess could be a good fix. My guideline up to now was to reproduce DESeq2's (the R package) behavior as closely as possible, unless I had a very good reason not to. We would have to do careful testing and benchmarking before changing the initial value.
- @nickodell thanks for investigating the numerical stability of `nb_nll` and `dnb_nll`. These functions have some known stability issues indeed, but I'm not sure how to fix them. The `log1p` thing is a nice catch; I'm not sure it'll solve our problems, but it can only help. As for the catastrophic cancellation issue, have you observed it in higher-value regions? I'm asking because if not, the `min_disp=1e-8` lower bound enforced throughout the software should be sufficient.
- Something to investigate: perhaps we could catch L-BFGS-B silent failures by enforcing stricter convergence criteria (`ftol` and `gtol`)? This might come at the price of overall slower convergence, though.
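For concreteness, a rough sketch of what stricter tolerances might look like in the `minimize()` call (toy 1-D objective and placeholder values, not PyDESeq2's actual call site):

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective standing in for the per-gene NLL over log-dispersion.
def nll(log_alpha):
    return np.log1p((log_alpha[0] + 2.0) ** 2)

res = minimize(
    nll,
    x0=np.log([0.1]),
    method="L-BFGS-B",
    bounds=[(np.log(1e-8), np.log(10.0))],   # min_disp / max_disp placeholders
    options={"ftol": 1e-14, "gtol": 1e-10},  # stricter than scipy's defaults
)
if not res.success:
    print(res.message)  # candidate for the grid-search fallback
```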
> @nickodell thanks for investigating the numerical stability of `nb_nll` and `dnb_nll`. These functions have some known stability issues indeed, but I'm not sure how to fix them. The `log1p` thing is a nice catch; I'm not sure it'll solve our problems, but it can only help. As for the catastrophic cancellation issue, have you observed it in higher-value regions? I'm asking because if not, the `min_disp=1e-8` lower bound enforced throughout the software should be sufficient.
No, I couldn't find anything in that region. However, it's possible that changing the dataset causes numerical instability above 1e-8; I only tried the input dataset that OP posted.
> Something to investigate: perhaps we could catch L-BFGS-B silent failures by enforcing stricter convergence criteria (`ftol` and `gtol`)? This might come at the price of overall slower convergence, though.
My gut reaction is that it probably wouldn't help by itself.
My impression from the bug report in https://github.com/scipy/scipy/issues/23161 is that the major difference between SciPy versions is the following:
- In both versions: L-BFGS-B evaluates the function and gradient at various points to find a good direction to perform line searches in.
- In both versions: L-BFGS-B uses the function and gradient evaluations to build a low-rank approximation of $H^{-1}$, the inverse Hessian of your function.
- In both versions: the values inside $H^{-1}$ get bigger and bigger over time, which causes L-BFGS-B to take larger and larger steps.
- In the old L-BFGS-B: the value of $H^{-1}$ is reset to the identity matrix for some poorly understood reason, resulting in smaller steps.
- In the new L-BFGS-B: the value of $H^{-1}$ is not reset, and L-BFGS-B keeps trying larger and larger initial steps.
So my impression is that lower values for `ftol` and `gtol` would not help this corner case, because they don't address the value of $H^{-1}$. Instead, L-BFGS-B will exit with an abnormal exit code because the line search cannot find a lower value.
However, optimizing with low values of `ftol`/`gtol`, combined with restarting the optimization with a new call to `minimize()`, would also reset $H^{-1}$, which might help.
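A sketch of what that restart pattern could look like (toy objective; each fresh `minimize()` call rebuilds L-BFGS-B's internal state, so the inverse-Hessian approximation discussed above starts over from the identity):

```python
import numpy as np
from scipy.optimize import minimize

def f(x):  # toy objective standing in for the dispersion NLL
    return np.log1p((x[0] - 3.0) ** 2)

x0 = np.array([0.0])
for _ in range(3):
    # Restarting from the previous solution resets the inverse-Hessian
    # approximation while keeping the progress made so far.
    res = minimize(f, x0, method="L-BFGS-B",
                   options={"ftol": 1e-12, "gtol": 1e-10})
    x0 = res.x
    if res.success:
        break
print(res.x, res.fun)
```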
Powell might be more reliable (i.e., match the grid search result and the visual optimum) for cases with flat likelihoods that nevertheless have a real minimum: [plot]
This seems to hold (though less reliably) for degenerate cases: [plot]
The results generally match grid search (n.b. also compare magnitude to the other method), although there are exceptions for unusually pathological loss landscapes.
This generally happens when `grid_fit_alpha` goes outside the `min_disp` to `max_disp` domain. See this line: https://github.com/owkin/PyDESeq2/blob/918715ce2c9dac327ccad08441136edc1ae60b6f/pydeseq2/grid_search.py#L134
Anyway, Powell might be a useful option to add for dispersion estimation. It would probably not work as a fallback, because the issue is L-BFGS-B failing silently.
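For illustration, switching methods in the `minimize()` call would look roughly like this (toy objective; scipy's Powell accepts bounds in recent versions and ignores any gradient):

```python
import numpy as np
from scipy.optimize import minimize

def nll(log_alpha):  # toy stand-in for the per-gene dispersion NLL
    return np.log1p((log_alpha[0] + 2.0) ** 2)

res = minimize(
    nll,
    x0=np.log([0.1]),
    method="Powell",                        # derivative-free line searches
    bounds=[(np.log(1e-8), np.log(10.0))],  # min_disp / max_disp placeholders
)
print(np.exp(res.x), res.fun)
```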
Is this failure also happening on an Ubuntu box? I've tested the problematic case from the SciPy issue on Win10, Win11, Fedora, and Mint, and could not reproduce it. I could only reproduce it on Ubuntu.
So far it seems to me that the new version is using linked BLAS libraries, while the old one used its own baked-in BLAS and LAPACK routines. As far as I can tell there are no other differences, so I'm inclined to blame it on that: I pretty much compared the working and non-working versions side by side, with each line debug-printed.
Hi @gennadyFauna, this is interesting, we could indeed add Powell as an alternative non-default optimizer in https://github.com/owkin/PyDESeq2/blob/c209d3feb25fc625404739fba3ba70e1673b8651/pydeseq2/utils.py#L443
Feel free to open a PR!
Otherwise I'll try to give it a go, but I can't guarantee when.