scvi-tools `use_observed_lib_size=False` fails with NaN for some datasets

Hi,

Setting `use_observed_lib_size=False` fails with NaN for some datasets. I would like to use this setting to estimate the technical normalisation effect per cell, which does not necessarily match the total count per cell, especially in more complex data from multiple tissues and cell types. For one dataset, restarting the notebook and re-running all cells a few times sometimes helps, as does reducing `n_hidden` from 1024 to 512. For a different dataset, even `n_hidden=128` and `n_latent=30` do not work. It is hard to provide a reproducible example because I observed this issue on unpublished snRNA-seq data.

My guess would be that use_observed_lib_size=False is not particularly numerically stable. I observed a similar issue with other models when priors are selected in suboptimal ways. Which prior is used for this technical normalisation effect?

I generally use a batch-specific prior that regularises the model to keep the cell-specific normalisation y_c close to 1 (e.g. in cell2location package):

y_c ~ Gamma(a, a / y_e)
y_e ~ Gamma(10, 10)

This regularises the batch-specific normalisation effect y_e to be close to 1, and regularises the cell-specific normalisation effect y_c to be close to the average for its batch, y_e, using the hyperparameter a.
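A minimal sketch of sampling from this hierarchical prior, to show that y_c concentrates around y_e. It assumes the shape/rate parameterisation of the Gamma distribution (so the mean is shape / rate); the function name and the value a = 20 are illustrative only and are not taken from cell2location.

```python
import random
import statistics


def sample_normalisation_prior(n_cells, a=20.0, seed=0):
    """Draw one batch effect y_e and n_cells cell effects y_c from the prior.

    Assumes Gamma(shape, rate). random.gammavariate takes (shape, scale),
    so each rate is converted to scale = 1 / rate.
    """
    rng = random.Random(seed)
    # y_e ~ Gamma(10, 10): mean 1, variance 1 / 10
    y_e = rng.gammavariate(10.0, 1.0 / 10.0)
    # y_c ~ Gamma(a, a / y_e): mean y_e, variance y_e**2 / a
    y_c = [rng.gammavariate(a, y_e / a) for _ in range(n_cells)]
    return y_e, y_c


y_e, y_c = sample_normalisation_prior(n_cells=10_000, a=20.0)
print("batch effect y_e:", y_e)
print("mean of y_c:", statistics.mean(y_c))
print("stdev of y_c:", statistics.stdev(y_c))
```

Under this parameterisation, a larger a shrinks the spread of y_c around y_e, since the variance is y_e² / a.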

Please let me know what you think is going on with this NaN issue under `use_observed_lib_size=False`, and what you think about using more regularised priors.

Feb 11 '23 15:02 vitkl

Can you report the minimum and maximum of your library size, and whether this behavior depends on them? I have observed this when not filtering for min_counts. I am not sure whether the problem is the range or the absolute value. This might not be what you are looking for, but it could help figure out the problem.

Feb 11 '23 17:02 canergen

This is why we switched the default. I imagine the problem exists because we use an exp to transform the log library size, where we could instead consider using softplus.
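A small sketch of why exp could be the fragile step, assuming the encoder occasionally emits a large raw value for the library size. It compares exp with a numerically stable softplus and with their derivatives; the helper names are illustrative and this is not the scvi-tools code.

```python
import math


def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_grad(x):
    """Derivative of softplus, i.e. the logistic sigmoid; always in [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# exp is its own derivative, so both the value and the gradient grow
# exponentially with the raw output; softplus grows only linearly.
for raw in (5.0, 20.0, 80.0):
    print(
        f"raw={raw:5.1f}  exp={math.exp(raw):.3e}  "
        f"softplus={softplus(raw):.3f}  softplus_grad={softplus_grad(raw):.3f}"
    )
```

Note that replacing exp with softplus changes what the raw encoder output means: it would no longer be the log library size.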

The prior is described here. It's designed so that $\ell_n$ is on the same scale as the observed library size.

Feb 11 '23 17:02 adamgayoso

After reviewing this more carefully, I think:

1. The prior l_sigma is an overestimate of the total variance that can be attributed to the technical effect. A potential (and easy) fix would be to add a hyperparameter that allows the user to reduce the prior variance using a simple weight.
2. Indeed, it is possible that softplus would make the computation more stable. Is it easy to change to softplus exclusively for size factors? Is this the operation here: https://github.com/scverse/scvi-tools/blob/library_stability/scvi/nn/_base_components.py#L290?
3. I assume that the encoder network size, n_hidden, is the same for both z and l, which in my example is a pretty large number. This would mean that the network is massively overparameterised. In my trials of amortizing inference for cell2location, I observed that such 1d parameters need to be amortized with a much smaller network (n_hidden=10) to achieve numerical stability and avoid loss increase. Is it possible to change n_hidden exclusively for this parameter? If yes, I would like to try this and report results.

I think a combination of 1 and 3 could solve this.
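To illustrate point 1, here is a rough sketch of how a weight on the prior variance would change the KL penalty on the library size. It assumes a Normal prior and a Normal variational posterior on the log library size; `prior_var_weight` is a hypothetical hyperparameter, not an existing scvi-tools argument.

```python
import math


def kl_normal(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for scalar Normals."""
    return (
        math.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p**2)
        - 0.5
    )


def library_kl(mu_q, sigma_q, prior_mean, prior_var, prior_var_weight=1.0):
    """KL term for the log library size with a down-weighted prior variance.

    prior_var_weight is hypothetical: values below 1 shrink the prior
    variance and so penalise deviations from prior_mean more strongly.
    """
    sigma_p = math.sqrt(prior_var_weight * prior_var)
    return kl_normal(mu_q, sigma_q, prior_mean, sigma_p)


# Posterior mean two log-units away from the prior mean.
for weight in (1.0, 0.5, 0.1):
    kl = library_kl(8.0, 0.5, prior_mean=6.0, prior_var=1.0, prior_var_weight=weight)
    print(f"weight={weight:.1f}  KL={kl:.3f}")
```

With these illustrative numbers, the penalty grows as the weight decreases, which is the regularising effect the proposal is after.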

I assume that when the library size is learnable, the biological expression is transformed to positive using softplus rather than softmax, correct?

Mar 13 '23 18:03 vitkl

This line https://github.com/scverse/scvi-tools/blob/library_stability/scvi/module/_vae.py#L218 means that softplus is used here https://github.com/scverse/scvi-tools/blob/library_stability/scvi/nn/_base_components.py#L406 only when use_size_factor_key == True. This doesn't make sense - it should be "softplus" if (use_size_factor_key or not use_observed_lib_size) else "softmax".

If you softmax transform gene expression prediction, it doesn't matter if the library size is estimated or "observed total count" - it has to match the same total count per cell.
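The point can be checked with a few lines of plain Python. This is a sketch with made-up logits and a made-up library size, not the scvi-tools decoder: with softmax the expected counts always sum to the library size, while with softplus the total depends on the decoder output.

```python
import math


def softmax(logits):
    """Softmax over a list of logits; the outputs sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def expected_counts(logits, library, activation):
    """Expected counts per gene = library * activation(decoder logits)."""
    if activation == "softmax":
        scale = softmax(logits)
    elif activation == "softplus":
        scale = [softplus(v) for v in logits]
    else:
        raise ValueError(f"unknown activation: {activation}")
    return [library * s for s in scale]


library = 1000.0  # illustrative library size, observed or estimated
cell_a = [0.1, 2.0, -1.0]  # illustrative decoder logits for three genes
cell_b = [3.0, 4.0, 1.5]
for activation in ("softmax", "softplus"):
    totals = [sum(expected_counts(c, library, activation)) for c in (cell_a, cell_b)]
    print(activation, [round(t, 3) for t in totals])
```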

Mar 14 '23 17:03 vitkl

> only when use_size_factor_key == True. This doesn't make sense - it should be "softplus" if (use_size_factor_key or not use_observed_lib_size) else "softmax".

We need to preserve backwards compatibility. scVI was originally described with a latent library size and a softmax transformation.

Mar 14 '23 17:03 adamgayoso

Ok, I can define an option to use softplus but keep the rest the same.
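One possible shape for such an option, as a sketch only. The function name and the `softplus_for_latent_library` flag are hypothetical and do not exist in scvi-tools. The default reproduces the behaviour described above (softplus only with `use_size_factor_key`), and the flag opts in to the proposed rule.

```python
def resolve_scale_activation(
    use_size_factor_key,
    use_observed_lib_size,
    softplus_for_latent_library=False,  # hypothetical opt-in flag
):
    """Pick the activation for the decoder's expression scale output."""
    if use_size_factor_key:
        return "softplus"
    # Opt-in: also use softplus when the library size is learned.
    if softplus_for_latent_library and not use_observed_lib_size:
        return "softplus"
    # Backwards-compatible default, matching the original scVI description.
    return "softmax"


print(resolve_scale_activation(False, False))        # default behaviour
print(resolve_scale_activation(False, False, True))  # with the opt-in flag
```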

Mar 14 '23 18:03 vitkl
