scvi-tools
[TOTALVI] NaN values during training when `empirical_protein_background_prior=True`
First reported in https://discourse.scverse.org/t/totalvi-nan-loss-with-few-proteins/45/3, this bug was recently reproduced in a dataset with 300 proteins. The cause of the issue is unknown, but setting `empirical_protein_background_prior=False` results in proper training.
Associated code: https://github.com/scverse/scvi-tools/blob/master/scvi/model/_totalvi.py#L1073
Versions:
v0.16.4
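For context, the linked initialization estimates the protein background prior empirically from the data; if I recall correctly it fits a two-component mixture to log protein counts. The snippet below is only a toy sketch of that kind of heuristic (`sketch_background_prior` is a made-up helper, not the scvi-tools implementation). It shows how a barely detected protein yields a degenerate fit, with the background scale collapsing toward zero, which is the sort of input that could push the real heuristic toward extreme prior parameters.

```python
# Toy sketch only -- sketch_background_prior is a made-up helper, not scvi-tools code.
import numpy as np
from sklearn.mixture import GaussianMixture


def sketch_background_prior(protein_counts):
    """Fit a 2-component GMM to log1p counts of a single protein and return the
    mean/std of the lower-mean component, treated here as 'background'."""
    log_counts = np.log1p(protein_counts).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(log_counts)
    means = gmm.means_.ravel()
    stds = np.sqrt(gmm.covariances_.ravel())
    bg = int(np.argmin(means))
    return means[bg], stds[bg]


rng = np.random.default_rng(0)

# Well-detected protein: clear separation between background and foreground counts.
detected = np.concatenate([rng.poisson(2, 500), rng.poisson(80, 500)])
print(sketch_background_prior(detected))

# Barely detected protein: almost all zeros, so the two components are nearly
# identical and the fitted background scale collapses toward zero.
sparse = np.zeros(1000)
sparse[:5] = rng.poisson(3, 5)
print(sketch_background_prior(sparse))
```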
@jjhong922 @romain-lopez I'm unable to reproduce this error when using < 10 proteins.
Code:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scvi
import scanpy as sc

adata = scvi.data.pbmcs_10x_cite_seq()

n_proteins = 1  # can vary this number
adata.obsm["protein_expression"] = adata.obsm["protein_expression"].iloc[:, :n_proteins]

adata.layers["counts"] = adata.X.copy()
adata.raw = adata

sc.pp.highly_variable_genes(
    adata,
    n_top_genes=4000,
    flavor="seurat_v3",
    batch_key="batch",
    subset=True,
    layer="counts",
)

scvi.model.TOTALVI.setup_anndata(
    adata,
    protein_expression_obsm_key="protein_expression",
    layer="counts",
    batch_key="batch",
)

vae = scvi.model.TOTALVI(
    adata,
    latent_distribution="normal",
    empirical_protein_background_prior=True,
    n_layers_decoder=2,
)
vae.train()
```
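As noted in the report above, disabling the empirical prior reportedly trains without NaNs; the only change is the flag passed to the model constructor:

```python
# Workaround mentioned in the report: use the default (non-empirical) protein
# background prior instead of estimating it from the data.
vae = scvi.model.TOTALVI(
    adata,
    latent_distribution="normal",
    empirical_protein_background_prior=False,
    n_layers_decoder=2,
)
vae.train()
```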
300 proteins sounds like a lot, and I suspect many of them are not detected well. I would try filtering the proteins first and then trying again; poorly detected proteins could be throwing off the heuristics used in this initialization.
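For example, a rough way to drop poorly detected proteins from the snippet above before `setup_anndata` (the 10% detection threshold is arbitrary and only illustrative):

```python
# Rough sketch: keep only proteins detected (count > 0) in at least 10% of cells.
# The threshold is arbitrary; tune it for your panel.
protein_df = adata.obsm["protein_expression"]
detection_rate = (protein_df > 0).mean(axis=0)  # fraction of cells with a nonzero count
adata.obsm["protein_expression"] = protein_df.loc[:, detection_rate >= 0.10]
```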
yes, that sounds like good advice; it'd be great to put this in the tutorial, maybe?