scanpy
scanpy copied to clipboard
Unexplainable sc.pp.highly_variable_genes(subset = True) behavior
Please make sure these conditions are met
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of scanpy.
- [x] (optional) I have confirmed this bug exists on the main branch of scanpy.
What happened?
When doing HVG selection with batch_key = "something" and subset = True, I noticed unexpected genes to be selected in the subset anndata. Upon further investigation, it seems that somehow the inplace subsetting goes wrong. (Though I checked the source code and could not find any issue that may explain it there.)
Minimal code sample
import scanpy as sc
import numpy as np
np.random.seed(0)
# Get AnnData
adata = sc.datasets.pbmc3k()
adata.layers['counts'] = adata.X.copy().tocsr()
adata.obs["Age"] = np.random.randint(0, 6, (2700,))
adata.obs["Age"] = adata.obs["Age"].astype('category')
# Filter genes, preprocess
sc.pp.filter_genes(adata, min_counts = 10)
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
# Subset = False
ad_nosub = adata.copy()
sc.pp.highly_variable_genes(ad_nosub, n_top_genes = 1000, batch_key = "Age", subset = False)
# Subset = False, manual subset afterwards
ad_nosub_subbed = ad_nosub.copy()
ad_nosub_subbed._inplace_subset_var(ad_nosub_subbed.var["highly_variable"].to_numpy())
# Subset = True
ad_sub = adata.copy()
sc.pp.highly_variable_genes(ad_sub, n_top_genes = 1000, batch_key = "Age", subset = True)
Error output
>>> # As expected
>>> print(np.sum(ad_nosub.var["highly_variable"]))
1000
>>>
>>> # As expected
>>> print(np.sum(ad_nosub_subbed.var["highly_variable"]))
1000
>>>
>>> # Not as expected
>>> print(np.sum(ad_sub.var["highly_variable"]))
101
Versions
→ conda list | grep scanpy
scanpy 1.10.1 pyhd8ed1ab_0 conda-forge
→ conda list | grep anndata
anndata 0.10.7 pyhd8ed1ab_0 conda-forge
Hey, thanks a lot for spotting & the nice reproducible example! Big help.
From my first look into it, it seems this is a bug indeed;
Bug appearing when
-
batch_key
is notNone
and -
flavor
is“seurat”
or“cell_ranger”
and - then using
subset=True
.
Other cases are not suffering from this it seems (e.g. flavor="seurat_v3"
, or when batch_key=None
).
Issue
It appears in the cases describe above, subset=True
will cause the first n_top_genes
many genes of adata.var
to be used as selection: not the actual n_top_genes
highly variable genes.
Fix is on the way: I'll follow up here.
Your Example
Reveals that sc.pp.highly_variable_genes(ad_sub, n_top_genes = 1000, batch_key = "Age", subset = True)
suffers from this.
Circumvent bug
For now, I recommend not using subset=True
if the cases above hold for you:
Rather, use
adata_subset = adata[:, adata.var["highly_variable"]]
when subsetting: which is basically the "subsetting afterwards" strategy.