scanpy icon indicating copy to clipboard operation
scanpy copied to clipboard

Unexplainable sc.pp.highly_variable_genes(subset = True) behavior

Open jberkh opened this issue 10 months ago • 1 comments

Please make sure these conditions are met

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of scanpy.
  • [x] (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

When doing HVG selection with batch_key = "something" and subset = True, I noticed unexpected genes to be selected in the subset anndata. Upon further investigation, it seems that somehow the inplace subsetting goes wrong. (Though I checked the source code and could not find any issue that may explain it there.)

Minimal code sample

import scanpy as sc
import numpy as np
np.random.seed(0)

# Get AnnData
adata = sc.datasets.pbmc3k()
adata.layers['counts'] = adata.X.copy().tocsr()
adata.obs["Age"] = np.random.randint(0, 6, (2700,))
adata.obs["Age"] = adata.obs["Age"].astype('category')

# Filter genes, preprocess
sc.pp.filter_genes(adata, min_counts = 10)
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

# Subset = False
ad_nosub = adata.copy()
sc.pp.highly_variable_genes(ad_nosub, n_top_genes = 1000, batch_key = "Age", subset = False)

# Subset = False, manual subset afterwards
ad_nosub_subbed = ad_nosub.copy()
ad_nosub_subbed._inplace_subset_var(ad_nosub_subbed.var["highly_variable"].to_numpy())

# Subset = True
ad_sub = adata.copy()
sc.pp.highly_variable_genes(ad_sub, n_top_genes = 1000, batch_key = "Age", subset = True)

Error output

>>> # As expected
>>> print(np.sum(ad_nosub.var["highly_variable"]))
1000
>>> 
>>> # As expected
>>> print(np.sum(ad_nosub_subbed.var["highly_variable"]))
1000
>>> 
>>> # Not as expected
>>> print(np.sum(ad_sub.var["highly_variable"]))
101

Versions

→ conda list | grep scanpy
scanpy                    1.10.1             pyhd8ed1ab_0    conda-forge

→ conda list | grep anndata
anndata                   0.10.7             pyhd8ed1ab_0    conda-forge

jberkh avatar Apr 24 '24 10:04 jberkh

Hey, thanks a lot for spotting & the nice reproducible example! Big help.

From my first look into it, it seems this is a bug indeed;

Bug appearing when

  • batch_key is not None and
  • flavor is “seurat” or “cell_ranger” and
  • then using subset=True.

Other cases are not suffering from this it seems (e.g. flavor="seurat_v3", or when batch_key=None).

Issue It appears in the cases describe above, subset=True will cause the first n_top_genes many genes of adata.var to be used as selection: not the actual n_top_genes highly variable genes.

Fix is on the way: I'll follow up here.

Your Example Reveals that sc.pp.highly_variable_genes(ad_sub, n_top_genes = 1000, batch_key = "Age", subset = True) suffers from this.

Circumvent bug For now, I recommend not using subset=True if the cases above hold for you: Rather, use

adata_subset = adata[:, adata.var["highly_variable"]]

when subsetting: which is basically the "subsetting afterwards" strategy.

eroell avatar May 02 '24 14:05 eroell