scanpy
scanpy copied to clipboard
sc.pp.highly_variable does not always handle `subset` and `inplace` arguments consistently
Hey, while writing tests for #1715 I noted the following behavior:
output = sc.pp.highly_variable(adata,inplace=True,subset=False,n_top_genes=100)
--> Returns nothing :heavy_check_mark:
--> adata.var
fields are updated but shape stays the same :heavy_check_mark:
output = sc.pp.highly_variable(adata,inplace=True,subset=True,n_top_genes=100)
--> Returns nothing :heavy_check_mark:
--> adata.var
fields are updated and shape is subsetted correctly :heavy_check_mark:
output = sc.pp.highly_variable(adata,inplace=False,subset=False,n_top_genes=100)
--> output is a dataframe with the original number of genes as rows :heavy_check_mark:
--> adata
is unchanged :heavy_check_mark:
output = sc.pp.highly_variable(adata,inplace=False,subset=True,n_top_genes=100)
--> Returns nothing :x:
--> adata
shape is changed an var
fields are updated :x:
I think the last case is unexpected, assuming that inplace=False
should protect adata
from changes and results in an output. I think these parts of code are the cause (depending on the flavor):
https://github.com/theislab/scanpy/blob/4dd8de9e355ce8d59009c15522670a3d0970462c/scanpy/preprocessing/_highly_variable_genes.py#L148-L174
https://github.com/theislab/scanpy/blob/4dd8de9e355ce8d59009c15522670a3d0970462c/scanpy/preprocessing/_highly_variable_genes.py#L528-L553
To fix this, one could change if subset or inplace:
to if inplace:
. Then one would need to add another if subet:
in the else
block, like so (half-pseudocode):
if inplace:
#update adata
if batch_key is not None:
#drop batch related keys
if subset:
adata._inplace_subset_var(df['highly_variable'].values)
else:
if batch_key is None:
#drop batch related keys
if subset:
df=df.iloc[df.highly_variable.values,:]
return df
I can make a quick PR if this is the intended behavior. :slightly_smiling_face: best, jan.
PS: Minimal example:
adata = sc.datasets.pbmc3k()
sc.pp.filter_genes(adata,min_cells=1)
print(adata.shape)
output = sc.pp.highly_variable_genes(adata,flavor='cell_ranger',inplace=False,subset=True)
print(adata.shape)
print(output)
Nice catch - agree on all points regarding inconsistency, the causing sections & the solution with @jlause.
Made the PR implementing the "half-pseudocode" and added tests which catch your described unexpected behavior for all flavor
.