scanpy icon indicating copy to clipboard operation
scanpy copied to clipboard

sc.pp.highly_variable does not always handle `subset` and `inplace` arguments consistently

Open jlause opened this issue 3 years ago • 1 comments

Hey, while writing tests for #1715 I noted the following behavior:

output = sc.pp.highly_variable(adata,inplace=True,subset=False,n_top_genes=100) --> Returns nothing :heavy_check_mark: --> adata.var fields are updated but shape stays the same :heavy_check_mark:

output = sc.pp.highly_variable(adata,inplace=True,subset=True,n_top_genes=100) --> Returns nothing :heavy_check_mark: --> adata.var fields are updated and shape is subsetted correctly :heavy_check_mark:

output = sc.pp.highly_variable(adata,inplace=False,subset=False,n_top_genes=100) --> output is a dataframe with the original number of genes as rows :heavy_check_mark: --> adata is unchanged :heavy_check_mark:

output = sc.pp.highly_variable(adata,inplace=False,subset=True,n_top_genes=100) --> Returns nothing :x: --> adata shape is changed an var fields are updated :x:

I think the last case is unexpected, assuming that inplace=False should protect adata from changes and results in an output. I think these parts of code are the cause (depending on the flavor):

https://github.com/theislab/scanpy/blob/4dd8de9e355ce8d59009c15522670a3d0970462c/scanpy/preprocessing/_highly_variable_genes.py#L148-L174

https://github.com/theislab/scanpy/blob/4dd8de9e355ce8d59009c15522670a3d0970462c/scanpy/preprocessing/_highly_variable_genes.py#L528-L553

To fix this, one could change if subset or inplace: to if inplace:. Then one would need to add another if subet: in the else block, like so (half-pseudocode):

    if inplace:        
        
       #update adata
        
        if batch_key is not None:
           #drop batch related keys
        if subset:
            adata._inplace_subset_var(df['highly_variable'].values)
    else:
        if batch_key is None:
            #drop batch related keys
        if subset:            
            df=df.iloc[df.highly_variable.values,:]
            
        return df

I can make a quick PR if this is the intended behavior. :slightly_smiling_face: best, jan.

jlause avatar Jun 09 '21 17:06 jlause

PS: Minimal example:

adata = sc.datasets.pbmc3k()
sc.pp.filter_genes(adata,min_cells=1)
print(adata.shape)
output = sc.pp.highly_variable_genes(adata,flavor='cell_ranger',inplace=False,subset=True)
print(adata.shape)
print(output)

jlause avatar Jun 09 '21 17:06 jlause

Nice catch - agree on all points regarding inconsistency, the causing sections & the solution with @jlause.

Made the PR implementing the "half-pseudocode" and added tests which catch your described unexpected behavior for all flavor.

eroell avatar Nov 16 '23 14:11 eroell