scanpy batch_key of highly_variable

Hey!

I'm currently having an issue with the batch_key functionality of sc.pp.highly_variable_genes introduced in https://github.com/theislab/scanpy/pull/622 by @gokceneraslan.

If I try setting batch_key, I get a TypeError:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in 
----> 1 sc.pp.highly_variable_genes(adata, batch_key="sample", flavor='cell_ranger', n_top_genes=4000, inplace=False)

~/.conda/envs/sc-tutorial/lib/python3.7/site-packages/scanpy/preprocessing/_highly_variable_genes.py in highly_variable_genes(adata, min_disp, max_disp, min_mean, max_mean, n_top_genes, n_bins, flavor, subset, inplace, batch_key)
    336             dtypes.append([('highly_variable_nbatches', int),
    337                            ('highly_variable_intersection', np.bool_)])
--> 338         return np.rec.fromarrays(arrays, dtype=dtypes)

~/.conda/envs/sc-tutorial/lib/python3.7/site-packages/numpy/core/records.py in fromarrays(arrayList, dtype, shape, formats, names, titles, aligned, byteorder)
    606 
    607     if dtype is not None:
--> 608         descr = sb.dtype(dtype)
    609         _names = descr.names
    610     else:

TypeError: data type not understood

I have tried with multiple categorical columns and it worked with none of them.

Jul 30 '19 13:07 danielStrobl

It looks like this error occurs whenever batch_key is specified and inplace=False.

MCVE:

import scanpy as sc
pbmc = sc.datasets.pbmc68k_reduced()
pbmc.X = pbmc.raw.X  # So we have reasonable values to calculate on

# These do not throw an error:
sc.pp.highly_variable_genes(pbmc, batch_key="phase")
sc.pp.highly_variable_genes(pbmc, inplace=False)

# This throws an error
sc.pp.highly_variable_genes(pbmc,  batch_key="phase", inplace=False)

This does raise the question of what inplace=False should return for the batch case. I'd think a recarray (maybe this should change to a dataframe?) with metrics for each of the batches. You could end up with columns with names like: means_{batch1}, dispersions_{batch1} etc.

@danielStrobl, what were you expecting this to return?

Aug 01 '19 06:08 ivirshup

I've run into the same issue. I'm expecting it to return the a recarray with the fields that sc.pp.highly_variable_genes() would return without batch_key set. And additionally with the fields highly_variable_nbatches and highly_variable_intersection.

Sep 04 '19 15:09 LuckyMD

There is a further issue with this version of the function as well. If a batch has 0 variance for multiple genes, then the _highly_variable_genes_single_batch() function will not work on this. Thus, it would be good to have some sort of gene filtering before running the single batch versions. Everything that is filtered out, would then just get a nan dispersion value for that batch.

Sep 09 '19 11:09 LuckyMD

batch_key of highly_variable_genes not working