batch_key of highly_variable_genes not working
Hey!
I'm currently having an issue with the batch_key functionality of sc.pp.highly_variable_genes introduced in https://github.com/theislab/scanpy/pull/622 by @gokceneraslan.
If I try setting batch_key, I get a TypeError:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in
----> 1 sc.pp.highly_variable_genes(adata, batch_key="sample", flavor='cell_ranger', n_top_genes=4000, inplace=False)
~/.conda/envs/sc-tutorial/lib/python3.7/site-packages/scanpy/preprocessing/_highly_variable_genes.py in highly_variable_genes(adata, min_disp, max_disp, min_mean, max_mean, n_top_genes, n_bins, flavor, subset, inplace, batch_key)
336 dtypes.append([('highly_variable_nbatches', int),
337 ('highly_variable_intersection', np.bool_)])
--> 338 return np.rec.fromarrays(arrays, dtype=dtypes)
~/.conda/envs/sc-tutorial/lib/python3.7/site-packages/numpy/core/records.py in fromarrays(arrayList, dtype, shape, formats, names, titles, aligned, byteorder)
606
607 if dtype is not None:
--> 608 descr = sb.dtype(dtype)
609 _names = descr.names
610 else:
TypeError: data type not understood
I have tried with multiple categorical columns and it worked with none of them.
It looks like this error occurs whenever batch_key is specified and inplace=False.
MCVE:
import scanpy as sc
pbmc = sc.datasets.pbmc68k_reduced()
pbmc.X = pbmc.raw.X # So we have reasonable values to calculate on
# These do not throw an error:
sc.pp.highly_variable_genes(pbmc, batch_key="phase")
sc.pp.highly_variable_genes(pbmc, inplace=False)
# This throws an error
sc.pp.highly_variable_genes(pbmc, batch_key="phase", inplace=False)
This does raise the question of what inplace=False should return for the batch case. I'd think a recarray (maybe this should change to a dataframe?) with metrics for each of the batches. You could end up with columns with names like: means_{batch1}, dispersions_{batch1} etc.
@danielStrobl, what were you expecting this to return?
I've run into the same issue. I'm expecting it to return the a recarray with the fields that sc.pp.highly_variable_genes() would return without batch_key set. And additionally with the fields highly_variable_nbatches and highly_variable_intersection.
There is a further issue with this version of the function as well. If a batch has 0 variance for multiple genes, then the _highly_variable_genes_single_batch() function will not work on this. Thus, it would be good to have some sort of gene filtering before running the single batch versions. Everything that is filtered out, would then just get a nan dispersion value for that batch.