Fix sc.pp.filter_genes implementation for h5py.SparseDataset
Right now sc.pp.filter_genes(adata.X, min_counts=10) works correctly for dense file-backed anndata since np.sum(adata.X) here is supported by h5py.Dataset (see https://github.com/h5py/h5py/blob/master/h5py/_hl/dataset.py#L762).
However the same call on a SparseDataset fails with the following error:
~/.anaconda3/lib/python3.7/site-packages/scanpy/preprocessing/_simple.py in filter_genes(data, min_counts, min_cells, max_counts, max_cells, inplace, copy)
230 number_per_gene = number_per_gene.A1
231 if min_number is not None:
--> 232 gene_subset = number_per_gene >= min_number
233 if max_number is not None:
234 gene_subset = number_per_gene <= max_number
TypeError: '>=' not supported between instances of 'SparseDataset' and 'int'
because np.sum(adata.X) weirdly returns a SparseDataset without reducing any dimensions:

I can think of two solutions:
-
Set the
__array_ufunc__method of SparseDatset (https://docs.scipy.org/doc/numpy/reference/arrays.classes.html#numpy.class.array_ufunc) toNonewhich would state explicitly that numpy operations are not supported. Next, we can print a proper error in filter_cells/filter_genes which states such operations are not supported. -
Implement a
sumfunction (which loads the data into memory) which would enable us to perform gene/cell filtering options. We can print a warning here and inform the user that data is loaded to the memory. Something like following works:
def sum(self, *args, **kwargs):
return self.value.sum(*args, **kwargs)

That sum behavior is real weird. I'm pretty surprised numpy doesn't throw an error about that one. I'd go as far as to rename this issue to be about that, since I know filtering sparse backed datasets probably won't work due to how subsetting is defined (it has a call to .copy(), which should throw an error for any backed dataset).
Did calling sum work before we had the big scipy update?
See also theislab/scanpy#650
This operation is poorly defined for our current backed mode.
What would the expected result be? Does it return a view in backed mode? Do we need to write a new copy of the object somewhere?
For the time being, we should at least raise a better error message than the technical one people get now. Maybe just raise NotImplementedError('filter_genes does not yet(?) work in backed mode')
Maybe with your error you can give the suggestion mentioned here: https://github.com/theislab/scanpy/issues/650#issuecomment-499492569
This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!