anndata Fix sc.pp.filter_genes implementation for h5py.SparseDataset

Right now sc.pp.filter_genes(adata.X, min_counts=10) works correctly for dense file-backed anndata since np.sum(adata.X) here is supported by h5py.Dataset (see https://github.com/h5py/h5py/blob/master/h5py/_hl/dataset.py#L762).

However the same call on a SparseDataset fails with the following error:

~/.anaconda3/lib/python3.7/site-packages/scanpy/preprocessing/_simple.py in filter_genes(data, min_counts, min_cells, max_counts, max_cells, inplace, copy)
    230         number_per_gene = number_per_gene.A1
    231     if min_number is not None:
--> 232         gene_subset = number_per_gene >= min_number
    233     if max_number is not None:
    234         gene_subset = number_per_gene <= max_number

TypeError: '>=' not supported between instances of 'SparseDataset' and 'int'

because np.sum(adata.X) weirdly returns a SparseDataset without reducing any dimensions:

I can think of two solutions:

Set the __array_ufunc__ method of SparseDatset (https://docs.scipy.org/doc/numpy/reference/arrays.classes.html#numpy.class.array_ufunc) to None which would state explicitly that numpy operations are not supported. Next, we can print a proper error in filter_cells/filter_genes which states such operations are not supported.
Implement a sum function (which loads the data into memory) which would enable us to perform gene/cell filtering options. We can print a warning here and inform the user that data is loaded to the memory. Something like following works:

def sum(self, *args, **kwargs):
    return self.value.sum(*args, **kwargs)

Jul 08 '19 22:07 gokceneraslan

That sum behavior is real weird. I'm pretty surprised numpy doesn't throw an error about that one. I'd go as far as to rename this issue to be about that, since I know filtering sparse backed datasets probably won't work due to how subsetting is defined (it has a call to .copy(), which should throw an error for any backed dataset).

Did calling sum work before we had the big scipy update?

Jul 15 '19 06:07 ivirshup

See also theislab/scanpy#650

Aug 03 '20 07:08 flying-sheep

This operation is poorly defined for our current backed mode.

What would the expected result be? Does it return a view in backed mode? Do we need to write a new copy of the object somewhere?

Aug 03 '20 07:08 ivirshup

For the time being, we should at least raise a better error message than the technical one people get now. Maybe just raise NotImplementedError('filter_genes does not yet(?) work in backed mode')

Aug 03 '20 08:08 flying-sheep

Maybe with your error you can give the suggestion mentioned here: https://github.com/theislab/scanpy/issues/650#issuecomment-499492569

Aug 03 '20 13:08 rbf22

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!

Jul 13 '23 02:07 github-actions[bot]