
Support for sparse matrices that do not default to 0 but NaN instead

Open Zethson opened this issue 2 years ago • 1 comments

Please describe your wishes and possible alternatives to achieve the desired result.

Hi,

This issue serves only as a basis for discussion for now, because I think that scipy sparse matrices and pretty much every other implementation that I know of (correct me if I'm wrong!) use 0 as the implicit sparse value.

For ehrapy it would be very useful to also support NaN as the default value for sparse matrices. 0s have a meaning in EHR data. So do NaNs, but this is a much harder problem to solve and is up to the data collectors (we should eventually differentiate between informed and uninformed NaNs - but this is not relevant here).

Adding support for this would probably be a monumental effort: it would require adding support in SciPy sparse arrays and adapting implementations in scanpy. Before doing anything, I'd like to hear what people think.

Discussed in the past with @ivirshup at the Theislab retreat.

Zethson avatar Nov 17 '23 11:11 Zethson

Also discussed in person, but:

  • scipy sparse arrays support explicit zeros, so you can always just "pretend" the non-explicit zeros are nan
  • pydata/sparse does support alternative missing values, like nan
  • I haven't extensively used a sparse library that "properly" supports nan as the sparse value. The issue is that nan behaves differently from zero, but most compiled code expects the missing value to be 0, so it doesn't propagate nan correctly
    • For example: https://github.com/pydata/sparse/issues/340
    • graphblas could probably do this correctly
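
To make the first bullet concrete, here is a minimal sketch of the "pretend" approach with scipy, assuming a small toy matrix in which NaN means "not measured" and 0.0 is a real observation. The helper `to_dense_with_nan` is hypothetical (it is not part of scipy or anndata); it only illustrates reading every unstored entry as NaN.

```python
import numpy as np
from scipy import sparse

# Toy data (assumed for illustration): NaN = not measured, 0.0 = real value.
dense = np.array([
    [np.nan, 0.0, 2.5],
    [np.nan, np.nan, 0.0],
])

# Store only the observed entries, including observed zeros.
rows, cols = np.nonzero(~np.isnan(dense))
X = sparse.csr_matrix((dense[rows, cols], (rows, cols)), shape=dense.shape)


def to_dense_with_nan(mat):
    """Hypothetical helper: interpret every unstored entry as NaN."""
    out = np.full(mat.shape, np.nan)
    coo = mat.tocoo()
    out[coo.row, coo.col] = coo.data  # stored entries, explicit zeros included
    return out


restored = to_dense_with_nan(X)
```

This only changes how the stored pattern is interpreted. scipy itself still treats unstored entries as 0 in arithmetic, and calling `X.eliminate_zeros()` would discard the explicitly stored zeros, so any code path that touches the matrix has to respect the convention.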

What fraction of entries in your matrices should be nan? If fewer than 90% of the entries are nan, there may not be much to gain from a sparse matrix at all, and a masked array may be more appropriate.
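
For comparison, a masked array could look roughly like this. It is a sketch using `numpy.ma` on assumed toy data, not a proposal for how anndata should store it:

```python
import numpy as np

# Toy data (assumed for illustration): NaN = missing, 0.0 = real value.
dense = np.array([
    [np.nan, 0.0, 2.5],
    [1.0, np.nan, 0.0],
])

# Mask every NaN entry; the observed zeros stay unmasked.
masked = np.ma.masked_invalid(dense)

# Reductions skip masked entries instead of propagating NaN.
col_means = masked.mean(axis=0)

# Fraction of entries that are missing, to judge whether sparsity would pay off.
missing_fraction = masked.mask.mean()
```

Here the mask costs one boolean per entry on top of the dense data, which is why this only makes sense when the missing fraction is moderate.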

ivirshup avatar Dec 11 '23 14:12 ivirshup