Support for sparse matrices that default to NaN instead of 0
Please describe your wishes and possible alternatives to achieve the desired result.
Hi,
This issue serves solely as a basis for discussion for now. As far as I know, scipy sparse matrices and pretty much every other implementation (correct me if I'm wrong!) use 0 as the implicit sparse value.
For ehrapy it would be very useful to also support NaN as the default value for sparse matrices. 0s have a meaning in EHR data. So do NaNs, but this is a much harder problem to solve and is up to the data collectors (we should eventually differentiate between informed and uninformed NaNs - but this is not relevant here).
Adding support for this would probably be a monumental effort: it would require adding support in SciPy sparse arrays and adapting implementations in scanpy. Before doing anything, I'd like to hear what people think.
Discussed in the past with @ivirshup at the Theislab retreat.
Also discussed in person, but:
- scipy sparse arrays support explicit zeros, so you can always just "pretend" the non-explicit zeros are `nan`
- pydata/sparse does support alternative missing values, like `nan`
- I haven't extensively used a sparse library that does "proper" support for `nan` as the sparse value. The issue is that `nan` behaves differently than zero, but most of the compiled code expects the missing value to be `0`, so doesn't do `nan` propagation correctly
  - For example: https://github.com/pydata/sparse/issues/340
- graphblas could probably do this correctly
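
To make the first point concrete, here is a minimal sketch of the "pretend" approach with scipy, assuming a small toy matrix. The helper name `to_dense_nan` is made up for illustration and is not part of scipy. It also shows the propagation problem from the third point: built-in reductions treat unstored entries as 0, not `nan`.

```python
import numpy as np
from scipy import sparse

# Toy data: NaN means "not measured", 0.0 is a real observed value.
dense = np.array([[np.nan, 0.0, 2.5],
                  [np.nan, np.nan, 0.0]])

# Store only the observed entries, including the observed zeros.
observed = ~np.isnan(dense)
rows, cols = np.nonzero(observed)
m = sparse.csr_matrix((dense[observed], (rows, cols)), shape=dense.shape)

def to_dense_nan(mat):
    """Hypothetical helper: densify, reading unstored entries as NaN."""
    out = np.full(mat.shape, np.nan)
    coo = mat.tocoo()
    out[coo.row, coo.col] = coo.data
    return out

stored = m.nnz            # counts the explicit zeros as stored entries
roundtrip = to_dense_nan(m)

# scipy itself still treats unstored entries as 0, so this sum is finite,
# whereas a NaN-propagating sum over the dense data would be NaN.
sparse_total = m.sum()
dense_total = dense.sum()
```

The catch is that every operation has to be checked for whether it respects this convention. Anything that calls `eliminate_zeros()` or densifies with `toarray()` silently discards the distinction between an observed 0 and a missing value.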
What fraction of entries in your matrices should be nan? If fewer than 90% are nan, there may not be much to gain from a sparse matrix at all, and a masked array may be more appropriate.
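
For comparison, a minimal sketch of the masked-array alternative using NumPy's `numpy.ma` module, again on made-up toy data. It checks the missing fraction first and then runs a reduction that skips the missing entries while keeping the observed zeros.

```python
import numpy as np

# Toy data: NaN means "not measured", 0.0 is a real observed value.
dense = np.array([[np.nan, 0.0, 2.5],
                  [1.0, np.nan, 0.0]])

# Mask every NaN (and inf) entry; observed zeros stay unmasked.
masked = np.ma.masked_invalid(dense)

# Fraction of missing entries, useful for deciding whether sparse storage pays off.
missing_fraction = masked.mask.mean()

# Reductions ignore masked entries but still count the observed zeros.
col_means = masked.mean(axis=0)
```

A masked array stores the full dense data plus a boolean mask, so it saves no memory. The benefit is that missing values are handled consistently by the reductions.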