`mask` property that returns a boolean mask of existing values

Open Hoeze opened this issue 4 years ago • 11 comments

Hi, I'm trying to get a boolean mask of all non-zero values in a sparse array. Does this library have this functionality? https://numpy.org/devdocs/reference/generated/numpy.ma.getmask.html#numpy.ma.getmask

Example:

# sparse representation of the following array:
# [[-1, -1],
#  [-1,  3]]
s = sparse.COO(coords=[[1], [1]], data=[3], shape=(2, 2), fill_value=-1)
s.mask

Expected result:

[[False, False],
 [False, True ]]

Hoeze avatar Jun 18 '20 14:06 Hoeze

You can easily do s != s.fill_value, and this will efficiently compute the mask as a COO array.

hameerabbasi avatar Jun 18 '20 14:06 hameerabbasi

My solution is O(nnz). If you wanted to do it as fast as possible (constant time), you can also do the following:

coords = s.coords
data = np.broadcast_to(True, coords.shape[1])
mask = sparse.COO(coords=coords, data=data, shape=s.shape)

hameerabbasi avatar Jun 18 '20 14:06 hameerabbasi

Thanks for your answer @hameerabbasi. My case is a bit special, because my fill value has a slightly different meaning than "non-existent". This is why I work with either masked arrays or sparse arrays:

import numpy as np
import sparse
from typing import Union

# The flag is named `as_sparse` here because a parameter called `sparse` would shadow the module.
def get_genotype(samples, variants, as_sparse=False, masked=True, fill_value=-1) -> Union[sparse.COO, np.ndarray, np.ma.masked_array]:
    coords, data = [...]

    num_samples = len(samples)
    num_variants = len(variants)
    sparse_array = sparse.COO(coords, data, shape=(num_variants, num_samples), fill_value=fill_value)

    if as_sparse:
        return sparse_array
    elif masked:
        the_mask = np.zeros((num_variants, num_samples), dtype=bool)
        # Index with a tuple so that each row of coords addresses one axis.
        the_mask[tuple(coords)] = True

        masked_array = np.ma.array(
            data=sparse_array.todense(),
            mask=the_mask,
            fill_value=fill_value,
        )

        return masked_array
    else:
        return sparse_array.todense()
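
For reference, a NumPy-only sketch of the masked-array branch. The `coords`, `data`, and `shape` values are made-up stand-ins for the elided ones, and the sketch assumes the goal is to hide the positions that are not stored. `np.ma` treats `True` in the mask as "masked", so the mask here starts fully masked and the stored positions are unmasked:

```python
import numpy as np

# Hypothetical stand-ins for the elided coords/data: a single stored value 3 at (1, 1).
coords = np.array([[1], [1]])  # shape (ndim, nnz), the layout sparse.COO uses
data = np.array([3])
shape = (2, 2)
fill_value = -1

# Dense array: fill value everywhere, stored data at the stored coordinates.
dense = np.full(shape, fill_value, dtype=data.dtype)
dense[tuple(coords)] = data

# True means "masked" in np.ma, so mark everything missing, then unmask stored entries.
missing = np.ones(shape, dtype=bool)
missing[tuple(coords)] = False

masked_array = np.ma.array(dense, mask=missing, fill_value=fill_value)
print(masked_array)
```

Inverting `missing` gives the mask of existing values that the question asks for.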

Hoeze avatar Jun 18 '20 14:06 Hoeze

You were 20s faster than me, @hameerabbasi :smile: Yes, but it would be nice to have a general solution for CSD as well :)

Hoeze avatar Jun 18 '20 14:06 Hoeze

PyData/Sparse's stated goal has always been NumPy compatibility for sparse arrays (that is, ndarray semantics, not masked arrays). Also, CSD is experimental and not yet released as public API.

If your request is for a .mask property, then please say so; however, it may not be added to CSD, precisely because it's experimental. If a workaround is sufficient, please close this issue. 😉

hameerabbasi avatar Jun 18 '20 14:06 hameerabbasi

OK, thanks for the clarification.

My workaround is good enough for now, as I'm using sparse only as a wrapper for my COO data. I was not sure whether sparse already implements this feature under another name (something like sparse.nonzero() with fill_value=None, or exists()), which would have made my workaround unnecessary, so that's why I asked.

Still, I think it would be useful to have some masked-array functionality in sparse so I would like to change this to a feature request for a .mask property. Feel free to close the issue if you disagree :slightly_smiling_face:

Hoeze avatar Jun 18 '20 14:06 Hoeze

So it seems your desire is to merge the code so it no longer has two paths. I'm in close contact with the NumPy devs, and masked arrays will soon stop existing in their current form and instead be able to hold any duck array (as they're called).

Accepting such a feature set may turn SparseArray into a chimera, which is bad for the code and bad for maintenance. It'd be much nicer to have a MaskedArray duck array that can hold other such arrays. 😄

hameerabbasi avatar Jun 18 '20 15:06 hameerabbasi

Cool, that's great news :smile: A MaskedArray duck array would help solve so many edge cases!

Is there a plan for a general interface that also allows implementing MaskedArray-like types? This would keep SparseArray from becoming a chimera and preserve the advantages of the sparse representation (i.e. efficient mask calculation on demand).

Hoeze avatar Jun 18 '20 15:06 Hoeze

So the plan is that once you have a duck array, it'll be easier to make a masked array from that duck array. As for making a duck array, there are several ongoing efforts to make that easy. Stay tuned. 😉

If you agree that .mask is something that shouldn't go in PyData/Sparse, please close the issue. If you still feel it should be in, let's leave it open for other interested parties; given enough interest, it can be done.

hameerabbasi avatar Jun 18 '20 15:06 hameerabbasi

IMHO there should definitely be a general way to get the boolean mask of missing values. This functionality depends directly on the internal representation of the sparse array.

The final API of this functionality is something that should be discussed:

  • .mask
  • .nonzero(missing=True)
  • .missing()
  • .to_masked_array()
  • .__array_mask__()
  • np.as_masked_array(the_array)
  • [...]

Hoeze avatar Jun 18 '20 15:06 Hoeze

As the mask, in general, takes time to compute, I'd prefer .nonzero(missing=True) or .missing() as the shortest route. 😄

All others are misleading:

  • .mask is a property, implies no computation.
  • .to_masked_array() implies a np.ma.MaskedArray (which is soft-deprecated).
  • .__array_mask__() implies community consensus, which we don't yet have. If you want to work on this consensus, feel free!
  • np.as_masked_array(the_array) would require changes in NumPy.

I'd like to propose a sparse.get_mask(array) function in addition.

hameerabbasi avatar Jun 27 '20 09:06 hameerabbasi

Closing this as out of scope for the meantime.

hameerabbasi avatar Jan 05 '24 07:01 hameerabbasi