awkward icon indicating copy to clipboard operation
awkward copied to clipboard

`ak.ravel` doesn't mirror `np.ravel` for `None` leaf values

Open agoose77 opened this issue 3 years ago • 3 comments

Version of Awkward Array

1.8.0

Description and code to reproduce

The bug :bug:

In #1459 I noticed that ak.ravel(masked_array) does not match np.ravel(masked_array) for a NumPy masked_array.

An example:

import awkward as ak
import numpy as np


array = np.ma.array(
    np.random.random(size=(10, 3)),
   mask=np.random.randint(0, 2, size=(10, 3)).astype(np.bool_)
)

assert np.ravel(array).tolist() == ak.ravel(array).tolist()

Why is this a bug? :question:

I think this is a problem, because we try to make NumPy functions on NumpyArrays behave as they do in regular NumPy. Given that we explicitly handle the masked_array type in our codebase (converting it to a ByteMaskedArray), I assert that we should also expect the same to happen for masked arrays.

What should we do here? :mag_right:

Whilst I think we do want to drop None values that are non-leaf (this level of structure is erased anyway, so keeping None values seems pointless), I don't think we necessarily want to lose those at the leaves unless we want to break NumPy compatibility.

The existing behaviour in ak.ravel is useful, though, particularly when using histograms. So, I think we might want this to be an option to ak.ravel, e.g. ak.ravel(drop_nones=True).

I'd be interested in hearing other people's thoughts here.

agoose77 avatar May 05 '22 09:05 agoose77

If it's floating point, we can turn the missing values into nan. If they're integers... I don't think we want to put in maxint or anything. We also don't want to change types to float.

Maybe we should take fill_value as an argument. We can add new arguments to the NumPy functions we overload. The default fill_value could do nan/maxint and allow users to put in something else.

jpivarski avatar May 05 '22 17:05 jpivarski

That's an idea! I also wonder if we want to emulate NumPy here? If so, we would want to only eliminate None at the outermost lists, and keep the innermost option type.

agoose77 avatar May 05 '22 18:05 agoose77

Oh wait, you mean this:

>>> np.ravel(np.ma.MaskedArray([1, 2, 3, 4, 5], [False, True, True, True, False]))
masked_array(data=[1, --, --, --, 5],
             mask=[False,  True,  True,  True, False],
       fill_value=999999)

I was under the (wrong) impression that np.ravel is removing the mask but leaving in the maskedarray.data, which for us would be uninitialized junk, so we'd have to set it to something. But that's not what np.ravel does: it leaves the option-type in.

Technically, the option-type of a MaskedArray is immediately above the primitive type, regardless of whether that's at top-level or not.

Leaving that option-type would not be good for people who want to use np.ravel to flatten data for a plot. We'd have to go back to recommending ak.flatten with axis=None.

MaskedArray's semantics are not as well-established as the rest of NumPy's. (See how often array library designers, following NumPy, emulate np.ma.MaskedArray as well as np.ndarray.) I don't think it's as necessary for us to follow what MaskedArray does as what core NumPy does. For instance, our missing values are not np.ma.masked or similar.

I'm not really sure. I don't have a strong sense of what users of np.ravel would expect in a situation like this.

jpivarski avatar May 05 '22 18:05 jpivarski