Convenience function to turn an Awkward Array into a NumPy array in any way that it can

Open nikoladze opened this issue 5 years ago • 5 comments

Currently it seems a bit cumbersome to create a contiguous NumPy array (after padding and filling, e.g. for input into ML models) from records with fields of different numeric types (e.g. int and float, or float and double). I'm looking for behaviour similar to .values or .to_numpy() in pandas:

>>> df = pd.DataFrame({"a" : [1, 2, 3], "b" : [1.1, 2.2, 3.3]})
>>> df.dtypes
a      int64
b    float64
dtype: object
>>> df.to_numpy()
array([[1. , 1.1],
       [2. , 2.2],
       [3. , 3.3]])
>>> df.to_numpy().dtype
dtype('float64')

There are two obstacles when trying this with awkward:

  • When I call ak.fill_none on padded records, the result is a union type that can't be converted to NumPy, e.g.
>>> import awkward1 as ak
>>> array = ak.zip({"a" : [[1, 2], [], [3, 4, 5]], "b" : [[1.1, 2.2], [], [3.3, 4.4, 5.5]]})
>>> ak.fill_none(ak.pad_none(array, 2, clip=True), 0)
<Array [[{a: 1, b: 1.1}, ... a: 4, b: 4.4}]] type='3 * 2 * union[{"a": int64, "b...'>
>>> padded = ak.fill_none(ak.pad_none(array, 2, clip=True), 0)
>>> padded
<Array [[{a: 1, b: 1.1}, ... a: 4, b: 4.4}]] type='3 * 2 * union[{"a": int64, "b...'>
>>> ak.type(padded)
3 * 2 * union[{"a": int64, "b": float64}, int64]
  • When I have a record array that can be converted to NumPy, the result is a structured NumPy array, which I still have to cast to a consistent dtype for many ML applications
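
For the second obstacle, NumPy ships a helper that casts a structured array to a single dtype and stacks the fields along a new last axis. The following is a minimal sketch, assuming the records have already been converted to a structured array; the field names and values are illustrative and simply mirror the pandas example above.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Structured array with mixed field types, similar to what a record
# array looks like after conversion to NumPy (illustrative values).
structured = np.array(
    [(1, 1.1), (2, 2.2), (3, 3.3)],
    dtype=[("a", np.int64), ("b", np.float64)],
)

# Cast every field to float64 and place the fields on a trailing axis.
plain = rfn.structured_to_unstructured(structured, dtype=np.float64)

print(plain.shape, plain.dtype)
```

This covers only the dtype step; it does not help with padding or with the union type from the first obstacle.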

I believe @nsmith- also ran into this when trying to show the padding and filling features of awkward in his tutorial on NanoEvents yesterday.

Not sure how best to implement convenience functions for this, but maybe one could add extra options to ak.fill_none and ak.to_numpy, roughly like the following (how to deal with nested records still needs to be figured out):

def new_fill_none(array, value, cast_value=False, **kwargs):
    if cast_value and len(ak.keys(array)) > 0:
        # having this as a fill value won't result in a union array
        value = {k : value for k in ak.keys(array)}
    return ak.fill_none(array, value, **kwargs)

def new_to_numpy(array, consistent_dtype=None, **kwargs):
    np_array = ak.to_numpy(array, **kwargs)
    if consistent_dtype is not None:
        if len(ak.keys(array)) == 0:
            raise ValueError("Can't use `consistent_dtype` when array has no fields")
        # cast every field to the same dtype, then view the packed fields
        # as one extra trailing axis of that dtype
        np_array = np_array.astype(
            [(k, consistent_dtype) for k in ak.keys(array)], copy=False
        ).view((consistent_dtype, len(ak.keys(array))))
    return np_array

>>> import awkward1 as ak
>>> array = ak.zip({"a" : [[1, 2], [], [3, 4, 5]], "b" : [[1.1, 2.2], [], [3.3, 4.4, 5.5]]})
>>> new_to_numpy(new_fill_none(ak.pad_none(array, 2, clip=True), 0, cast_value=True), consistent_dtype="float64")
array([[[1. , 1.1],
        [2. , 2.2]],

       [[0. , 0. ],
        [0. , 0. ]],

       [[3. , 3.3],
        [4. , 4.4]]])

nikoladze avatar Jul 14 '20 09:07 nikoladze

Just to piggyback, I feel like ak.pad is a well-deserved function that could combine the arguments of ak.pad_none and ak.fill_none.

nsmith- avatar Jul 14 '20 21:07 nsmith-

Isn't the fact that

In [9]: ak.fill_none(ak.pad_none(array.a, 2, clip=True), 0.)
Out[9]: <Array [[1, 2], [0, 0], [3, 4]] type='3 * 2 * float64'>

casts the integers in array.a into floats a bug?

nsmith- avatar Jul 14 '20 21:07 nsmith-

I just took a look at this and I agree that it could be a better interface. But before developing a new function, perhaps I should throw some more ideas into the mix.

The real issue here is that the padding and filling aren't going all the way down to the numeric level: they're applying to the records. That's why we get Nones in the place of the records (and the ? is on the record type, not the numeric fields within the record):

>>> ak.pad_none(array, 2, clip=True)
<Array [[{a: 1, b: 1.1}, ... a: 4, b: 4.4}]] type='3 * 2 * ?{"a": int64, "b": fl...'>
>>> ak.pad_none(array, 2, clip=True).tolist()
[[{'a': 1, 'b': 1.1}, {'a': 2, 'b': 2.2}], [None, None], [{'a': 3, 'b': 3.3}, {'a': 4, 'b': 4.4}]]

Then when these get filled with zeros, they're zeros in the place of records, which has to be a union.

>>> ak.fill_none(ak.pad_none(array, 2, clip=True), 0)
<Array [[{a: 1, b: 1.1}, ... a: 4, b: 4.4}]] type='3 * 2 * union[{"a": int64, "b...'>
>>> ak.fill_none(ak.pad_none(array, 2, clip=True), 0).tolist()
[[{'a': 1, 'b': 1.1}, {'a': 2, 'b': 2.2}], [0, 0], [{'a': 3, 'b': 3.3}, {'a': 4, 'b': 4.4}]]

What you really want are zeros in place of the numeric fields, which do unify the array elements with the fill value. To get at the fields individually, we can ak.unzip the records (remembering that breaking and merging records is an O(1) operation that we can do freely).

>>> ak.unzip(array)
(<Array [[1, 2], [], [3, 4, 5]] type='3 * var * int64'>,
 <Array [[1.1, 2.2], [], [3.3, 4.4, 5.5]] type='3 * var * float64'>)

So what we really need to do is apply the padding and filling to each of these arrays. We can do it independently of the number of record fields with a list comprehension,

>>> [ak.fill_none(ak.pad_none(x, 2, clip=True), 0) for x in ak.unzip(array)]
[<Array [[1, 2], [0, 0], [3, 4]] type='3 * 2 * int64'>,
 <Array [[1.1, 2.2], [0, 0], [3.3, 4.4]] type='3 * 2 * float64'>]

and then to wrap the whole thing up, we can reverse the unzip with ak.zip.

>>> regularized = ak.zip(dict(zip(
...     ak.keys(array),
...     [ak.fill_none(ak.pad_none(x, 2, clip=True), 0) for x in ak.unzip(array)]
... )))
>>> ak.type(regularized)
3 * 2 * {"a": int64, "b": float64}
>>> ak.to_list(regularized)
[[{'a': 1, 'b': 1.1}, {'a': 2, 'b': 2.2}],
 [{'a': 0, 'b': 0.0}, {'a': 0, 'b': 0.0}],
 [{'a': 3, 'b': 3.3}, {'a': 4, 'b': 4.4}]]

Maybe this should have a high-level function? ak.pad_fields?

Combining ak.pad_none and ak.fill_none into a single ak.pad makes sense (the implementation would just combine the operations on the Python side), but this ak.pad_fields is a different thing: it operates at the field level. Perhaps there needs to be ak.pad_fields_none and ak.fill_fields_none as well? No, because ak.fill_fields_none, at least, isn't any different from the ak.fill_none operation (which recursively replaces None values).

>>> only_padded = ak.zip(dict(zip(
...     ak.keys(array), [ak.pad_none(x, 2, clip=True) for x in ak.unzip(array)]
... )))
>>> ak.type(only_padded)
3 * 2 * {"a": ?int64, "b": ?float64}
>>> ak.to_list(only_padded)
[[{'a': 1, 'b': 1.1}, {'a': 2, 'b': 2.2}],
 [{'a': None, 'b': None}, {'a': None, 'b': None}],
 [{'a': 3, 'b': 3.3}, {'a': 4, 'b': 4.4}]]
>>> 
>>> regularized = ak.fill_none(only_padded, 0)
>>> ak.type(regularized)
3 * 2 * {"a": int64, "b": float64}
>>> ak.to_list(regularized)
[[{'a': 1, 'b': 1.1}, {'a': 2, 'b': 2.2}],
 [{'a': 0, 'b': 0.0}, {'a': 0, 'b': 0.0}],
 [{'a': 3, 'b': 3.3}, {'a': 4, 'b': 4.4}]]

So the missing functionality is ak.pad_fields_none (distinct from ak.pad_none's axis parameter because axis is the number of nested list depths, not record depths) and maybe convenience functions that merge ak.pad_none/ak.pad_fields_none and ak.fill_none.

Actually, the ak.pad_none/ak.pad_fields_none thing feels like it ought to be a function parameter. Then the convenience function ak.pad would have that same parameter.

jpivarski avatar Jul 20 '20 18:07 jpivarski

Isn't the fact that

In [9]: ak.fill_none(ak.pad_none(array.a, 2, clip=True), 0.)
Out[9]: <Array [[1, 2], [0, 0], [3, 4]] type='3 * 2 * float64'>

casts the integers in array.a into floats a bug?

@nsmith- No, that's intentional:

>>> ak.fill_none(ak.Array([1, 2, None, 4]), 3)
<Array [1, 2, 3, 4] type='4 * int64'>
>>> ak.fill_none(ak.Array([1, 2, None, 4]), 3.0)
<Array [1, 2, 3, 4] type='4 * float64'>

What's happening here is that Nones are first replaced by a temporary UnionArray that combines whatever is in the array with whatever the replacement value is: union[int64, int64] and union[int64, float64] in the two cases above. Then we attempt to simplify the temporary UnionArray. Unions of two numeric types can be unified to a numeric type, which is the broadest of the numeric choices: int64 and float64 in the two cases above. It is equivalent to the type unification that NumPy performs when concatenating:

>>> np.concatenate([np.array([1, 2, 3]), np.array([4])])
array([1, 2, 3, 4])
>>> np.concatenate([np.array([1, 2, 3]), np.array([4.0])])
array([1., 2., 3., 4.])

(In fact, ak.concatenate does this through a UnionArray simplify, too. The PR #337 that you motivated by finding NumPy dtype bugs ensures that we now use exactly the same unification rules as NumPy.)

In @nikoladze's case, the UnionArray of records and numbers (zero) could not be simplified.

jpivarski avatar Jul 20 '20 18:07 jpivarski
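
The unification rule described in the comment above can be checked on the NumPy side with np.result_type, which reports the dtype NumPy would pick when combining two types. This is only a sketch of NumPy's behaviour; it does not call into Awkward Array, and the variable names are made up for illustration.

```python
import numpy as np

# Two integer types unify to an integer type.
int_and_int = np.result_type(np.int64, np.int64)

# An integer and a floating-point type unify to the broader float type.
int_and_float = np.result_type(np.int64, np.float64)

# Concatenation applies the same rule to the arrays it joins.
merged = np.concatenate([np.array([1, 2, 3], dtype=np.int64), np.array([4.0])])

print(int_and_int, int_and_float, merged.dtype)
```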

In case you're wondering what all of this is about, I'm going through all of our open issues from oldest to newest to decide what should be done with them, post-2.0.

In this case, @nikoladze's array can be converted to NumPy if you pay attention to the details of which axis needs to be padded and fill with a numeric value at the field level (i.e. don't try to fill missing records with a number). There ought to be a function that makes some reasonable choices (applies standardized rules) to turn anything rectilinear, with a given fill value that is 0 by default. Maybe it should take another argument to choose between clipping to the smallest list length and padding to the longest (the latter being the default).

The point of this is to remember that sometimes, we don't care about structure and don't want to think about it: we just want a NumPy array somehow. This would be a good function to develop with ak.transform; the hardest part might be naming it...

jpivarski avatar Dec 12 '22 15:12 jpivarski
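
To make the rules from the last comment concrete, here is a rough pure-Python stand-in, not an Awkward Array function and not built on ak.transform. The name to_rectilinear, its parameters, and the restriction to one level of lists of flat records are all assumptions made for illustration. It pads to the longest list by default, optionally clips to the shortest, fills with 0 by default, and casts everything to one dtype.

```python
import numpy as np

def to_rectilinear(rows, fields, fill=0, clip_to_shortest=False, dtype=np.float64):
    """Hypothetical helper: pad or clip lists of records into a dense array.

    rows is a list of lists of dicts (the layout ak.to_list gives for the
    example array in this thread); fields fixes the order of the last axis.
    """
    lengths = [len(row) for row in rows]
    if not lengths:
        target = 0
    else:
        target = min(lengths) if clip_to_shortest else max(lengths)

    # Start from an array that is entirely fill values, then overwrite
    # the positions for which a record actually exists.
    out = np.full((len(rows), target, len(fields)), fill, dtype=dtype)
    for i, row in enumerate(rows):
        for j, record in enumerate(row[:target]):
            out[i, j] = [record[name] for name in fields]
    return out

rows = [
    [{"a": 1, "b": 1.1}, {"a": 2, "b": 2.2}],
    [],
    [{"a": 3, "b": 3.3}, {"a": 4, "b": 4.4}, {"a": 5, "b": 5.5}],
]
dense = to_rectilinear(rows, ["a", "b"])
clipped = to_rectilinear(rows, ["a", "b"], clip_to_shortest=True)
print(dense.shape, clipped.shape)
```

Because the example contains an empty list, clipping to the shortest length leaves nothing on the middle axis, which is one reason padding to the longest is the more useful default.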