ak.zip seems to work recursively, but it doesn't really

Open jpivarski opened this issue 1 year ago • 2 comments

In dask-contrib/dask-awkward/issues/213, @masonproffitt asked for

>>> a = ak.Array([1])
>>> ak.zip({'a': a, 'b': {'c': a}})
<Array [{a: 1, b: {...}}] type='1 * {a: int64, b: {c: var * int64}}'>

to work in dask-awkward as it does in Awkward. But ak.zip doesn't really do a nested zip; it just calls ak.to_layout on each of the values of the dict.

https://github.com/scikit-hep/awkward/blob/6a24ed0d436bcd158f634d9bd9f6d664fff6bd2b/src/awkward/operations/ak_zip.py#L174-L190

For a nested dict (a general container that is neither an Awkward Array nor an ndarray), that means it falls through to ak.from_iter, which (1) is slow, (2) discards NumPy dtypes, and (3) doesn't zip: the nested dict comes back as a struct of arrays (one record whose fields are lists) rather than an array of structs (a list of records), and that difference shows up in the data type that you get back.

>>> array = ak.zip({
...     "a": {"b": np.arange(10, dtype=np.int8), "c": np.arange(10, dtype=np.int16)},
...     "d": {"e": np.arange(10, dtype=np.int32), "f": np.arange(10, dtype=np.float32)},
... })
>>> array.show(type=True)
type: {
    a: {
        b: var * int64,
        c: var * int64
    },
    d: {
        e: var * int64,
        f: var * float64
    }
}
{a: {b: [0, 1, 2, 3, 4, ..., 6, 7, 8, 9], c: [0, 1, ..., 9]},
 d: {e: [0, 1, 2, 3, 4, ..., 6, 7, 8, 9], f: [0, 1, ..., 9]}}

whereas

>>> array2 = ak.zip({"b": np.arange(10, dtype=np.int8), "c": np.arange(10, dtype=np.int16)})
>>> array2.show(type=True)
type: 10 * {
    b: int8,
    c: int16
}
[{b: 0, c: 0},
 {b: 1, c: 1},
 {b: 2, c: 2},
 {b: 3, c: 3},
 {b: 4, c: 4},
 {b: 5, c: 5},
 {b: 6, c: 6},
 {b: 7, c: 7},
 {b: 8, c: 8},
 {b: 9, c: 9}]

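You see all of the integer types turn into int64 and float32 into float64 because ak.from_iter treats the values as Python int and float, which loses the dtype. You also see a different structure for the nested object: a and d are each a single record holding variable-length lists, not a length-10 array of records.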

It's not obvious to me what the correct behavior is. Treating any expected array-like uniformly with ak.to_layout is good for consistency, but @masonproffitt's interpretation is natural, too.

Originally posted by @jpivarski in https://github.com/dask-contrib/dask-awkward/issues/213#issuecomment-1497887851

jpivarski, Apr 05 '23 17:04

I agree that this is a policy question.

I'd vote in favour of not recursively zipping, because ak.zip accepts useful parameters that might not apply to each call to ak.zip identically, i.e. the user may well want different depth_limit values. For simplicity and consistency, I'd prefer to require the user to call ak.zip multiple times.

agoose77, May 09 '23 12:05

> I'd vote in favour of not recursively zipping, because ak.zip accepts useful parameters that might not apply to each call to ak.zip identically, i.e. the user may well want different depth_limit values. For simplicity and consistency, I'd prefer to require the user to call ak.zip multiple times.

I have no problem with this being the default behavior, but I'd love to see automatic recursive zipping as a feature (maybe as an optional argument to ak.zip?). This is a pretty common use case in handling func-adl-uproot queries.

masonproffitt, May 31 '23 16:05