ak.zip seems to work recursively, but it doesn't really
In dask-contrib/dask-awkward/issues/213, @masonproffitt asked for
>>> a = ak.Array([1])
>>> ak.zip({'a': a, 'b': {'c': a}})
<Array [{a: 1, b: {...}}] type='1 * {a: int64, b: {c: var * int64}}'>
to work in dask-awkward as it does in Awkward. But ak.zip doesn't really do a nested zip; it just calls ak.to_layout on each of the values of the dict.
https://github.com/scikit-hep/awkward/blob/6a24ed0d436bcd158f634d9bd9f6d664fff6bd2b/src/awkward/operations/ak_zip.py#L174-L190
For a nested dict (a general, non-Awkward, non-ndarray container), that means it falls back to ak.from_iter, which (1) is slow, (2) discards NumPy dtypes, and (3) doesn't zip: the type you get back is a struct of arrays rather than an array of structs.
>>> array = ak.zip({
... "a": {"b": np.arange(10, dtype=np.int8), "c": np.arange(10, dtype=np.int16)},
... "d": {"e": np.arange(10, dtype=np.int32), "f": np.arange(10, dtype=np.float32)},
... })
>>> array.show(type=True)
type: {
    a: {
        b: var * int64,
        c: var * int64
    },
    d: {
        e: var * int64,
        f: var * float64
    }
}
{a: {b: [0, 1, 2, 3, 4, ..., 6, 7, 8, 9], c: [0, 1, ..., 9]},
 d: {e: [0, 1, 2, 3, 4, ..., 6, 7, 8, 9], f: [0, 1, ..., 9]}}
whereas
>>> array2 = ak.zip({"b": np.arange(10, dtype=np.int8), "c": np.arange(10, dtype=np.int16)})
>>> array2.show(type=True)
type: 10 * {
    b: int8,
    c: int16
}
[{b: 0, c: 0},
 {b: 1, c: 1},
 {b: 2, c: 2},
 {b: 3, c: 3},
 {b: 4, c: 4},
 {b: 5, c: 5},
 {b: 6, c: 6},
 {b: 7, c: 7},
 {b: 8, c: 8},
 {b: 9, c: 9}]
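In the nested case, all of the integer types turn into int64 and float32 turns into float64, because ak.from_iter treats the values as Python int and float, which loses the dtype. The nested object also comes back with a different structure: a single record whose fields are lists, rather than an array of 10 records.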
It's not obvious to me what the correct behavior is. Treating any expected array-like uniformly with ak.to_layout is good for consistency, but @masonproffitt's interpretation is natural, too.
Originally posted by @jpivarski in https://github.com/dask-contrib/dask-awkward/issues/213#issuecomment-1497887851
I agree that this is a policy question.
I'd vote in favour of not recursively zipping, because ak.zip accepts useful parameters that might not apply to each call to ak.zip identically, i.e. the user may well want different depth_limit values. For simplicity and consistency, I'd prefer to require the user to call ak.zip multiple times.
> I'd vote in favour of not recursively zipping, because ak.zip accepts useful parameters that might not apply to each call to ak.zip identically, i.e. the user may well want different depth_limit values. For simplicity and consistency, I'd prefer to require the user to call ak.zip multiple times.
I have no problem with this being the default behavior, but I'd love to see automatic recursive zipping as a feature (maybe as an optional argument to ak.zip?). This is a pretty common use case in handling func-adl-uproot queries.