`ak.where` produces different output types and shapes in awkward2 versus awkward1 since PR 3168
Version of Awkward Array
2.8.4
Description and code to reproduce
The following raises an error in awkward2:
In [1]: import awkward as ak
In [2]: import pathlib
In [3]: condition = ak.from_json(pathlib.Path("condition_v2.json"))
In [4]: array = ak.from_json(pathlib.Path("DPhiHJ0J1_v2.json"))
In [5]: ak.where(condition, -999, array)
with
ValueError: cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)
This error occurred while calling
ak.where(
<Array [None, None, ..., [False], [False]] type='9610 * option[var ...'>
-999
<Array [None, None, ..., [2.74], [3.06]] type='9610 * option[var * ...'>
)
while it's fine in awkward1
In [1]: import pathlib
In [2]: import awkward as ak
In [3]: array = ak.from_json(pathlib.Path("DPhiHJ0J1_v2.json"))
In [4]: array
Out[4]: <Array [None, None, None, [3.12, ... [3.06]] type='9610 * option[var * float64]'>
In [5]: condition = ak.from_json(pathlib.Path("condition_v2.json"))
In [6]: ak.where(condition, -999, array)
Out[6]: <Array [None, None, None, [3.12, ... [3.06]] type='9610 * option[var * float64]'>
In [7]: ak.__version__
Out[7]: '1.10.5'
The arrays in json format are here: DPhiHJ0J1_v2.json condition_v2.json
The weird thing in this example is that the mask (condition) is itself an option type. Maybe that shouldn't even exist and a more graceful error should be raised. An option type mask feels a bit weird in my head. I don't know what the history of awkward says for such cases.
Edit:
It turned out that the original JSON arrays in this issue had broadcasting problems originating in user code, so that part is unrelated to awkward.
However, there are inconsistencies in the shape and type of ak.where's output since https://github.com/scikit-hep/awkward/pull/3168.
For example before that PR (and awkward1):
In [1]: import awkward as ak
In [2]: x = ak.Array([None, [True], None, [False]])
In [3]: y = ak.Array([None, [1], None, [2]])
In [4]: z = ak.Array([10, 11, 12, 13])
In [5]: ak.where(x, y, z)
Out[5]: <Array [None, [1], None, [13]] type='4 * option[var * int64]'>
With that PR:
In [5]: ak.where(x, y, z)
Out[5]: <Array [[None], [1], [None], [13]] type='4 * var * ?int64'>
I find the former output type and shape a lot more intuitive. We should probably try to get the best of both worlds. That PR fixed some problems with ak.where and that should stay as is but also introduced some output type and shape changes that are probably not the best. See the long discussion below for more info.
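To make the "former" behavior concrete, here is a minimal pure-Python sketch of a where that keeps a None row of the condition as None in the output. It works on plain nested lists (what ak.to_list would return), not on awkward arrays. The function name and the handling of a None row in the selected branch are my own assumptions for illustration; this is not how awkward implements ak.where.

```python
def where_keep_outer_option(condition, if_true, if_false):
    """Pure-Python model: a None row in `condition` stays None in the output."""

    def row_of(value, i, n):
        # Hypothetical helper: a non-list operand is a scalar; a list operand
        # supplies one item per row. Non-list items are repeated n times.
        item = value[i] if isinstance(value, list) else value
        if item is None:
            return None
        return item if isinstance(item, list) else [item] * n

    out = []
    for i, cond_row in enumerate(condition):
        if cond_row is None:
            out.append(None)  # outer None is passed through unchanged
            continue
        ts = row_of(if_true, i, len(cond_row))
        fs = row_of(if_false, i, len(cond_row))
        row = []
        for j, c in enumerate(cond_row):
            chosen = ts if c else fs
            # Assumption: a None row in the chosen branch yields a None element.
            row.append(None if chosen is None else chosen[j])
        out.append(row)
    return out


x = [None, [True], None, [False]]
y = [None, [1], None, [2]]
z = [10, 11, 12, 13]
print(where_keep_outer_option(x, y, z))  # [None, [1], None, [13]]
```

Rows whose list lengths differ between operands are not handled here and would raise an IndexError.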
@pfackeldey I'm opening this for reference even though I told you the bug already.
@ikrommyd - thank you for reporting it!!
Hmm these arrays are weird. This is probably an awkward1 bug actually. These do not match:
In [48]: ak.drop_none(ak.num(array))
Out[48]: <Array [1, 0, 0, 1, 0, 0, 0, 1, ..., 1, 1, 0, 1, 1, 1, 1, 1] type='891 * int64'>
In [49]: ak.drop_none(ak.num(condition))
Out[49]: <Array [1, 0, 0, 1, 0, 0, 0, 1, ..., 1, 1, 0, 1, 1, 1, 1, 1] type='910 * int64'>
and also
In [54]: ak.all(ak.fill_none(ak.num(array), -999) == ak.fill_none(ak.num(condition), -999))
Out[54]: np.False_
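The ak.num comparison above can be mirrored on plain Python lists (e.g. the output of ak.to_list) to see exactly which rows disagree. This is only a sketch with a hypothetical helper name; it flags rows whose shapes differ (None versus list, or lists of different length), which is not necessarily the same set of rows that makes ak.where raise.

```python
def misaligned_rows(condition, array):
    """Return (index, condition_row, array_row) for rows whose shapes differ."""

    def shape_of(row):
        # None rows and list rows are distinct shapes; lists compare by length.
        return None if row is None else len(row)

    return [
        (i, c, a)
        for i, (c, a) in enumerate(zip(condition, array))
        if shape_of(c) != shape_of(a)
    ]


condition = [None, [False], [], None]
array = [None, [2.94], None, None]
print(misaligned_rows(condition, array))  # [(2, [], None)]
```

Note that zip stops at the shorter input, so a difference in outer length has to be checked separately.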
Where seems fine in small examples like:
In [56]: ak.where([None, True, None, False], -999, [None, 1, None, 2])
Out[56]: <Array [None, -999, None, 2] type='4 * ?int64'>
and
In [57]: ak.where([None, [True], None, [False]], -999, [None, [1], None, [2]])
Out[57]: <Array [[None], [-999], [None], [2]] type='4 * var * ?int64'>
It also seems fine with subsets of the arrays like
In [59]: condition[-10:]
Out[59]: <Array [None, None, None, ..., [False], [False]] type='10 * option[var * bool]'>
In [60]: array[-10:]
Out[60]: <Array [None, None, None, ..., [2.74], [3.06]] type='10 * option[var * floa...'>
In [61]: ak.where(condition[-10:], -999, array[-10:])
Out[61]: <Array [[None], [None], [None], ..., [2.74], [3.06]] type='10 * var * ?float64'>
and
In [63]: condition[:10]
Out[63]: <Array [None, None, None, ..., None, None, None] type='10 * option[var * bool]'>
In [64]: array[:10]
Out[64]: <Array [None, None, None, ..., None, None] type='10 * option[var * float64]'>
In [65]: ak.where(condition[:10], -999, array[:10])
Out[65]: <Array [[None], [None], [None], ..., [None], [None]] type='10 * var * ?float64'>
It also doesn't error if you manually broadcast first and pass the broadcasted arrays to where.
In [68]: ak.broadcast_arrays(array, condition)
Out[68]:
[<Array [None, None, None, ..., [2.74], [3.06]] type='9610 * option[var * fl...'>,
<Array [None, None, ..., [False], [False]] type='9610 * option[var * bool]'>]
In [69]: ak.where(ak.broadcast_arrays(array, condition)[1], -999, ak.broadcast_arrays(array, condition)[0])
Out[69]: <Array [[None], [None], ..., [2.74], [3.06]] type='9610 * var * ?float64'>
Notice however that there is a difference in behavior. Awkward1 doesn't error and returns this:
In [6]: ak.where(condition, -999, array)
Out[6]: <Array [None, None, None, [3.12, ... [3.06]] type='9610 * option[var * float64]'>
So in Awkward1 the Nones are not wrapped in a jagged dimension, while Awkward2 does wrap them and gives you [None] in the final output array in the cases where it works (all examples above). Awkward1 seems to ignore the Nones, leave them as they are, and perform the where operation only on everything else.
This may be just an awkward1 problem (which we won't fix) and a question whether a better error or some better handling of such arrays can be made.
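For comparing results between the two versions, one option is to undo the wrapping after the fact by using the condition as the reference for where the outer Nones belong. A small sketch on plain lists, with a hypothetical helper name; the wrapped input is the 2.8.4 output shown in Out[57] above.

```python
def restore_outer_none(result, condition):
    """Replace a row of `result` with None wherever the condition row is None."""
    return [None if c is None else r for r, c in zip(result, condition)]


condition = [None, [True], None, [False]]
wrapped = [[None], [-999], [None], [2]]  # shape reported for 2.8.4 in this thread
print(restore_outer_none(wrapped, condition))  # [None, [-999], None, [2]]
```

Using the condition rather than looking for [None] rows avoids confusing a genuine one-element row holding None with a wrapped outer None.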
@pfackeldey Did you investigate any further what's happening here while I was away?
I also found that the shapes don't match. Also, we do support ak.where with Nones, see the tests: https://github.com/scikit-hep/awkward/blob/main/tests/test_3098_ak_where_with_arrays_containing_optionals.py
I didn't investigate further, because - as you pointed out - this looks like no problem in awkward, at least to my understanding.
The last part (that it works when broadcasting manually first) is interesting again. I'm not sure I fully understand what is happening there. We do have a special code path for broadcasting options in ak.where: https://github.com/scikit-hep/awkward/blob/main/src/awkward/_broadcasting.py#L861
Well there is still some inconsistency right? In the cases where it works for example here:
In [59]: condition[-10:]
Out[59]: <Array [None, None, None, ..., [False], [False]] type='10 * option[var * bool]'>
In [60]: array[-10:]
Out[60]: <Array [None, None, None, ..., [2.74], [3.06]] type='10 * option[var * floa...'>
In [61]: ak.where(condition[-10:], -999, array[-10:])
Out[61]: <Array [[None], [None], [None], ..., [2.74], [3.06]] type='10 * var * ?float64'>
The condition is N * option[var * bool] and the array is N * option[var * float64]. ak.where seems to broadcast that to N * var * ?float64. So the Nones are wrapped in a jagged dimension while awkward1 maintains the N * option[var * float64] type. See:
In [6]: ak.where(condition, -999, array)
Out[6]: <Array [None, None, None, [3.12, ... [3.06]] type='9610 * option[var * float64]'>
In [7]: ak.__version__
Out[7]: '1.10.5'
You can see this in the following simple example:
In [1]: import awkward as ak
In [2]: ak.__version__
Out[2]: '1.10.5'
In [3]: ak.where([None, [True], None, [False]], -999, [None, [1], None, [2]])
Out[3]: <Array [None, [-999], None, [2]] type='4 * option[var * int64]'>
versus
In [1]: import awkward as ak
In [2]: ak.__version__
Out[2]: '2.8.4'
In [3]: ak.where([None, [True], None, [False]], -999, [None, [1], None, [2]])
Out[3]: <Array [[None], [-999], [None], [2]] type='4 * var * ?int64'>
So something is definitely not consistent in how ak.where applies broadcasting.
See here for example how ak.where changes the type while addition doesn't:
In [21]: x = ak.Array([None, [True], None, [False]])
In [22]: y = ak.Array([None, [1], None, [2]])
In [23]: z = ak.Array([10, 11, 12, 13])
In [24]: ak.where(x, y, z)
Out[24]: <Array [[None], [1], [None], [13]] type='4 * var * ?int64'>
In [25]: x + y
Out[25]: <Array [None, [2], None, [2]] type='4 * option[var * int64]'>
In [26]: x + z
Out[26]: <Array [None, [12], None, [13]] type='4 * option[var * int64]'>
In [27]: y + z
Out[27]: <Array [None, [12], None, [15]] type='4 * option[var * int64]'>
Plain broadcasting doesn't change the type like that:
In [10]: ak.broadcast_arrays(x, y, -999)
Out[10]:
[<Array [None, [True], None, [False]] type='4 * option[var * bool]'>,
<Array [None, [1], None, [2]] type='4 * option[var * int64]'>,
<Array [None, [-999], None, [-999]] type='4 * option[var * int64]'>]
awkward1 on the other hand gives
In [8]: ak.where(x, y, z)
Out[8]: <Array [None, [1], None, [13]] type='4 * option[var * int64]'>
@ikrommyd - if we use awkward 2 - is it wrong? We will not be fixing awkward 1. I don’t think we should use it as a reference. Please, check if ak.where gives you correct results with awkward2. Thanks!
I think the original failure with ak.where is because the condition and the array have different raggedness, which makes it correct that it doesn't work in awkward 2 I'd say. I rather consider awkward 1 wrong in that case.
On the second (new) part: the fact that the output dimensions differ for broadcastable inputs to ak.where is something to look into; it is, however, a completely different issue from the one this thread began with.
Well the original issue is that the ak.where behavior differs and it indeed does. ak.where changes the type in this case. I find this odd but I can't say with absolute certainty what the right behavior is. In general I don't think that ak.where should change the type like that
@ikrommyd - if we use awkward 2 - is it wrong? We will not be fixing awkward 1. I don’t think we should use it as a reference. Please, check if ak.where gives you correct results with awkward2. Thanks!
Well the original arrays seem to be a bit wrong so that points to a problem in user code and I don't think that ak.where gives a strictly "wrong" result. However that doesn't change the fact that ak.where behavior seems a bit off with its type-changing thingy.
@pfackeldey there you go. Performed ak.where in slices of length 10 to find the problem.
In [9]: length = len(condition)
   ...: problematic = []
   ...:
   ...: for start in range(0, length, 10):
   ...:     end = min(start + 10, length)
   ...:     cond_chunk = condition[start:end]
   ...:     arr_chunk = array[start:end]
   ...:     try:
   ...:         _ = ak.where(cond_chunk, -999, arr_chunk)
   ...:     except Exception as e:
   ...:         print(f"❌ Error in chunk {start}:{end} \n {e!r}")
   ...:         print(" condition slice:", cond_chunk)
   ...:         print(" array slice: ", arr_chunk)
   ...:         problematic.append((start, end, e))
   ...:
   ...: if not problematic:
   ...:     print("✅ No errors found in any 10-element chunk.")
   ...: else:
   ...:     print(f"\nFound {len(problematic)} problematic chunk(s):")
   ...:     for start, end, err in problematic:
   ...:         print(f" • {start}:{end} → {err!r}")
   ...:
❌ Error in chunk 1340:1350
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [None, None, None, None, None, None, [False], [], None, None]
array slice: [None, None, None, None, None, None, None, None, None, None]
❌ Error in chunk 4390:4400
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [[], None, None, None, None, None, None, None, None, []]
array slice: [[], None, None, None, None, None, None, None, None, None]
❌ Error in chunk 7760:7770
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [None, None, None, None, [], None, None, None, None, None]
array slice: [None, None, None, None, None, None, None, None, None, None]
❌ Error in chunk 9420:9430
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [None, None, None, None, None, None, None, None, [], None]
array slice: [None, None, None, None, None, None, None, None, None, None]
❌ Error in chunk 9430:9440
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [None, [False], [], None, None, None, None, None, None, None]
array slice: [None, [2.94], None, None, None, None, None, None, None, None]
Found 5 problematic chunk(s):
• 1340:1350 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
• 4390:4400 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
• 7760:7770 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
• 9420:9430 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
• 9430:9440 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
Actually, the correct thing to do when slicing is to call ak.where(ak.to_packed(cond_chunk), -999, ak.to_packed(arr_chunk)), which prints:
❌ Error in chunk 320:330
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [[], None, None, None, None, [False], None, [], None, None]
array slice: [[], None, None, None, None, None, None, [], None, None]
❌ Error in chunk 780:790
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [None, None, [], [False], None, None, None, None, None, None]
array slice: [None, None, [], None, None, None, None, None, None, None]
❌ Error in chunk 1340:1350
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [None, None, None, None, None, None, [False], [], None, None]
array slice: [None, None, None, None, None, None, None, None, None, None]
❌ Error in chunk 2320:2330
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [None, [False], None, None, None, None, None, None, None, None]
array slice: [None, None, None, None, None, None, None, None, None, None]
❌ Error in chunk 2620:2630
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [None, None, None, None, [False], None, None, None, None, None]
array slice: [None, None, None, None, None, None, None, None, None, None]
❌ Error in chunk 2890:2900
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [None, None, None, None, None, None, None, None, None, [False]]
array slice: [None, None, None, None, None, None, None, None, None, None]
❌ Error in chunk 5100:5110
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [None, None, None, None, None, None, None, [False], None, None]
array slice: [None, None, None, None, None, None, None, None, None, None]
❌ Error in chunk 9030:9040
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [None, None, None, None, [False], None, None, None, None, None]
array slice: [None, None, None, None, None, None, None, None, None, None]
❌ Error in chunk 9430:9440
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [None, [False], [], None, None, None, None, None, None, None]
array slice: [None, [2.94], None, None, None, None, None, None, None, None]
Found 9 problematic chunk(s):
• 320:330 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
• 780:790 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
• 1340:1350 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
• 2320:2330 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
• 2620:2630 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
• 2890:2900 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
• 5100:5110 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
• 9030:9040 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
• 9430:9440 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
while ak.where(ak.to_list(cond_chunk), -999, ak.to_list(arr_chunk)) gives
❌ Error in chunk 320:330
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [[], None, None, None, None, [False], None, [], None, None]
array slice: [[], None, None, None, None, None, None, [], None, None]
❌ Error in chunk 780:790
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [None, None, [], [False], None, None, None, None, None, None]
array slice: [None, None, [], None, None, None, None, None, None, None]
❌ Error in chunk 9430:9440
ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
condition slice: [None, [False], [], None, None, None, None, None, None, None]
array slice: [None, [2.94], None, None, None, None, None, None, None, None]
Found 3 problematic chunk(s):
• 320:330 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
• 780:790 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
• 9430:9440 → ValueError('cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)')
So there seem to be things like this coming from the user's code:
ak.where([None, [False], []], -999, [None, [2.94], None])
or with two rows:
ak.where([[False], []], -999, [[2.94], None])
which will indeed give
ValueError: cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-46/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)
This error occurred while calling
ak.where(
[None, [False], []]
-999
[None, [2.94], None]
)
Awkward1 handled those like this
In [3]: ak.where([None, [False], []], -999, [None, [2.94], None])
Out[3]: <Array [None, [2.94], None] type='3 * option[var * float64]'>
In [4]: ak.where([[False], []], -999, [[2.94], None])
Out[4]: <Array [[2.94], None] type='2 * option[var * float64]'>
If we look at the single element cases of the above
In [78]: ak.__version__
Out[78]: '2.8.4'
In [79]: ak.where([[]], -999, [None])
Out[79]: <Array [[]] type='1 * var * ?int64'>
In [80]: ak.where([[False]], -999, [[2.94]])
Out[80]: <Array [[2.94]] type='1 * var * float64'>
In [6]: ak.__version__
Out[6]: '1.10.5'
In [7]: ak.where([[]], -999, [None])
Out[7]: <Array [None] type='1 * option[var * int64]'>
In [8]: ak.where([[False]], -999, [[2.94]])
Out[8]: <Array [[2.94]] type='1 * var * float64'>
To me, what awkward2 does here is weird and wrong. It just passes the empty condition through? awkward1 seems to treat empty as false and gives [None] back.
In [7]: ak.where([[]], -999, [None])
Out[7]: <Array [None] type='1 * option[var * int64]'>
Well the original arrays seem to be a bit wrong so that points to a problem in user code and I don't think that ak.where gives a strictly "wrong" result. However that doesn't change the fact that ak.where behavior seems a bit off with its type-changing thingy.
yes, that's what I meant. We should investigate the type change, but not the 'problem' (that's not actually a problem, it's a correct failure) of ak.where with this condition and array as inputs - they are not aligning from their shapes as you see in your chunked examples.
So this is coming directly from the apply_step function call. apply_step returns this empty array back.
In [79]: ak.where([[]], -999, [None])
Out[79]: <Array [[]] type='1 * var * ?int64'>
Awkward1 appears to treat it as if it were the following (empty jagged dimension = False), although the following has type 1 * ?int64 whereas awkward1 maintains 1 * option[var * int64]:
In [4]: ak.where([False], -999, [None])
Out[4]: <Array [None] type='1 * ?int64'>
vs
In [7]: ak.where([[]], -999, [None])
Out[7]: <Array [None] type='1 * option[var * int64]'>
Oh, it looks like ak.where has a specific broadcasting function for option types that only it uses: https://github.com/scikit-hep/awkward/blob/473029b8cc7345083c29186e0c3def874eb3a9f2/src/awkward/_broadcasting.py#L861
Changing ak.where to use the nominal option-type broadcasting preserves the type in the first examples I gave:
diff --git a/src/awkward/_broadcasting.py b/src/awkward/_broadcasting.py
index 5f6903f7..e29a4140 100644
--- a/src/awkward/_broadcasting.py
+++ b/src/awkward/_broadcasting.py
@@ -1157,7 +1157,7 @@ def apply_step(
if all(not x.is_option or isinstance(x, UnmaskedArray) for x in contents):
return broadcast_any_option_all_UnmaskedArray()
elif options["function_name"] == "ak.where":
- return broadcast_any_option_akwhere()
+ return broadcast_any_option()
else:
return broadcast_any_option()
In [1]: import awkward as ak
In [2]: x = ak.Array([None, [True], None, [False]])
In [3]: y = ak.Array([None, [1], None, [2]])
In [4]: z = ak.Array([10, 11, 12, 13])
In [5]: ak.where(x, y, z)
Out[5]: <Array [None, [1], None, [13]] type='4 * option[var * int64]'>
In [6]: ak.where([[]], -999, [None])
Out[6]: <Array [None] type='1 * option[var * float64]'>
It also makes the errors go away here:
In [3]: ak.where([None, [False], []], -999, [None, [2.94], None])
Out[3]: <Array [None, [2.94], None] type='3 * option[var * float64]'>
In [4]: ak.where([[False], []], -999, [[2.94], None])
Out[4]: <Array [[2.94], None] type='2 * option[var * float64]'>
In [1]: import pathlib
In [2]: import awkward as ak
In [3]: condition = ak.from_json(pathlib.Path("condition_v2.json"))
In [4]: array = ak.from_json(pathlib.Path("DPhiHJ0J1_v2.json"))
In [5]: ak.where(condition, -999, array)
Out[5]: <Array [None, None, None, ..., [2.74], [3.06]] type='9610 * option[var * fl...'>
This change, however, causes 3 test failures:
FAILED tests/test_3098_ak_where_with_arrays_containing_optionals.py::test_ak_where_with_optional_unknowns - assert [None] == [1]
FAILED tests/test_3098_ak_where_with_arrays_containing_optionals.py::test_ak_where_with_optionals - assert [None, 5, None] == [1, 5, None]
FAILED tests/test_3098_ak_where_with_arrays_containing_optionals.py::test_ak_where_with_optionals_multidim - assert [[1, 2], None] == [[1, 2], [12, 13]]
I wonder if we can have the best of both worlds somehow when it comes to broadcasting here.
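My reading of the second failure message is that the generic option broadcasting returns None as soon as any operand is None at a position, while the expected answer only depends on the branch that was actually selected. A plain-Python sketch of the two rules follows. The inputs are hypothetical, chosen only so that they reproduce the two lists in the failure message; they are not taken from the test file, and the function names are made up.

```python
def where_any_none(condition, x, y):
    """Generic option rule: a None in any operand makes the output None."""
    return [
        None if (c is None or a is None or b is None) else (a if c else b)
        for c, a, b in zip(condition, x, y)
    ]


def where_selected_none(condition, x, y):
    """Selection-aware rule: only the condition or the chosen branch can give None."""
    return [None if c is None else (a if c else b) for c, a, b in zip(condition, x, y)]


condition = [True, False, True]
x = [1, 2, None]
y = [None, 5, 6]
print(where_any_none(condition, x, y))       # [None, 5, None]
print(where_selected_none(condition, x, y))  # [1, 5, None]
```

The first rule preserves the option structure for free but loses the value at index 0; the second gives the expected values, and the open question in this thread is how to get them without changing the output type.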
Okay, so all of this came from this PR: https://github.com/scikit-hep/awkward/pull/3168. You will notice that its tests contain some "Fails ATOW" and "# ATOW we get X" comments (ATOW = at the time of writing). Awkward1 gives exactly those wrong answers. So it was indeed awkward1's ak.where that was problematic, and the same was true of awkward2 until this PR fixed it about a year ago. However, this PR also makes ak.where change the type, so I wonder if we can improve on that.
We should be getting correct answers without changing the type if possible.
Without this PR:
In [1]: import awkward as ak
In [2]: x = ak.Array([None, [True], None, [False]])
In [3]: y = ak.Array([None, [1], None, [2]])
In [4]: z = ak.Array([10, 11, 12, 13])
In [5]: ak.where(x, y, z).layout
Out[5]:
<IndexedOptionArray len='4'>
<index><Index dtype='int64' len='4'>
[-1 0 -1 1]
</Index></index>
<content><ListOffsetArray len='2'>
<offsets><Index dtype='int64' len='3'>[0 1 2]</Index></offsets>
<content><NumpyArray dtype='int64' len='2'>[ 1 13]</NumpyArray></content>
</ListOffsetArray></content>
</IndexedOptionArray>
With this PR (and currently awkward main):
In [5]: ak.where(x, y, z).layout
Out[5]:
<ListOffsetArray len='4'>
<offsets><Index dtype='int64' len='5'>
[0 1 2 3 4]
</Index></offsets>
<content><ByteMaskedArray valid_when='false' len='4'>
<mask><Index dtype='int8' len='4'>[1 0 1 0]</Index></mask>
<content><NumpyArray dtype='int64' len='4'>[10 1 12 13]</NumpyArray></content>
</ByteMaskedArray></content>
</ListOffsetArray>
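To read the two layouts side by side, the buffers can be decoded by hand. This is a plain-Python sketch of how the three node types involved map buffers to values (helper names are mine); the buffer contents are copied from the two layout dumps above.

```python
def list_offsets(offsets, content):
    """ListOffsetArray: row i is content[offsets[i]:offsets[i + 1]]."""
    return [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


def indexed_option(index, content):
    """IndexedOptionArray: a negative index means None, otherwise content[index]."""
    return [None if i < 0 else content[i] for i in index]


def byte_masked(mask, content, valid_when):
    """ByteMaskedArray: an element is valid where bool(mask) == valid_when."""
    return [v if bool(m) == valid_when else None for m, v in zip(mask, content)]


# Without the PR: the option node sits outside the list node.
before = indexed_option([-1, 0, -1, 1], list_offsets([0, 1, 2], [1, 13]))
# With the PR: the list node sits outside the option node.
after = list_offsets(
    [0, 1, 2, 3, 4],
    byte_masked([1, 0, 1, 0], [10, 1, 12, 13], valid_when=False),
)
print(before)  # [None, [1], None, [13]]
print(after)   # [[None], [1], [None], [13]]
```

So the values are the same in both cases; what changed is the nesting order of the option and list nodes, which is exactly the difference between option[var * int64] and var * ?int64.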
I prefer this
In [1]: import awkward as ak
In [2]: x = ak.Array([None, [True], None, [False]])
In [3]: y = ak.Array([None, [1], None, [2]])
In [4]: z = ak.Array([10, 11, 12, 13])
In [5]: ak.where(x, y, z)
Out[5]: <Array [None, [1], None, [13]] type='4 * option[var * int64]'>
a lot more than this
In [5]: ak.where(x, y, z)
Out[5]: <Array [[None], [1], [None], [13]] type='4 * var * ?int64'>