Problem with awkward array masking with exclusive_jets_constituent_index()
Hello, I am working with @kpachal, @mswiatlo, and @lgray on the development of Coffea and some analysis that involves the use of fastjet. We've been very happy with how smoothly everything using awkward arrays integrates with fastjet.
However, we have noticed a problem when running exclusive_jets_constituent_index() with a masked awkward array. If an awkward array has a boolean mask applied to it on axis 0 before clustering, exclusive_jets_constituent_index() returns indices that are out of range. An error is also thrown when trying to call exclusive_jets_constituents() since the out of range indices are being applied to the array.
Here is an example. Running this
import awkward as ak
import fastjet
px = ak.Array([[-0.181, 0.377],
[0.54, 0.253],
[0.294, 4.65],
[1.45, 12.8],
[-3.55, -1.33]])
py = ak.Array([[-0.798, 0.116],
[-0.65, -0.0342],
[0.254, -1.88],
[-0.179, -1.8],
[-1.64, -1.03]])
pz = ak.Array([[-0.652, -0.0749],
[-0.00527, -0.0731],
[-0.259, -3.29],
[-0.876, -7.18],
[-0.0941, 0.147]])
E = ak.Array([[1.06, 0.425],
[0.857, 0.3],
[0.467, 6],
[1.71, 14.8],
[3.91, 1.7]])
mask = ak.Array([True,True,False,True,True])
eg = ak.zip(
{
'px': px,
'py': py,
'pz': pz,
'E': E,
},
)
jetdef = fastjet.JetDefinition(fastjet.kt_algorithm,1)
fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets_constituent_index(njets=2)
returns
[[[1], [0]],
[[0], [1]],
[[1], [0, 2, 3]],
[[0], [1]]]
---------------------------
type: 4 * var * var * int32
which contains out of range indices at index 2 on axis 0. If we then run
fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets_constituents(njets=2)
it throws the error
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
File ~/anaconda3/lib/python3.10/site-packages/awkward/highlevel.py:950, in Array.__getitem__(self, where)
949 with ak._errors.SlicingErrorContext(self, where):
--> 950 out = self._layout[where]
951 if isinstance(out, ak.contents.NumpyArray):
File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/content.py:510, in Content.__getitem__(self, where)
509 def __getitem__(self, where):
--> 510 return self._getitem(where)
File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/content.py:563, in Content._getitem(self, where)
562 elif isinstance(where, ak.highlevel.Array):
--> 563 return self._getitem(where.layout)
565 # Convert between nplikes of different backends
File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/content.py:638, in Content._getitem(self, where)
637 elif isinstance(where, Content):
--> 638 return self._getitem((where,))
640 elif is_sized_iterable(where):
641 # Do we have an array
File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/content.py:555, in Content._getitem(self, where)
548 next = ak.contents.RegularArray(
549 this,
550 this.length,
551 1,
552 parameters=None,
553 )
--> 555 out = next._getitem_next(nextwhere[0], nextwhere[1:], None)
557 if out.length is not unknown_length and out.length == 0:
File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/regulararray.py:708, in RegularArray._getitem_next(self, head, tail, advanced)
693 self._maybe_index_error(
694 self._backend[
695 "awkward_RegularArray_getitem_jagged_expand",
(...)
706 slicer=head,
707 )
--> 708 down = self._content._getitem_next_jagged(
709 multistarts, multistops, head._content, tail
710 )
712 return RegularArray(
713 down, headlength, self._length, parameters=self._parameters
714 )
File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/listoffsetarray.py:409, in ListOffsetArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
406 out = ak.contents.ListArray(
407 self.starts, self.stops, self._content, parameters=self._parameters
408 )
--> 409 return out._getitem_next_jagged(slicestarts, slicestops, slicecontent, tail)
File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/listarray.py:497, in ListArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
495 sliceoffsets = slicecontent._offsets
--> 497 outcontent = next_content._getitem_next_jagged(
498 sliceoffsets[:-1], sliceoffsets[1:], slicecontent._content, tail
499 )
501 return ak.contents.ListOffsetArray(
502 outoffsets, outcontent, parameters=self._parameters
503 )
File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/listarray.py:544, in ListArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
535 assert (
536 outoffsets.nplike is self._backend.index_nplike
537 and nextcarry.nplike is self._backend.index_nplike
(...)
542 and self._stops.nplike is self._backend.index_nplike
543 )
--> 544 self._maybe_index_error(
545 self._backend[
546 "awkward_ListArray_getitem_jagged_apply",
547 outoffsets.dtype.type,
548 nextcarry.dtype.type,
549 slicestarts.dtype.type,
550 slicestops.dtype.type,
551 sliceindex.dtype.type,
552 self._starts.dtype.type,
553 self._stops.dtype.type,
554 ](
555 outoffsets.data,
556 nextcarry.data,
557 slicestarts.data,
558 slicestops.data,
559 slicestarts.length,
560 sliceindex.data,
561 sliceindex.length,
562 self._starts.data,
563 self._stops.data,
564 self._content.length,
565 ),
566 slicer=ak.contents.ListArray(slicestarts, slicestops, slicecontent),
567 )
568 nextcontent = self._content._carry(nextcarry, True)
File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/content.py:271, in Content._maybe_index_error(self, error, slicer)
270 message = self._backend.format_kernel_error(error)
--> 271 raise ak._errors.index_error(self, slicer, message)
IndexError: cannot slice ListArray (of length 8) with [[1], [0], [0], [1], [1], [0, 2, 3], [0], [1]]: index out of range while attempting to get index 3 (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-17/awkward-cpp/src/cpu-kernels/awkward_ListArray_getitem_jagged_apply.cpp#L43)
The above exception was the direct cause of the following exception:
IndexError Traceback (most recent call last)
Cell In[5], line 1
----> 1 fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets_constituents(njets=2)
File ~/anaconda3/lib/python3.10/site-packages/fastjet/_pyjet.py:147, in AwkwardClusterSequence.exclusive_jets_constituents(self, njets)
146 def exclusive_jets_constituents(self, njets=10):
--> 147 return self._internalrep.exclusive_jets_constituents(njets)
File ~/anaconda3/lib/python3.10/site-packages/fastjet/_multievent.py:307, in _classmultievent.exclusive_jets_constituents(self, njets)
305 duplicate = ak.unflatten(np.zeros(total, np.int64), shape)
306 prepared = self.data[:, np.newaxis][duplicate]
--> 307 return prepared[outputs_to_inputs]
File ~/anaconda3/lib/python3.10/site-packages/awkward/highlevel.py:949, in Array.__getitem__(self, where)
520 def __getitem__(self, where):
521 """
522 Args:
523 where (many types supported; see below): Index of positions to
(...)
947 have the same dimension as the array being indexed.
948 """
--> 949 with ak._errors.SlicingErrorContext(self, where):
950 out = self._layout[where]
951 if isinstance(out, ak.contents.NumpyArray):
File ~/anaconda3/lib/python3.10/site-packages/awkward/_errors.py:63, in ErrorContext.__exit__(self, exception_type, exception_value, traceback)
60 try:
61 # Handle caught exception
62 if exception_type is not None and self.primary() is self:
---> 63 self.handle_exception(exception_type, exception_value)
64 finally:
65 # `_kwargs` may hold cyclic references, that we really want to avoid
66 # as this can lead to large buffers remaining in memory for longer than absolutely necessary
67 # Let's just clear this, now.
68 self._kwargs.clear()
File ~/anaconda3/lib/python3.10/site-packages/awkward/_errors.py:78, in ErrorContext.handle_exception(self, cls, exception)
76 self.decorate_exception(cls, exception)
77 else:
---> 78 raise self.decorate_exception(cls, exception)
IndexError: cannot slice ListArray (of length 8) with [[1], [0], [0], [1], [1], [0, 2, 3], [0], [1]]: index out of range while attempting to get index 3 (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-17/awkward-cpp/src/cpu-kernels/awkward_ListArray_getitem_jagged_apply.cpp#L43)
This error occurred while attempting to slice
<Array [[[{px: -0.181, ...}, ...], ...], ...] type='4 * var * var * {px...'>
with
<Array [[[1], [0]], [[0], ...], ..., [[0], [1]]] type='4 * var * var * int32'>
due to trying to use the out of range indices.
This is not a problem if we just run the unmasked array
fastjet.ClusterSequence(eg, jetdef).exclusive_jets_constituent_index(njets=2)
which gives
[[[1], [0]],
[[0], [1]],
[[0], [1]],
[[0], [1]],
[[0], [1]]]
---------------------------
type: 5 * var * var * int32
Thank you for the help on this issue, it will be greatly appreciated!
~~I'm assuming that the error happens in eg[mask], not in fastjet.ClusterSequence or exclusive_jets_constituent_index, right?~~
~~Where does ak.num(eg, axis=1) not equal ak.num(mask, axis=1)?~~
~~Or if str(mask.type) has two "vars" just like str(eg.type), where does ak.num(eg, axis=2) not equal ak.num(mask, axis=2)?~~
~~That would be the first step. The next step is figuring out why they're not equal, because computing eg[mask] presupposes that they fit together.~~
I made the wrong assumption. You showed that the error definitely happens in exclusive_jets_constituents, but it looks like an Awkward slicing error, somewhere in its Python implementation.
Hi @jpivarski, OK! So .... what is best for us to do here? Would you like us to open an issue in Awkward, or should we leave that to you? Since the working example we have relies on Fastjet, it's not entirely clear to us how to provide useful feedback to Awkward on this. Thoughts welcome!!
The crossed-out part was assuming that it's a bug in Awkward (in eg[mask]). It might still be, but it's also possible that exclusive_jets_constituents is using it wrong.
It's implemented here:
https://github.com/scikit-hep/fastjet/blob/0ede4d336453d327184bc7b3af1988e19e8f73bb/src/fastjet/_multievent.py#L298-L307
You've verified that exclusive_jets_constituent_index returns a value, so it should be possible to walk through each one of those steps. Do the results of the intermediate steps look right?
Hi @jpivarski, the error is thrown on line 307 when slicing prepared. The real issue seems to be happening before that in exclusive_jets_constituent_index since it is returning indices out of range. I tried to look into this a bit but all I could tell was that line 208 already had these out of range indices returned from to_numpy_exclusive_njet_with_constituents. I couldn't find a way to look any further into this.
So exclusive_jets_constituent_index returned a value, but it was the wrong value because there are indexes out of bounds. (It also means that there is no pure Python work-around, since you can't switch to using exclusive_jets_constituent_index instead of exclusive_jets_constituent.)
I followed exclusive_jets_constituent_index back to the C++ routine, which is to_numpy_exclusive_njet_with_constituents. The error must be in here:
https://github.com/scikit-hep/fastjet/blob/0ede4d336453d327184bc7b3af1988e19e8f73bb/src/_ext.cpp#L365-L436
which is unpacking the NumPy array inputs, constructing FastJet objects, running FastJet algoriths, and then packing the results into NumPy array outputs. Since the issue is that an index is off, it's probably not the translation between NumPy arrays and FastJet objects, but in handling the FastJet objects. Maybe an off-by-one error somewhere?
Hi, after looking into this a bit more I've realized that this is a problem with more than just exclusive_jets_constituent_index, though that is the function where the differences are apparent enough to cause noticeable problems. I believe the problem is most likely occurring in ClusterSequence. If I take the same example as I originally used, with the array eg, there are problems happening in the same place for the arrays returned from the functions exclusive_jets and inclusive_jets as well. This is noticeable when mask is applied before clustering vs not applied until afterwards.
>>> fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets(n_jets=2)
[[{px: 0.377, py: 0.116, pz: -0.0749, E: 0.425}, {px: -0.181, ...}],
[{px: 0.54, py: -0.65, pz: -0.00527, E: 0.857}, {px: 0.253, ...}],
[{px: 4.65, py: -1.88, pz: -3.29, E: 6}, {px: 14.5, py: -1.73, ...}],
[{px: -3.55, py: -1.64, pz: -0.0941, E: 3.91}, {px: -1.33, py: ..., ...}]]
---------------------------------------------------------------------------
type: 4 * var * Momentum4D[
px: float64,
py: float64,
pz: float64,
E: float64
]
>>> fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)
[[{px: 0.377, py: 0.116, pz: -0.0749, E: 0.425}, {px: -0.181, ...}],
[{px: 0.54, py: -0.65, pz: -0.00527, E: 0.857}, {px: 0.253, ...}],
[{px: 0.294, py: 0.254, pz: -0.259, E: 0.467}, {px: 4.65, py: ..., ...}],
[{px: 1.45, py: -0.179, pz: -0.876, E: 1.71}, {px: 12.8, py: -1.8, ...}],
[{px: -3.55, py: -1.64, pz: -0.0941, E: 3.91}, {px: -1.33, py: ..., ...}]]
---------------------------------------------------------------------------
type: 5 * var * Momentum4D[
px: float64,
py: float64,
pz: float64,
E: float64
]
>>> mask
[True,
True,
False,
True,
True]
--------------
type: 5 * bool
Specifically I noticed that fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets(n_jets=2)[2] should be equal to fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[mask][2] (which is equivalent to fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[3]). However, we have
>>> fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets(n_jets=2)[2]
[{px: 4.65, py: -1.88, pz: -3.29, E: 6},
{px: 14.5, py: -1.73, pz: -8.31, E: 17}]
-----------------------------------------
type: 2 * Momentum4D[
px: float64,
py: float64,
pz: float64,
E: float64
]
>>> fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[mask][2]
[{px: 1.45, py: -0.179, pz: -0.876, E: 1.71},
{px: 12.8, py: -1.8, pz: -7.18, E: 14.8}]
---------------------------------------------
type: 2 * Momentum4D[
px: float64,
py: float64,
pz: float64,
E: float64
]
Instead, fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets(n_jets=2)[2][0] is equal to fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[2][1].
Also, fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets(n_jets=2)[2][1] is equal (within a decimal place) to the sum of fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[2][0], fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[3][0], and fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[3][1].
Please let me know if any of this is unclear and thank you for all the help with this!
You are absolutely right. The way we handle masked arrays is wrong. Running
px, py, pz, E, offsets = self.extract_cons(self.data)
print(offsets)
gives
array([ 0, 2, 4, 8, 10])
The correct layout is
>>> eg[mask].layout
<ListArray len='4'>
<starts><Index dtype='int64' len='4'>
[0 2 6 8]
</Index></starts>
<stops><Index dtype='int64' len='4'>
[ 2 4 8 10]
</Index></stops>
...
This explains why you see all these extra elements between offsets 4 and 8. I think we should implement this function with start & stop offsets instead of using only the stop as the offset. @jpivarski any suggestions would be very welcome. Do you think my proposal makes any sense?
It sounds like a good idea. (I haven't looked deeply into the details.)
@chrispap95 / @jpivarski any movement on this? It's blocking progress for future collider analysis, it would nice to have it sorted out.