awkward icon indicating copy to clipboard operation
awkward copied to clipboard

Fix: jagged slicing for `ListArray`

Open agoose77 opened this issue 3 years ago • 13 comments

As discussed below, this PR is intended to address #1406 and #1405, but the true fixes might be best in separate PRs. This PR is primarily here for a central discussion point around these bugs.

agoose77 avatar Apr 11 '22 15:04 agoose77

Codecov Report

Merging #1408 (4045c8d) into main (9e17f29) will increase coverage by 0.46%. The diff coverage is 63.84%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/_v2/_connect/cuda/__init__.py 0.00% <0.00%> (ø)
src/awkward/_v2/_connect/numexpr.py 88.40% <0.00%> (ø)
src/awkward/_v2/_connect/pyarrow.py 88.46% <0.00%> (ø)
...awkward/_v2/_connect/rdataframe/from_rdataframe.py 0.00% <0.00%> (ø)
...c/awkward/_v2/_connect/rdataframe/to_rdataframe.py 0.00% <0.00%> (ø)
src/awkward/_v2/_lookup.py 98.68% <ø> (+1.17%) :arrow_up:
src/awkward/_v2/numba.py 93.47% <0.00%> (ø)
src/awkward/_v2/operations/ak_from_avro_file.py 66.66% <0.00%> (ø)
src/awkward/_v2/operations/ak_local_index.py 100.00% <ø> (ø)
src/awkward/_v2/operations/ak_max.py 87.09% <ø> (ø)
... and 63 more

codecov[bot] avatar Apr 11 '22 16:04 codecov[bot]

Is this addressing both #1405 and #1406? If so, adding them to the Development thing on the right-bar would be a good idea, to ensure that we don't forget to close the issues when this PR fixes them.

It is entirely plausible that a kernel is missing the starts[0] != 0 case, since starts[0] == 0 is so common. If that's what it is, it can also be relatively easy to diagnose. Once you've narrowed it down to a particular kernel (it looks like you have), print out the data going into and coming out of that kernel and manually increase all of the starts and stops by 1, 2, 3, etc.

I usually debug such things with a command-line one-liner, like this:

python -c 'import awkward as ak, numpy as np; a = ak._v2.Array([[1, 2, 3], [], [4, 5]]); a.layout.content._data = np.tile(a.layout.content._data, 2); a.layout.offsets._data += 1; print(a.layout)'

with print(a.layout) replaced by the call that triggers the kernel, and the handle on 1, 2, 3, etc. is on the right-hand side of the +=. Then up-arrow, modify, up-arrow, modify.

This is hacking, pure and simple, using private accessors to modify things in place and print-outs in the source code, but it narrows down on the issue a lot faster than a debugging library. (Which is why it's great that Python discourages hacking without disallowing it. It would take much longer to understand what's wrong if this were in Java...)

What you should expect to see is values going into or coming from the kernel not moving up and down as you vary the offset (or maybe going up and down when they shouldn't be).

jpivarski avatar Apr 11 '22 16:04 jpivarski

Right now, this is really just a staging ground for looking at the issue in #1405 / #1406. I think that there are several bugs going on here, some that are v2 specific, and some that are v1-v2. My plan would be to separate these into different PRs. So, this PR is really more about centralising a proof-of-bug ahead of actually fixing anything :)

The issue as I see it is not actually the kernel. I have no reason to believe that the kernel isn't doing what it is supposed to be. In #1406 I explored my early findings that it seems we are just doing the wrong thing, namely that we're taking the carry of one array and applying it to another. Certain implementation details / invariants mean that we don't see the bug that often, but I think fundamentally we need to extend it.

Often in these kernel bugfixing journeys I find one can go down a rabbit hole, but I am fairly confident we've narrowed down at least one problem-site.

agoose77 avatar Apr 11 '22 17:04 agoose77

@jpivarski OK, I've added a first set of tests that isolate one jagged slicing bug -> inner contents might not be mapped onto by the outer list/index types, but we currently raise an Exception in the ListArray _getitem_next_jagged if the list has a different length to the index.

agoose77 avatar Apr 11 '22 18:04 agoose77

OK, next set of tests look at a separate (though likely not independent) bug on the ListArray::_getitem_next_jagged(slicecontent=IndexedOptionArray(ListOffsetArray codepath: https://github.com/scikit-hep/awkward-1.0/blob/edfce3800ccd1f8b90734cf9594c3b5c117441ff/src/awkward/_v2/contents/listarray.py#L472-L476

As explained in my above comment (#1406), we're just taking the indices that map into the slicecontent and applying them to the IndexedOptionArray content. I believe this is wrong - in my debugging we end up with a ListArray with starts/stops equal to the offsets 0, 2, 2, rather than 2, 2, 3. This means that when Awkward tries to pull the 0th item of the second sublist, it fails as it has length 0.

agoose77 avatar Apr 11 '22 18:04 agoose77

This commit (d652c81) fixes #1406 by first computing the carry that would be required to convert the ListArray into a ListOffsetArray starting at 0. We definitely have a kernel for this, but it was easy just to write a nplike to do this for a proof of concept.

agoose77 avatar Apr 11 '22 19:04 agoose77

It is an explicit goal to replace kernels with nplike calls if there are no correctness or performance issues in doing so. That would make the v2 code different from the v1 code (which does still rely on kernels), but if you can reduce dependence on kernels in v2, more of them will be removed when we remove v1 support. That will be less code to maintain, test, and port to CUDA.

The "performance" exception cited above is if the nplike calls require more passes over the data. If one kernel is replaced by multiple nplike function applications, it might not be a good replacement. (It's likely that some algorithms implemented entirely in nplike will be going the other way, if they can be fewer passes/faster in a single kernel call.)

jpivarski avatar Apr 11 '22 19:04 jpivarski

I have noticed the odd kernel to which this applies, so I will keep that in mind! Here I'm not familiar enough with the slicing code to understand whether this is a symptom of a larger problem or not, hence the "HACK" commit message. I'm putting this down for now, as I think I've identified some of the bugs / solutions and I've run out of time!

agoose77 avatar Apr 11 '22 19:04 agoose77

Added small fix to convert the index to the appropriate array type. At present, the typetracer code does not support the magic methods for addition/subtraction, so this doesn't work. I haven't looked at the existing typetracer usage enough to know whether I need to handle this with a special case, or whether we should add support for these operations to typetracer (e.g. if that even makes sense in the context of what typetracer does). I've noticed that we have a number of places where we special case typetracer. I suspect that we will always need a bit of this, but perhaps we can further develop the abstraction to hide it?

agoose77 avatar Apr 16 '22 11:04 agoose77

I'm getting this out of context, but the TypeTracer is usually used wrapped in a high-level ak._v2.Array, which has those magic methods and passes them on to the underlying layout (containing TypeTracerArrays in this case), which eventually goes to the TypeTracer nplike's ufuncs. Maybe the latter are not implemented—it would be easy to add that, since it would only need to pass on whatever shape it gets and predict a dtype—but no new methods like __add__ and __sub__ should be implemented.

jpivarski avatar Apr 16 '22 20:04 jpivarski

@jpivarski in this context, I'm referring to the TypeTracerArray that we see as the data associated with a layout, e.g. the starts of a ListArray.

In this trial PR, I've followed the convention of taking stops (an Index) and extracting the underlying array with nplike.asarray. There is precedence for assuming that this result is ndarray-like, e.g., ak_where where we anticipate that the array has __eq__ https://cs.github.com/scikit-hep/awkward-1.0/blob/6ff41fb299167333d2715cae7c0904d8cd5dba7e/src/awkward/_v2/operations/structure/ak_where.py?q=asarray#L116

In this PR, when one tries to compute nplike.asarray(stops) - ... for a TypeTracerArray, they will see an error as we do not implement TypeTracerArray.__sub__.

I think the reason that this is at all apparent is that with v2 there is a much softer boundary between what should be done by a kernel, and what should be done with the nplike interface. Practically speaking, the nplike + ndarray interface is a well-tested, well known subset of useful kernels. Because we can now use nplike to do things that would have previously taken a kernel, we need to define this boundary.

I think we currently have an assumed "narrow" interface for nplike which is a subset of useful functions (in particular, searchsorted and friends) + the basic NDArrayOperatorsMixin interface on the array object. The question that I raised earlier is, I suppose:

how much like an ndarray should the TypeTracerArray behave, and in defining this, how should we define this interface so that Awkward code does not accidentally use unsupported methods?

The latter sentence is not a huge problem - we should pick this up in our tests, but in general I like to have the semantics / API well defined.

Currently, TypeTracerArray is much more restrictive than the other ndarray-likes, so we would need special-case code testing for TypeTracerArray. Whilst we cannot entirely eliminate this code in cases where we would require a concrete result (e.g. taking the length of an array), I think we can minimise the need for such code by providing the basic suite of ndarray-like operations for TypeTracerArray

agoose77 avatar Apr 16 '22 21:04 agoose77

That's true! I stand corrected—the TypeTracerArray will need to have __add__ and __sub__ methods, as well as a few others that might be needed by the codebase, like array.any(). We managed to eliminate all explicit uses of NumPy in favor of nplike, which channels the code path through the nplike functions that we control, but the codebase does still include method calls on the array objects, including those like __add__. Fortunately, for a completely artificial type like TypeTracerArray (as opposed to pre-existing libraries like CuPy), we can control that code path as well.

But you're right that TypeTracerArray.__add__ and TypeTracerArray.__sub__ will need to be written. (That's already the case for UnknownScalar, in the same library. The good news is that these are simple: __add__ and __sub__ are probably not even going to be different from each other.)

Channeling everything through nplike was an attempt to define an API—the complete set of functions used in our codebase on the arrays backing Index and NumpyArray are or should be in the list in nplike.py. Unfortunately, methods, and especially operators, are not as well controlled. To make the list of functions in nplike.py, I could do a text-based search (with the assumption that we never import X from numpy). To find all of the methods and operators used, in an automated way, we'd need a type-checker like mypy, as well as all the annotations (and no typing.Any!). We don't have that.

However, just adding the methods whenever we hit them in testing does work. All it misses is the certainty of knowing that we got them all

jpivarski avatar Apr 16 '22 23:04 jpivarski

The good news is that these are simple: add and sub are probably not even going to be different from each other.

Absolutely, these functions will all be identical apart from __truediv__ (and maybe any others I've forgotten?)

[nplike] was an attempt to define an API ... However, just adding the methods whenever we hit them in testing does work. All it misses is the certainty of knowing that we got them all

Yes, I don't think this is a pressing concern yet. Once we add type hints, we can implement a protocol for the ndarray interface that we expect.

agoose77 avatar Apr 17 '22 09:04 agoose77

I modernized this PR because it's addressing a bug in the Prioritized Issues list.

As it is, the only tests that fail are

tests/v2/test_1405-slicing-untested-cases.py::test_index_unmapped
tests/v2/test_1502-getitem-jagged-issue1406.py::test_success_nonempty_list

which are directly intended to test #1405 and #1406, so at least this PR does no harm. (It can't be merged because it doesn't fix the issue it's intended to, but not breaking other things is a big plus.)

jpivarski avatar Aug 15 '22 20:08 jpivarski

OK, so the reason that v2 doesn't agree with v1 for this test case is that we don't use the same kernel in ListArray to traverse a jagged slice. ListArray.cpp uses ListArray_getitem_jagged_apply_64 whilst listarray.py uses awkward_ListArray_getitem_jagged_descend

Clearly, we then use different logic to traverse the slices, with v1 performing a carry, and v2 computing the dense offsets (i.e. ensuring the offsets match after calling self.toListOffsetArray(True)). Crucially, no carry is performed in the latter case.

Explicitly, jagged slicing is failing when we try to slice the entire inner ListOffsetArray:

ak._v2.contents.ListOffsetArray(
        ak._v2.index.Index64(np.array([0, 1, 2, 3], dtype=np.int64)),
        ak._v2.contents.NumpyArray(np.array([2, 2, 2], dtype=np.int64)),
)

Rather than the projected one, i.e.

ak._v2.contents.ListOffsetArray(
        ak._v2.index.Index64(np.array([0, 1, 2], dtype=np.int64)),
        ak._v2.contents.NumpyArray(np.array([2, 2, 2], dtype=np.int64)),
)

We have code that ensures that the content length matches the length of the slice, which fails without this projection.

My "fix" (ugly though it is) for listarray.py in d7a1d53 handled this for the option-type slicing, by carrying only the mapped items. @ioanaif do you know why we changed the kernel here? I'm trying to get up to speed. It seems as though do need to carry, but it might be better to do it with a proper kernel rather than lots of nplike code.

agoose77 avatar Aug 30 '22 11:08 agoose77

@jpivarski I just remembered that Ioanna is working on CLAD this week, so perhaps I'm better pinging you to ask about the kernel change in my above comment?

agoose77 avatar Aug 30 '22 20:08 agoose77

I think there's no reason that v1 and v2 should ever differ on what kernels they use. In some cases, v2 replaces trivial kernels with the corresponding nplike call, but I don't think it should ever replace one kernel for another. The "I don't think" is based on the fact that we didn't rethink any of the getitem logic from v1 to v2; it was ported.

However, in this case, you're looking at the wrong function. C++ can dispatch by argument type but Python can't, so there was some subtlety in the porting. The getitem_next_jagged function that you point to:

https://github.com/scikit-hep/awkward/blob/352b0dead74846ad2a56d385be4694ec87072a08/src/libawkward/array/ListArray.cpp#L1827-L1832

is actually the specialization of that function in which the third argument is SliceArray64 (would be Index64 in v2).

There's also one in which the third argument is SliceMissing64 (would be IndexedOptionArray in v2):

https://github.com/scikit-hep/awkward/blob/352b0dead74846ad2a56d385be4694ec87072a08/src/libawkward/array/ListArray.cpp#L1887-L1892

and one in which the third argument is SliceJagged64 (would be ListOffsetArray in v2):

https://github.com/scikit-hep/awkward/blob/352b0dead74846ad2a56d385be4694ec87072a08/src/libawkward/array/ListArray.cpp#L1975-L1980

The equivalent in Python is to do the dispatching manually within the function, in if statements. The implementation you pointed to is

https://github.com/scikit-hep/awkward/blob/352b0dead74846ad2a56d385be4694ec87072a08/src/awkward/_v2/contents/listarray.py#L322

but there's also

https://github.com/scikit-hep/awkward/blob/352b0dead74846ad2a56d385be4694ec87072a08/src/awkward/_v2/contents/listarray.py#L365

(huh? not Index64?) and

https://github.com/scikit-hep/awkward/blob/352b0dead74846ad2a56d385be4694ec87072a08/src/awkward/_v2/contents/listarray.py#L434-L436

and

https://github.com/scikit-hep/awkward/blob/352b0dead74846ad2a56d385be4694ec87072a08/src/awkward/_v2/contents/listarray.py#L556

(huh? slices can't be EmptyArray, thanks to the normalization described in https://github.com/scikit-hep/awkward/pull/1597#issuecomment-1227754769).

Both v1 and v2 use awkward_ListArray_getitem_jagged_descend on the SliceJagged64/ListOffsetArray case. But if you were debugging, following the wrong code path, that could have something to do with an error that you see, and maybe this code checking for the wrong slice types accounts for it. It certainly looks suspicious.

jpivarski avatar Aug 30 '22 22:08 jpivarski

I'm going to try to collect a complete list of getitem calls and what slice types they're assuming. Any line numbers correspond to main right now, which is commit dd2a3f400e29fc9ea908fc7d8267f592091457bb.

  • Content (content.py)
    • _getitem_next_field(self, head, tail, advanced)
    • _getitem_next_fields(self, head, tail, advanced)
    • _getitem_next_newaxis(self, tail, advanced)
    • _getitem_next_ellipsis(self, tail, advanced)
      • tail may contain Index64
    • _getitem_next_regular_missing(self, head, tail, advanced, raw, length)
      • head is Index64
    • _getitem_next_missing_jagged(self, head, tail, advanced, that)
      • head has content (is IndexedOptionArray)
    • _getitem_next_missing(self, head, tail, advanced)
      • head asserted to be IndexedOptionArray
    • _getitem(self, where)
      • where dispatched based on type (wide open to any user input)
  • BitMaskedArray
    • _getitem_nothing(self)
    • _getitem_at(self, where)
      • where assumed to be int
    • _getitem_range(self, where)
      • passed to ByteMaskedArray
    • _getitem_field(self, where, only_fields=())
      • passed to content
    • _getitem_fields(self, where, only_fields=())
      • passed to content
    • _getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
      • passed to ByteMaskedArray
    • _getitem_next(self, head, tail, advanced)
      • head dispatched by (), int, Index64, ListOffsetArray, str, list (of str), newaxis, Ellipsis, IndexedOptionArray
  • ByteMaskedArray
    • _getitem_nothing(self)
    • _getitem_at(self, where)
      • where assumed to be int
    • _getitem_range(self, where)
      • where assumed to be slice
    • _getitem_field(self, where, only_fields=())
      • passed to content
    • _getitem_fields(self, where, only_fields=())
      • passed to content
    • _getitem_next_jagged_generic(self, slicestarts, slicestops, slicecontent, tail)
      • no assumptions made about slicecontent (used as duck-typing)
    • _getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
      • passed to self._getitem_next_jagged_generic (following v1 C++ constraints)
    • _getitem_next(self, head, tail, advanced)
      • head dispatched by (), int, Index64, ListOffsetArray, str, list (of str), newaxis, Ellipsis, IndexedOptionArray
  • EmptyArray
    • _getitem_nothing(self)
    • _getitem_at(self, where)
    • _getitem_range(self, where)
    • _getitem_field(self, where, only_fields=())
    • _getitem_fields(self, where, only_fields=())
      • where assumed to be a list
    • _getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
      • no assumptions (errors out)
    • _getitem_next(self, head, tail, advanced)
      • head dispatched by (), int, slice, str, list (of str), newaxis, Ellipsis, Index64, ListOffsetArray, IndexedOptionArray
  • IndexedArray
    • _getitem_nothing(self)
    • _getitem_at(self, where)
      • where assumed to be int
    • _getitem_range(self, where)
      • where assumed to be slice
    • _getitem_field(self, where, only_fields=())
      • passed to content
    • _getitem_fields(self, where, only_fields=())
      • passed to content
    • _getitem_next_jagged_generic(self, slicestarts, slicestops, slicecontent, tail)
      • no assumptions made on slicecontent (duck typed)
    • _getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
      • passed to self._getitem_next_jagged_generic
    • _getitem_next(self, head, tail, advanced)
      • head dispatched by (), int, Index64, ListOffsetArray, str, list (of str), newaxis, Ellipsis, IndexedOptionArray
  • IndexedOptionArray
    • _getitem_nothing(self)
    • _getitem_at(self, where)
      • where assumed to be int
    • _getitem_range(self, where)
      • where assumed to be slice
    • _getitem_field(self, where, only_fields=())
      • passed to content
    • _getitem_fields(self, where, only_fields=())
      • passed to content
    • _getitem_next_jagged_generic(self, slicestarts, slicestops, slicecontent, tail)
      • no assumptions made on slicecontent (duck typed)
    • _getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
      • passed to self._getitem_next_jagged_generic
    • _getitem_next(self, head, tail, advanced)
      • head dispatched as (), int, slice, Index64, ListOffsetArray, str, list (of str), newaxis, Ellipsis, IndexedOptionArray
  • ListArray
    • _getitem_nothing(self)
    • _getitem_at(self, where)
      • where assumed to be int
    • _getitem_range(self, where)
      • where assumed to be slice
    • _getitem_field(self, where, only_fields=())
      • passed to content
    • _getitem_fields(self, where, only_fields=())
      • passed to content
    • _getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
      • slicecontent dispatched by ListOffsetArray, NumpyArray (!?!), IndexedOptionArray, EmptyArray (!?!)
    • _getitem_next(self, head, tail, advanced)
      • head dispatched by (), int, slice, str, list (of str), newaxis, Ellipsis, Index64, ListOffsetArray, IndexedOptionArray
  • ListOffsetArray
    • _getitem_nothing(self)
    • _getitem_at(self, where)
      • where assumed to be int
    • _getitem_range(self, where)
      • where assumed to be slice
    • _getitem_field(self, where, only_fields=())
      • passed to content
    • _getitem_fields(self, where, only_fields=())
      • passed to content
    • _getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
      • passed to ListArray._getitem_next_jagged
    • _getitem_next(self, head, tail, advanced)
      • head dispatched as (), int, slice, str, list (of str), newaxis, Ellipsis, Index64, ListOffsetArray, IndexedOptionArray
  • NumpyArray
    • _getitem_nothing(self)
    • _getitem_at(self, where)
      • where assumed to be int
    • _getitem_range(self, where)
      • where assumed to be slice
    • _getitem_field(self, where, only_fields=())
      • errors out
    • _getitem_fields(self, where, only_fields=())
      • errors out
    • _getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
      • may pass slicecontent to RegularArray._getitem_next_jagged
    • _getitem_next(self, head, tail, advanced)
      • head dispatched as (), int, slice, newaxis, Ellipsis, str, list (of str), Index64, ListOffsetArray, IndexedOptionArray
  • RecordArray
    • _getitem_nothing(self)
    • _getitem_at(self, where)
      • where assumed to be int
    • _getitem_range(self, where)
      • where assumed to be slice
    • _getitem_field(self, where, only_fields=())
      • where assumed to be str
    • _getitem_fields(self, where, only_fields=())
      • where assumed to be list (of str)
    • _getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
      • passed to each item of contents
    • _getitem_next(self, head, tail, advanced)
      • head dispatched as (), str, list (of str), IndexedOptionArray, or passed to the _getitem_next of each item of contents
  • RegularArray
    • _getitem_nothing(self)
    • _getitem_at(self, where)
      • where assumed to be int
    • _getitem_range(self, where)
      • where assumed to be slice
    • _getitem_field(self, where, only_fields=())
      • passed to content
    • _getitem_fields(self, where, only_fields=())
      • passed to content
    • _getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
      • passed to ListOffsetArray._getitem_next_jagged
    • _getitem_next(self, head, tail, advanced)
      • head dispatched as (), int, slice, str, list (of str), newaxis, Ellipsis, Index64, ListOffsetArray, IndexedOptionArray
  • UnionArray
    • _getitem_nothing(self)
    • _getitem_at(self, where)
      • where assumed to be int
    • _getitem_range(self, where)
      • where assumed to be slice
    • _getitem_field(self, where, only_fields=())
      • passed to each item of contents
    • _getitem_fields(self, where, only_fields=())
      • passed to each item of contents
    • _getitem_next_jagged_generic(self, slicestarts, slicestops, slicecontent, tail)
      • raises error if can't be simplified, otherwise passed to simplified._getitem_next_jagged
    • _getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
      • passed to _getitem_next_jagged_generic
    • _getitem_next(self, head, tail, advanced)
      • head dispatched to (), int, slice, Index64, ListOffsetArray, str, list (of str), newaxis, Ellipsis, IndexedOptionArray
  • UnmaskedArray
    • _getitem_nothing(self)
    • _getitem_at(self, where)
      • passed to content
    • _getitem_range(self, where)
      • where assumed to be slice
    • _getitem_field(self, where, only_fields=())
      • passed to content
    • _getitem_fields(self, where, only_fields=())
      • passed to content
    • _getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
      • passed to content
    • _getitem_next(self, head, tail, advanced)
      • head dispatched as (), int, slice, Index64, ListOffsetArray, str, list (of str), newaxis, Ellipsis, IndexedOptionArray

The only red flags (labeled by "!?!") are the ones you found in ListArray. The rest all look correct.

You can use this, by the way, as a step into typing all the getitem methods. These are all of them (they all have "getitem" in their names, and that's what I searched for to make this).

jpivarski avatar Aug 30 '22 23:08 jpivarski

@jpivarski thank you for such a thorough investigation. I did a visual trace-through by eye, and clearly mistook the execution pathway. I like using the CLion debugger, but it's less convenient to run both CLion and PyCharm from the same location. Once I reconfigured my editors, it was clear that indeed I was looking at the wrong method. Additionally, I must have included the kernel prefix that is missing in C++, so I really did lead us on a wild goose chase!

I've found the cause of the difference between v1 and v2 - https://github.com/scikit-hep/awkward/blob/352b0dead74846ad2a56d385be4694ec87072a08/src/awkward/_v2/contents/listarray.py#L190-L205 is a new pathway that ultimately produces a ListOffsetArray with a length that exceeds the offsets. Allowing un-trimmed layouts is a common feature in the codebase, so I'm assuming that the best way to handle this is to explicitly trim the content ourselves.

agoose77 avatar Aug 31 '22 12:08 agoose77

This isn't ready for review just yet. I've fixed a slicing issue in ListArray._getitem_next_jagged, and corrected an older test that is asserting the wrong output. In addition, I've copied this older v2 test into v1 to show that the bug exists in both versions. We'll need to skip that until it's fixed in turn.

There is also an ugly hack in ListArray._getitem_next_jagged that I added at the start of this PR. I'll need to rethink that, hence moving to draft status.

agoose77 avatar Aug 31 '22 14:08 agoose77

OK, that was surprisingly easy. It's the same fix in two places.

agoose77 avatar Aug 31 '22 15:08 agoose77