arrow icon indicating copy to clipboard operation
arrow copied to clipboard

[Python][Doc] Clarify docstring of FixedSizeListArray.values that it ignores the offset

Open pvardanis opened this issue 1 year ago • 12 comments

We clarified the docstring of ListArray.values a while ago to make it clear it ignores the offset: https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html#pyarrow.ListArray.values

We should do the same for FixedSizeListArray.values, as that right now is less explicit: https://arrow.apache.org/docs/python/generated/pyarrow.FixedSizeListArray.html#pyarrow.FixedSizeListArray.values


Original report:

Describe the bug, including details regarding any error messages, version, and platform.

I have a pa.FixedSizeListArray that looks like this:

<pyarrow.lib.FixedSizeListArray object at 0x13f9b45e0>
[
  [
    [
      1,
      2
    ],
    [
      3,
      4
    ]
  ],
  [
    [
      5,
      6
    ],
    [
      7,
      8
    ]
  ]
]

I'm trying to flatten each row in the pa.FixedSizeListArray using .values. Doing array[0].values, returns:

<pyarrow.lib.FixedSizeListArray object at 0x13f8e7fa0>
[
  [
    1,
    2
  ],
  [
    3,
    4
  ]
]

but doing array[0].values.values returns the initial array flattened, not the one that I want:

[
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8
]

Weird thing is that array[0].values.flatten() returns what it's supposed to:

[
  1,
  2,
  3,
  4
]

However, I don't want to use flatten() since this discards Null values. Is this a bug or am I missing something?

Component(s)

Python

pvardanis avatar May 15 '24 14:05 pvardanis

Thanks for reporting this @pvardanis. This might be expected as .values accesses underlying storage which will start at the beginning of an array. Could you say what version of pyarrow are you on? Also your array seems nested, which I'm not sure how to reproduce, can you give an example?

rok avatar May 15 '24 14:05 rok

@rok I'm using 12.0.1 but can also reproduce the bug in 16.1.0. This is how to create the array:

array_type = pa.list_(pa.list_(pa.int32(), list_size=2), list_size=2)
array = pa.array(
        [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
        type=array_type,
    )

pvardanis avatar May 15 '24 15:05 pvardanis

@pvardanis I can confirm this is still behavior on main branch. As per docs this is expected (but unintuitive):

values Return the underlying array of values which backs the FixedSizeListArray.

The reason this happens is that getting .values.values will give you a zero-copy reference to the underlying storage and loose the offset of value whose values you were getting. This has utility in that we avoid a copy. But it is somewhat unintuitive. .flatten on the other hand creates a copy from the offset you're getting and returns the correct values. You can work around this by either calculating your own offset and slice the .values.values or by doing a array.take(i).values (which will copy values at i) or you can use .flatten which will probably do the same as .take(i) (I'm not sure about this, maybe flatten copies the whole array).

@jorisvandenbossche would you say this is approximately right?

rok avatar May 15 '24 16:05 rok

@rok

array.take([0]).values.values

seems to be working, however I was hoping for a zero copy solution.

array.slice(offset=0, length=1).values.values

still returns the whole array. Also, I don't want to use .flatten since it gets rid of Null values.

pvardanis avatar May 15 '24 16:05 pvardanis

@pvardanis By calculating your own offset I mean something like this:

i = 1
array.values.values.slice(offset=i * array.type.list_size, length=array.type.list_size)
<pyarrow.lib.Int32Array object at 0x71226477b2e0>
[
  3,
  4
]

I think this should be zero-copy.

rok avatar May 15 '24 17:05 rok

@rok that works thanks!

pvardanis avatar May 16 '24 16:05 pvardanis

Great to hear! Closing the issue. If there's more reports like this we should maybe consider some kind of .flattened_slice_view method?

rok avatar May 16 '24 23:05 rok

Also, I don't want to use .flatten since it gets rid of Null values.

@pvardanis watch out for the fact that null lists can be "non-empty" in list arrays (the values array might contain values for that slot).

And in fixed_size_list arrays, the values in [i..i + fsl.list_size) are not necessarily null themselves. They could be anything.

felipecrv avatar May 17 '24 02:05 felipecrv

We clarified the docstring of ListArray.values a while ago to make it clear it ignores the offset: https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html#pyarrow.ListArray.values

We should do the same for FixedSizeListArray.values, as that right now is less explicit: https://arrow.apache.org/docs/python/generated/pyarrow.FixedSizeListArray.html#pyarrow.FixedSizeListArray.values

BTW, an important reason why the .values for a ListArray returns the non-offsetted child array is to ensure this still matches with the .offsets of a sliced ListArray (the offsets array itself will be sliced as well, but the actual values in it don't change. So for those offset values to be correct indices into the values array, that values array should still be the full non-sliced version).

Of course, that reason is not important for FixedSizeListArray given there are no offsets in this case, but it also makes sense to have .values be consistent between both.

jorisvandenbossche avatar May 17 '24 08:05 jorisvandenbossche

I am going to reopen this issue and rephrase the title as a documentation issue to clarify the docstring.

jorisvandenbossche avatar May 17 '24 08:05 jorisvandenbossche

Of course, that reason is not important for FixedSizeListArray given there are no offsets in this case...

Technically the offsets exist the same way, but are implicit because they can be computed with (arr.offset + i) * list_size.

Child arrays of nested layouts are required to keep the prefix-padding on the buffer when an upper level .offset > 0 to make it so that slicing an array, doesn't have to slice recursively. This makes the zero-copy .values that @pvardanis wants to do, a bit trickier, but on the other hand makes slicing arrays constant-time instead of O(depth_of_the_layout)

felipecrv avatar May 17 '24 19:05 felipecrv

@jorisvandenbossche perhaps *ListArray and FixedSizeListArray need a sliced_values() accessor. For list-views sliced_values would be hard to compute and explain to users though (because offsets are not necessarily sorted in list-views).

felipecrv avatar May 17 '24 19:05 felipecrv

I would like to claim this issue if it's still available.

PeopleMakeCulture avatar Jun 06 '24 15:06 PeopleMakeCulture

@PeopleMakeCulture the documentation PR is welcome, feel free to open a PR.

rok avatar Jun 06 '24 16:06 rok

Hey @PeopleMakeCulture - as this has been open a while without a PR, I'm gonna grab it, hope you don't mind! (Also, hey Jing, didn't realise this is you, see you at RC in a few weeks hopefully!)

thisisnic avatar Apr 15 '25 08:04 thisisnic

Issue resolved by pull request 46144 https://github.com/apache/arrow/pull/46144

AlenkaF avatar Apr 15 '25 10:04 AlenkaF