arrow icon indicating copy to clipboard operation
arrow copied to clipboard

[Python] How to handle chunked arrays output in pyarrow.array(...)

Open jorisvandenbossche opened this issue 1 year ago • 3 comments

From https://github.com/apache/arrow/pull/34289#pullrequestreview-1355094099

Currently, the pyarrow.array(..) constructor is meant to create Array object, but can return a ChunkedArray instead in two cases: 1) the object is too big to fit into a single array (eg offset gets too large for single StringArray), and 2) the object has a __arrow_array__ that returns a ChunkedArray.

However, if this starts to happen more and more, that can be annoying for places in our code where we assume pyarrow.array(..) gives us an Array, see for example https://github.com/apache/arrow/issues/33727#issuecomment-1387323624. For this specific case, we updated pyarrow.array(..) to special case chunked arrays with only 1 chunk to unpack this into a normal Array, since that's an easy zero-copy conversion (done in https://github.com/apache/arrow/pull/34289).

Longer term, what do we want to do with pyarrow.array(..) returning chunked arrays?

For example, passing a pandas.Series to pyarrow.array(..) can easily give a ChunkedArray:

>>> arr = pa.chunked_array([[1, 2], [3, 4]])
>>> ser = pd.Series(arr, dtype=pd.ArrowDtype(arr.type))
>>> ser
0    1
1    2
2    3
3    4
dtype: int64[pyarrow]
>>> pa.array(ser)
<pyarrow.lib.ChunkedArray object at 0x7effe3f9ea90>
[
  [
    1,
    2
  ],
  [
    3,
    4
  ]
]

Some thoughts:

  • To ensure you can rely more on pa.array(..) to actually return an Array, we could concat chunked arrays in the example above, and then users could use pa.asarray(..) to get either Array or ChunkedArray
  • Keep pa.array(..) as flexible giving either Array/ChunkedArray, but add other function that is more strict, or a helper that ensures we always have an Array and concats chunks if necesssary, which could then be used internally where needed.

jorisvandenbossche avatar Mar 28 '23 08:03 jorisvandenbossche

Maybe this is a tangent, but I think this question gets at how complex we want arrays to be. I sometimes wish whether an array is chunked or not were an implementation detail, rather than a top-level type. This is especially when considered in combination with other array differences. A good example of this is string arrays: between chunked and contiguous, indices size, and encodings, there are 36 possible string array data types which are represented as 5 possible classes (ChunkedArray, StringArray, LargeStringArray, RunEndArray, DictionaryArray).

All 36 string arrays in PyArrow
strings = ["hello", "world"]
# Can have i32 or i64 indices:
pa.array(strings, pa.utf8())
pa.array(strings, pa.large_utf8())
# Can also be chunked
pa.chunked_array(strings, pa.utf8())
pa.chunked_array(strings, pa.large_utf8())
# Can be dictionary encoded (with different indices width)
pa.array(strings, pa.dictionary(pa.int32(), pa.utf8()))
pa.array(strings, pa.dictionary(pa.int8(), pa.large_utf8()))
# Can be run-end encoded
pa.array(strings, pa.ree(pa.utf()))
# Can be any combination of the above
pa.chunked_array(strings, pa.ree(pa.dictionary(pa.int8(), pa.utf8())))
num_possible_indices = 2 # i32 or i64
num_possible_chunking = 2 # contiguous or chunked
num_possible_encodings = 3 # dictionary, ree, or ree + dictionary
num_possible_dictionary_index = 4

2 * 2 * ((2 * 4) + 1) = 36 possible string arrays in arrow

These can be one of the following Python classes:

ChunkedArray
StringArray
LargeStringArray
RunEndArray
DictionaryArray

I'd be in favor of keeping pa.array() returning either Array/ChunkedArray, since it's a high level function and I think I'd rather our higher-level APIs not care as much about the buffer layout.

wjones127 avatar Mar 29 '23 19:03 wjones127

What about adding a chunk_if_needed param to pa.array? It could default to True and the existence of the parameter should inform the user of the possibilities.

here are 36 possible string array data types which are represented as 5 possible classes (ChunkedArray, StringArray, LargeStringArray, RunEndArray, DictionaryArray)

Note that StringArray, LargeStringArray, RunEndArray, and DictionaryArray are all subclasses of Array. So I still think ChunkedArray is a unique problem.

In my fantasy world chunked arrays don't exist and tables are just lists of record batches :)

westonpace avatar Mar 31 '23 21:03 westonpace

Cross referencing an external Apache Spark Pandas UDF bug which refers to this issue: https://issues.apache.org/jira/browse/SPARK-46776

JoeEdwardsGitHub avatar Mar 12 '24 15:03 JoeEdwardsGitHub