[Python] How to handle chunked arrays output in pyarrow.array(...)
From https://github.com/apache/arrow/pull/34289#pullrequestreview-1355094099
Currently, the pyarrow.array(..) constructor is meant to create an Array object, but it can return a ChunkedArray instead in two cases: 1) the object is too big to fit into a single array (e.g. the offsets get too large for a single StringArray), and 2) the object has an __arrow_array__ method that returns a ChunkedArray.
However, if this starts to happen more and more, that can be annoying for places in our code where we assume pyarrow.array(..) gives us an Array, see for example https://github.com/apache/arrow/issues/33727#issuecomment-1387323624. For this specific case, we updated pyarrow.array(..) to special-case chunked arrays with only 1 chunk and unpack them into a normal Array, since that is an easy zero-copy conversion (done in https://github.com/apache/arrow/pull/34289).
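As a rough sketch of what that special case does (illustrative only; the real logic lives inside pa.array(..)):

import pyarrow as pa

# Stand-in for what an object's __arrow_array__ might hand back:
result = pa.chunked_array([[1, 2, 3]])

# PR #34289 unwraps a single-chunk ChunkedArray into a plain Array,
# which is zero-copy since the one chunk can be returned as-is:
if isinstance(result, pa.ChunkedArray) and result.num_chunks == 1:
    result = result.chunk(0)

assert isinstance(result, pa.Array)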
Longer term, what do we want to do with pyarrow.array(..) returning chunked arrays?
For example, passing a pandas.Series to pyarrow.array(..) can easily give a ChunkedArray:
>>> import pandas as pd
>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, 2], [3, 4]])
>>> ser = pd.Series(arr, dtype=pd.ArrowDtype(arr.type))
>>> ser
0    1
1    2
2    3
3    4
dtype: int64[pyarrow]
>>> pa.array(ser)
<pyarrow.lib.ChunkedArray object at 0x7effe3f9ea90>
[
[
1,
2
],
[
3,
4
]
]
Some thoughts:
- To ensure you can rely more on pa.array(..) to actually return an Array, we could concat chunked arrays in the example above, and then users could use pa.asarray(..) to get either Array or ChunkedArray.
- Keep pa.array(..) as flexible, giving either Array/ChunkedArray, but add another function that is more strict, or a helper that ensures we always have an Array and concats chunks if necessary, which could then be used internally where needed (a minimal sketch of such a helper follows this list).
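A minimal sketch of the strict helper idea (ensure_array is a hypothetical name, not existing pyarrow API):

import pyarrow as pa

def ensure_array(obj, type=None):
    # Hypothetical helper: always hand back a pa.Array, even when
    # pa.array(..) produces a ChunkedArray.
    result = pa.array(obj, type=type)
    if isinstance(result, pa.ChunkedArray):
        # combine_chunks() concatenates all chunks into one
        # contiguous Array.
        return result.combine_chunks()
    return result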
Maybe this is a tangent, but I think this question gets at how complex we want arrays to be. I sometimes wish that whether an array is chunked or not were an implementation detail rather than a top-level type. This is especially true when considered in combination with other array differences. A good example of this is string arrays: between chunked and contiguous, index size, and encodings, there are 36 possible string array data types, which are represented as 5 possible classes (ChunkedArray, StringArray, LargeStringArray, RunEndEncodedArray, DictionaryArray).
All 36 string arrays in PyArrow:

import pyarrow as pa

strings = ["hello", "world"]

# Can have i32 or i64 offsets:
pa.array(strings, pa.utf8())
pa.array(strings, pa.large_utf8())

# Can also be chunked:
pa.chunked_array([strings], pa.utf8())
pa.chunked_array([strings], pa.large_utf8())

# Can be dictionary encoded (with different index widths):
pa.array(strings, pa.dictionary(pa.int32(), pa.utf8()))
pa.array(strings, pa.dictionary(pa.int8(), pa.large_utf8()))

# Can be run-end encoded (depending on the pyarrow version, actually
# building these may require pyarrow.compute.run_end_encode):
pa.array(strings, pa.run_end_encoded(pa.int32(), pa.utf8()))

# Can be any combination of the above:
pa.chunked_array([strings], pa.run_end_encoded(pa.int32(), pa.dictionary(pa.int8(), pa.utf8())))

num_possible_indices = 2       # i32 or i64
num_possible_chunking = 2      # contiguous or chunked
num_possible_encodings = 3     # dictionary, ree, or ree + dictionary
num_possible_dictionary_index = 4
# 2 * 2 * ((2 * 4) + 1) = 36 possible string arrays in arrow
These can be one of the following Python classes:
ChunkedArray
StringArray
LargeStringArray
RunEndEncodedArray
DictionaryArray
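To make that mapping concrete, a small sketch of which class each construction yields (assuming pyarrow >= 12 for run-end encoding; pyarrow.compute.run_end_encode is used here since it reliably produces a RunEndEncodedArray):

import pyarrow as pa
import pyarrow.compute as pc

strings = ["hello", "world"]
print(type(pa.array(strings, pa.utf8())).__name__)        # StringArray
print(type(pa.array(strings, pa.large_utf8())).__name__)  # LargeStringArray
print(type(pa.array(strings, pa.dictionary(pa.int32(), pa.utf8()))).__name__)  # DictionaryArray
print(type(pc.run_end_encode(pa.array(strings))).__name__)    # RunEndEncodedArray
print(type(pa.chunked_array([strings], pa.utf8())).__name__)  # ChunkedArray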
I'd be in favor of keeping pa.array() returning either Array/ChunkedArray, since it's a high-level function and I think I'd rather our higher-level APIs not care as much about the buffer layout.
What about adding a chunk_if_needed param to pa.array? It could default to True, and the existence of the parameter should inform the user of the possibilities.
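A rough sketch of how such a parameter could behave (chunk_if_needed is hypothetical, not an existing pa.array parameter; this just wraps today's behavior):

import pyarrow as pa

def array(obj, type=None, chunk_if_needed=True):
    # Hypothetical wrapper illustrating the proposed parameter.
    result = pa.array(obj, type=type)
    if not chunk_if_needed and isinstance(result, pa.ChunkedArray):
        # The caller opted out of chunked results, so concatenate
        # (this copies when there is more than one chunk).
        return result.combine_chunks()
    return result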
there are 36 possible string array data types which are represented as 5 possible classes (ChunkedArray, StringArray, LargeStringArray, RunEndEncodedArray, DictionaryArray)
Note that StringArray, LargeStringArray, RunEndEncodedArray, and DictionaryArray are all subclasses of Array. So I still think ChunkedArray is a unique problem.
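That hierarchy is easy to check directly (RunEndEncodedArray requires pyarrow >= 12):

import pyarrow as pa

print(issubclass(pa.StringArray, pa.Array))         # True
print(issubclass(pa.LargeStringArray, pa.Array))    # True
print(issubclass(pa.DictionaryArray, pa.Array))     # True
print(issubclass(pa.RunEndEncodedArray, pa.Array))  # True
print(issubclass(pa.ChunkedArray, pa.Array))        # False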
In my fantasy world chunked arrays don't exist and tables are just lists of record batches :)
Cross referencing an external Apache Spark Pandas UDF bug which refers to this issue: https://issues.apache.org/jira/browse/SPARK-46776