pandas2
First class array/list type
Similar to the ARRAY type found in SQL variants with nested types. See also the List type in Apache Arrow.
xref pydata/pandas#8517
"First class" here means "not implemented using Python lists". You can interpret any array of type T
as Array[T] by adding an array of offsets that encodes the size and position of each element.
For example, the data
[[0, 1, 2],
[3],
[],
[4, 5, 6]]
can be represented compactly as
offsets: [0, 3, 4, 4, 7]
data: [0, 1, 2, 3, 4, 5, 6]
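As a rough sketch of how this layout could be built and indexed, assuming NumPy arrays for both buffers (the names encode and get_item are made up for illustration and are not a pandas API):

```python
import numpy as np

def encode(nested):
    """Build (offsets, data) from a list of lists of integers."""
    lengths = [len(item) for item in nested]
    # offsets[i] is where item i starts; offsets[i + 1] is where it ends.
    offsets = np.zeros(len(nested) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum(lengths)
    data = np.array([v for item in nested for v in item], dtype=np.int64)
    return offsets, data

def get_item(offsets, data, i):
    """Return item i as a slice of data (a view, not a copy)."""
    return data[offsets[i]:offsets[i + 1]]

offsets, data = encode([[0, 1, 2], [3], [], [4, 5, 6]])
third = get_item(offsets, data, 2)   # the empty list
fourth = get_item(offsets, data, 3)  # the values 4, 5, 6
```

The length of item i is offsets[i + 1] - offsets[i], so the empty list costs nothing beyond one repeated offset.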
There are other possible representations. This one is good because flattening (for a flatmap function or flatten) is zero copy, and it's highly cache-efficient for scanning. The downside is that mutation is more costly. I would argue that we should not be encouraging such structures to be mutated anyway.
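To make the trade-off concrete, here is a minimal sketch, again assuming NumPy buffers and using hypothetical helper names (flatten, append_to_item): flattening hands back a view of the existing data buffer, while growing a single item forces a new data buffer and a shift of every later offset.

```python
import numpy as np

offsets = np.array([0, 3, 4, 4, 7], dtype=np.int64)
data = np.array([0, 1, 2, 3, 4, 5, 6], dtype=np.int64)

def flatten(offsets, data):
    # The values are already contiguous, so slicing the used range
    # returns a view that shares memory with data.
    return data[offsets[0]:offsets[-1]]

def append_to_item(offsets, data, i, value):
    # Appending to item i inserts at the end of its range, which copies
    # the whole data buffer and bumps every offset after item i.
    new_data = np.insert(data, offsets[i + 1], value)
    new_offsets = offsets.copy()
    new_offsets[i + 1:] += 1
    return new_offsets, new_data

flat = flatten(offsets, data)
is_view = np.shares_memory(flat, data)

new_offsets, new_data = append_to_item(offsets, data, 1, 99)
```

In this sketch a single append is linear in the total number of values, which is the cost the paragraph above refers to.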
I like this idea. NumPy's multi-dimensional requirements make it really difficult to build an unboxed array of heterogeneously sized arrays.