
First class array/list type

Open wesm opened this issue 8 years ago • 2 comments

Similar to the ARRAY type found in SQL variants with nested types. See also the List type in Apache Arrow.

xref pydata/pandas#8517

wesm avatar Sep 17 '16 20:09 wesm

"First class" here means "not implemented using Python lists". You can interpret any array of type T as Array[T] by adding an array of offsets that encode the size and position of each list.

For example, the data

[[0, 1, 2],
 [3],
 [],
 [4, 5, 6]]

can be represented compactly as

offsets: [0, 3, 4, 4, 7]
data: [0, 1, 2, 3, 4, 5, 6]
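To make the encoding concrete, here is a minimal pure-Python sketch of the offsets-plus-data layout described above. The helper names `encode` and `row` are hypothetical and exist only for illustration; a real implementation would use typed buffers rather than Python lists.

```python
from itertools import accumulate


def encode(nested):
    """Turn a list of lists into (offsets, data). Hypothetical helper."""
    # offsets[i] is where row i starts in data; offsets[-1] equals len(data),
    # so there is always one more offset than there are rows.
    offsets = [0, *accumulate(len(r) for r in nested)]
    data = [value for r in nested for value in r]
    return offsets, data


def row(offsets, data, i):
    """Return row i, delimited by two consecutive offsets. Hypothetical helper."""
    return data[offsets[i]:offsets[i + 1]]


nested = [[0, 1, 2], [3], [], [4, 5, 6]]
offsets, data = encode(nested)
rows = [row(offsets, data, i) for i in range(len(offsets) - 1)]
```

Note that the empty list shows up as two equal consecutive offsets, so it costs nothing in the data buffer.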

There are other possible representations. This one is good because flattening (for a flatmap function or flatten) is zero-copy, and it is highly cache-efficient for scanning. The downside is that mutation is more costly. I would argue that we should not be encouraging such structures to be mutated anyway.
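A rough sketch of that trade-off, again in plain Python with hypothetical helper names (`flatten`, `append_to_row`). It assumes the offsets and data from the example above and is not meant as an actual implementation.

```python
offsets = [0, 3, 4, 4, 7]
data = [0, 1, 2, 3, 4, 5, 6]


def flatten(offsets, data):
    """Flattening needs no work: the flat buffer already is the result."""
    return data


def append_to_row(offsets, data, i, value):
    """Append value to row i, returning new (offsets, data).

    Every element after row i moves and every later offset is bumped,
    so the cost grows with the size of the whole array, not of the row.
    """
    end = offsets[i + 1]
    new_data = data[:end] + [value] + data[end:]
    new_offsets = offsets[:i + 1] + [o + 1 for o in offsets[i + 1:]]
    return new_offsets, new_data


flat = flatten(offsets, data)
new_offsets, new_data = append_to_row(offsets, data, 1, 99)
```

Here `flat` is the same object as `data`, while appending a single value to the second list rebuilds both buffers.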

wesm avatar Sep 17 '16 20:09 wesm

I like this idea. NumPy's multi-dimensional requirements make it really difficult to build an unboxed array of heterogeneously sized arrays.

chrisaycock avatar Sep 17 '16 21:09 chrisaycock