Arrow Support
Following on from some of the discussion in #1888, specifically here: https://github.com/python-pillow/Pillow/issues/1888#issuecomment-2112706708
Rationale
Arrow is the emerging memory layout for zero-copy sharing of data in the new data ecosystem. It is an uncompressed columnar format, specifically designed for interop between different implementations and languages. It can be viewed as the spiritual successor to the existing numpy array interface that we provide. The arrow format is supported by numpy 2, pandas 2, polars, pyarrow, arro3, and others in the python ecosystem.
What Support means
- The ability to export an image to an arrow array and read/process that data with no memory copies
- The ability to read an image from arrow array storage with no copies.
Technical Details
(Apache docs are here: https://arrow.apache.org/docs/format/Columnar.html)
An Arrow Schema is a set of metadata, containing type information, and potentially child schemas. An Arrow Array has an (implicitly) associated schema, metadata about the length of the storage, as well as a buffer of a contiguously allocated chunk of memory for the data. The Arrow Array will generally have the same parent/child arrangement as the schema structure.
obj.__arrow_c_schema__() must return a PyCapsule named "arrow_schema" containing an arrow schema struct. obj.__arrow_c_array__(schema=None) must return a tuple of the schema above and a PyCapsule named "arrow_array" containing an arrow array struct. The schema argument is advisory; the caller may request a particular format.
The lifetime of the Schema and Array structures is dependent on the caller -- so there are release callbacks that must be called when the caller is done with the memory. This complicates the lifetime of our image storage.
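For example, a minimal sketch of the protocol from the consumer's side, assuming an image object that implements it:

```python
from PIL import Image

im = Image.new("RGB", (2, 2))

# The two capsules described above; in practice a consumer library
# (pyarrow, arro3, ...) retrieves and unpacks these for you.
schema_capsule = im.__arrow_c_schema__()
schema_capsule2, array_capsule = im.__arrow_c_array__()
print(schema_capsule)  # <capsule object "arrow_schema" at 0x...>
print(array_capsule)   # <capsule object "arrow_array" at 0x...>
```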
We have two cases at the moment:
- single channel image
- multichannel image
A single channel image can be encoded as a single array of height*width items, using the type of the underlying storage. (e.g., uint8/int32/float32).
A multichannel image can be encoded in a similar manner, using 4*height*width items in the array. The caller would be responsible for knowing that it's 4 elements per pixel. It's also possible to use a parent type of a fixed-size list of 4 elements, with a child array of 4*height*width elements. The fixed-size lists are statically sized, so the underlying array is still the same contiguous block of memory. (A construction sketch follows the examples below.)
Flat:
<pyarrow.lib.FixedSizeListArray object at 0x106ad4280>
[
20,
21,
67,
255,
17,
18,
62,
255,
...
Nested:
<pyarrow.lib.FixedSizeListArray object at 0x106ad4280>
[
[
20,
21,
67,
255
],
[
17,
18,
62,
255
],
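A rough pyarrow sketch of the two encodings above, built from a hypothetical 2x1 RGBA image (the values are illustrative only):

```python
import pyarrow as pa

# Raw pixel data for a hypothetical 2x1 RGBA image, 4 bytes per pixel.
pixels = [20, 21, 67, 255, 17, 18, 62, 255]

# Flat: one uint8 per value, 4 * width * height entries in a single buffer.
flat = pa.array(pixels, type=pa.uint8())

# Nested: a fixed-size list of 4 uint8 values per pixel.
# The child values are the same contiguous buffer as the flat array.
nested = pa.FixedSizeListArray.from_arrays(flat, 4)
print(nested)
```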
An alternate encoding of a multichannel image would be to use a struct of channels, e.g. Struct[r,g,b,a]. This would require 4 child arrays, each allocated in a contiguous chunk, as in planar image storage. This is not compatible with our current storage.
While our core storage is generally compatible with this layout, there are three issues:
- The block allocator in ImagingAllocateArray packs a number of scanlines into 16 MB blocks, leaving empty space at the end of each block. This limits the exportable array to images that fit in a single 16 MB block. This is not an issue with the single-chunk ImagingAllocateBlock, which allocates the image in one chunk. (Note: these are the blocks used by the array allocator; arrow arrays fully work with the block allocator. Naming is hard.) It may be possible to work around this with the streaming interface.
- Some modes have line length padding (BGR;15, BGR;24), and will not work without copying.
- Some modes have ignored pixel bands (LA/PA). This is a documentation issue for consumers.
Implementation Notes
PR #8330 implements Pillow -> Arrow export for images that don't trip the above caveats.
There are no additional build or runtime dependencies. The arrow structures are designed to be copied into a header and used from there. (licensing is not an issue as those fragments are under an Apache License). There is an additional test dependency on PyArrow at the moment. In theory, numpy 2 could be used for this, but I'm not sure if we'd be testing the legacy array access or arrow access.
The lifetime of the core imaging struct is now separated from the imaging Python Object. There's effectively a refcount implemented for this -- there's an initial 1 for the image->im reference, every arrow array that references an image increments it, and calling ImagingDelete decrements it.
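A rough sketch of what this means for a consumer, assuming the pyarrow import path:

```python
import pyarrow as pa
from PIL import Image

im = Image.new("L", (8, 8), 7)
arr = pa.array(im)  # the arrow array now holds a reference to the core storage
del im              # dropping the Python image does not free the pixel buffer
print(arr[0])       # still valid; storage is freed once the release callback runs
```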
Outstanding Questions
For consumers of data -- what's the most useful format?
- Flat array: arr[(y*width + x)*4 + channel]?
- or Fixed pixel array: arr[y*width + x][channel]? (a sketch of the index arithmetic follows this list)
- Would it make sense to embed this into a set of FixedArrays that are a line length: arr[y][x][channel]?
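For concreteness, a small sketch of the index arithmetic each option implies (the helper names are hypothetical):

```python
# Hypothetical index helpers comparing the addressing schemes above.
def flat_index(x, y, channel, width, channels=4):
    return (y * width + x) * channels + channel

def pixel_index(x, y, width):
    return y * width + x  # then index [channel] into the 4-element pixel list

# For the per-line variant, arr[y][x][channel] needs no arithmetic at all.
assert flat_index(3, 5, 2, width=100) == pixel_index(3, 5, width=100) * 4 + 2
```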
The Variable-size Binary View Layout supports multiple data buffers, though it seems like that's designed more for a list of strings, so I'm not sure how it would handle image data.
I don't see where a variable length structure would really gain us anything -- We'd have to construct an offset buffer, we'd lose actual types, and we still wouldn't be able to splice multiple allocation blocks together.
Well, like I said, I'm not sure how it would handle image data. I just noticed that that seems to be the only way to provide multiple data buffers. Arrow requiring all data to be in a single contiguous buffer just seems absurd to me.
It looks like PyArrow has a way to handle that: https://arrow.apache.org/docs/python/data.html#tables
Also, it might not be efficient, but there's a way to convert a NumPy array to an Arrow array. Since Pillow already supports NumPy, this might be an easy way to get something working before doing things in C to make it faster.
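A rough sketch of that NumPy route (it copies the pixel data, but needs no new C code):

```python
import numpy as np
import pyarrow as pa
from PIL import Image

im = Image.new("RGBA", (100, 200))
np_pixels = np.asarray(im)              # copies pixels; shape (200, 100, 4), uint8
flat = pa.array(np_pixels.reshape(-1))  # 1-D uint8 arrow array
nested = pa.FixedSizeListArray.from_arrays(flat, 4)  # 4 values per pixel
```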
@Yay295 I think from a utility point of view, we'd want to be exposing band level values. Binary chunks aren't going to be nearly as useful if they have to be interpreted. There are also some alignment issues that would come from that, at least for large binaries (64 byte boundaries). It also wouldn't solve the core issue of the storage needing to be contiguous.
At the moment, the np array calls require a memory copy, e.g. a tobytes call into a buffer that's then shared. The trouble here is that the memory copy is only required for the biggest images, which is kind of the wrong way to go. They'd already work if they were allocated using imaging._new_block().
It looks like what PyArrow is doing with the table is effectively the __arrow_c_stream__ which returns a sequence of arrow arrays, and copies them into a single arrow array for further export. It looks like the stream and array interfaces are effectively interchangeable, so we can implement one or both of them.
Would there ever be a future where we might account for chroma subsampling in ImagingMemoryInstance? If so, I imagine we might also use a null arrow_band_format for that?
I'd think the best way to accomplish that would be with planar image storage. My understanding of subsampling is that the resolution of one of the channels is effectively 1/2 or 1/4 of the resolution of the other bands. If we did this with planar storage, chroma would just be a uint8 image with 1/4 of the pixels.
Alternately, it could be stored as a null mapping in the validity buffer (which we're not currently handling, but which would probably be appropriate for the two- and three-channel image modes (PA/LA/RGB/HSV)). For subsampling, we could null out every nth item in a particular channel.
I think the first approach might be complicated a bit for 10- and 12-bit images (or maybe not, besides the fact that it wouldn't be a uint8 image). In case it is at all useful or relevant: libavutil in ffmpeg uses two structs, AVPixFmtDescriptor and AVComponentDescriptor (see pixdesc.h and pixdesc.c), to describe the various pixel storage formats it supports.
For multi-channel images (assuming each channel has the same data type and dimensions) you could represent that as an array with type Fixed Shape Tensor.
I've just put in a comment on that in here: https://github.com/apache/arrow/issues/43831#issuecomment-2318186432 -- what I don't see in a tensor is how to represent that in the PyCapsule interface, unless it's a nested set of fixed-size list (+w).
what I don't see in a tensor is how to represent that in the PyCapsule interface, unless it's a nested set of fixed-size list (+w).
Yeah that's it. Plus extra extension metadata on the field
PR #8330 has been updated to read an arrow array as a pillow image.
Now that #8330 has been merged, would you mind restating what is left to do in this issue?
I think it's this, at least in part:
While our core storage is generally compatible with this layout, there are three issues:
- The block allocator in ImagingAllocateArray packs a number of scanlines into 16 MB blocks, leaving empty space at the end of each block. This limits the exportable array to images that fit in a single 16 MB block. This is not an issue with the single-chunk ImagingAllocateBlock, which allocates the image in one chunk. (Note: these are the blocks used by the array allocator; arrow arrays fully work with the block allocator. Naming is hard.) It may be possible to work around this with the streaming interface.
- Some modes have line length padding (BGR;15, BGR;24), and will not work without copying.
- Some modes have ignored pixel bands (LA/PA). This is a documentation issue for consumers.
I think this is really what I'm thinking of -- the currently unsupported features: https://github.com/python-pillow/Pillow/blob/main/docs/reference/arrow_support.rst#unsupported-features
- A streaming interface might be useful for images allocated using the arena allocator.
- The line padding modes will never work with arrow.
It will really depend on how people use this and the feedback we get.
When reading the return value of __arrow_c_array__(), am I seeing it correctly that there is no way to get the format of the image?
An arrow rust library tells me that the image is an array of u32s (FixedSizeList(Field { name: "pixel", data_type: UInt8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 4))
But I can't distinguish between RGB (exported as RGBX) and RGBA.
The pixel name seems to be written in: https://github.com/python-pillow/Pillow/blob/8ccdc399df1254c89bdb4e8fda6d6daf98943ab6/src/libImaging/Arrow.c#L120
band_names seems to contain the format of the data: https://github.com/python-pillow/Pillow/blob/8ccdc399df1254c89bdb4e8fda6d6daf98943ab6/src/libImaging/Storage.c#L149
But that seems to be only used if im->bands == 1.
Not sure what the solution would be. But the result of __arrow_c_array__ should contain information to distinguish the mode of the image.
@Joshix-1 That's definitely an issue that I'm aware of. We do have the band names, and they were added for that purpose, but I wasn't clear on what the standards are for encoding that sort of metadata.
Does the software you're using have expectations about the metadata format? Are you aware of any standards there?
I don't know if there is a standard way to do that. For me it would be enough if instead of "pixel" it said "RGBX" or "RGBA". But if there is a standard to encode that information, that would be better.
diff --git a/src/libImaging/Arrow.c b/src/libImaging/Arrow.c
index ccafe33b..38f6fa40 100644
--- a/src/libImaging/Arrow.c
+++ b/src/libImaging/Arrow.c
@@ -117,7 +117,8 @@ export_imaging_schema(Imaging im, struct ArrowSchema *schema) {
schema->n_children = 1;
schema->children = calloc(1, sizeof(struct ArrowSchema *));
schema->children[0] = (struct ArrowSchema *)calloc(1, sizeof(struct ArrowSchema));
- retval = export_named_type(schema->children[0], im->arrow_band_format, "pixel");
+ char name[5] = {im->band_names[0][0], im->band_names[1][0], im->band_names[2][0], im->band_names[3][0], '\0'};
+ retval = export_named_type(schema->children[0], im->arrow_band_format, name);
if (retval != 0) {
free(schema->children[0]);
free(schema->children);
or maybe with support for longer band_names
diff --git a/src/libImaging/Arrow.c b/src/libImaging/Arrow.c
index ccafe33b..2f147935 100644
--- a/src/libImaging/Arrow.c
+++ b/src/libImaging/Arrow.c
@@ -117,7 +117,13 @@ export_imaging_schema(Imaging im, struct ArrowSchema *schema) {
schema->n_children = 1;
schema->children = calloc(1, sizeof(struct ArrowSchema *));
schema->children[0] = (struct ArrowSchema *)calloc(1, sizeof(struct ArrowSchema));
- retval = export_named_type(schema->children[0], im->arrow_band_format, "pixel");
+ char name[9] = {0};
+ size_t name_len = 0;
+ for (size_t i = 0; i < 4; ++i) {
+ strcpy(name + name_len, im->band_names[i]);
+ name_len += strlen(im->band_names[i]);
+ }
+ retval = export_named_type(schema->children[0], im->arrow_band_format, name);
if (retval != 0) {
free(schema->children[0]);
free(schema->children);
works for me personally. Not sure if someone would require the name to be "pixel" instead of a more useful "RGBX" or "RGBA". (I'm not a C programmer, so I'm not sure if the diffs are a good idea, and I didn't test either of them with YCbCr.)
I'm not aware of any standards for encoding image storage formats, but (as I mentioned in an earlier comment) I do think the AVPixFmtDescriptor structs in ffmpeg's libavutil are a very useful reference for implementing one -- not as something Pillow ought to copy exactly, but because the properties it uses are the same as what any library would need to exhaustively describe the possible ways that image data can be stored in memory. These include (roughly sketched in code after the list):
- the number of channels for each pixel
- whether the data is planar
- whether or not the pixel format uses floating point values
- endianness
- the presence of an alpha channel
- chroma subsampling information
- whether the components are XYZ, YUV, or RGB-like, along with an array of structs that encode the storage characteristics of each channel (e.g. the bit depth, for planar data the mapping of the band to the plane, or else the bitshift required to get the pixel data for a channel, etc.)
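A hypothetical Python sketch of such a descriptor, just to make the properties above concrete (all names are illustrative, not ffmpeg's or Pillow's API):

```python
from dataclasses import dataclass, field

@dataclass
class ComponentDescriptor:
    # Per-channel storage info, loosely modelled on AVComponentDescriptor.
    plane: int      # which plane holds this component (for planar data)
    depth: int      # bit depth of the component
    shift: int = 0  # bit shift needed to extract the component from packed data

@dataclass
class PixelFormatDescriptor:
    # Whole-format info, loosely modelled on AVPixFmtDescriptor.
    name: str
    nb_components: int
    planar: bool
    is_float: bool
    big_endian: bool
    has_alpha: bool
    log2_chroma_w: int  # horizontal chroma subsampling factor (log2)
    log2_chroma_h: int  # vertical chroma subsampling factor (log2)
    components: list = field(default_factory=list)

rgba = PixelFormatDescriptor(
    "RGBA", 4, planar=False, is_float=False, big_endian=False, has_alpha=True,
    log2_chroma_w=0, log2_chroma_h=0,
    components=[ComponentDescriptor(plane=0, depth=8) for _ in range(4)],
)
```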
Arrow provides a space for arbitrary key-value metadata on each Field. This is often used by Arrow extension types for the ARROW:extension:name and ARROW:extension:metadata entries of that metadata; however, you can also put your own metadata in there if you'd like.
As an example you could have "pillow": [encoded json blob of image information] on the field metadata exposed through the pycapsule interface, so that other consumers aware of pillow could explicitly look for that metadata.
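A minimal pyarrow sketch of that suggestion, using a hypothetical "pillow" key and an illustrative payload:

```python
import json
import pyarrow as pa

band_info = json.dumps({"mode": "RGBA", "bands": ["R", "G", "B", "A"]})
field = pa.field(
    "image",
    pa.fixed_size_list(pa.field("pixel", pa.uint8(), nullable=False), 4),
    metadata={"pillow": band_info},  # application-specific field metadata
)
print(field.metadata)  # {b'pillow': b'{"mode": "RGBA", ...}'}
```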
@kylebarron So, in the ArrowSchema.metadata, where there's a key-value store, you'd recommend a key of pillow and metadata of a json payload? Or ARROW:extension:name: 'pillow' and ARROW:extension:metadata: json blob?
That depends on whether you want to add implementation-specific metadata or create a whole new logical type. I figure the former would be narrower and you could just document that key in your arrow interface documentation.
After speaking with @wiredfool in person I agree FixedShapeTensorArray is indeed the best option for this use case, adding metadata under some sort of "app_metadata" namespace.
So: Assuming that we have a 100x200 (w*h) CMYK image, the extension schema metadata for the top level would be:
arrow.fixed_shape_tensor: {"shape": [100,200,4], "dim_names": ["W", "H", "C"]}
edit
This may not be correct -- I think that this is assuming that we'll have an array of images, so there's an implied n=1 in the NCHW format. However, in our implementation, we're returning a length = w*h, so length is in pixels. I think that to do this properly with respect to the fixed shape tensor, we'd wind up wrapping our current image array with an additional one-element array for that N dimension.
/edit
But this still isn't getting the band metadata. The only mention of color in the spec is:
Example with uniform_shape metadata for a set of color images with fixed height, variable width and three color channels: { "dim_names": ["H", "W", "C"], "uniform_shape": [400, null, 3] }
So again, we're into defining a standard. One possibility would be:
{ "dim_names": ["H", "W", ["C", "M", "Y", "K"]], ... }
However that's going to be somewhat iffy for backwards compatibility. Alternately, a separate extension for specific image metadata could be added on to the metadata list.
image: {"bands": ["C", "M", "Y", "K"]}
note for implementing -- the encoding of the metadata is
int32: number of key/value pairs (noted N below)
int32: byte length of key 0
key 0 (not null-terminated)
int32: byte length of value 0
value 0 (not null-terminated)
...
int32: byte length of key N - 1
key N - 1 (not null-terminated)
int32: byte length of value N - 1
value N - 1 (not null-terminated)
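A small Python sketch of building that metadata blob (the helper name and key are hypothetical; the layout follows the description above):

```python
import struct

def encode_c_metadata(pairs):
    """Pack key/value metadata: an int32 pair count, then length-prefixed
    (not null-terminated) keys and values, in native endianness."""
    out = [struct.pack("=i", len(pairs))]
    for key, value in pairs.items():
        k, v = key.encode("utf-8"), value.encode("utf-8")
        out += [struct.pack("=i", len(k)), k, struct.pack("=i", len(v)), v]
    return b"".join(out)

blob = encode_c_metadata({"image": '{"bands": ["C", "M", "Y", "K"]}'})
```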
Another option for additional metadata for tensors:
{ "dim_names": ["H", "W", "C"], "application": {"dim_map":{"C": ["R", "G", "B"]}}}