zarr-python
zarr-python copied to clipboard
ArrowRecordBatchCodec and vlen string support
The discussion in https://github.com/zarr-developers/zeps/pull/47 got me thinking: what if, instead of turning numpy arrays into bytes, we turn them into self-describing Arrow Record Batches and serialize them using the Arrow IPC format.
This would be a new type of Array -> Bytes codec. The beautiful thing about this is that it gives us variable-length string encoding for free (as well as potentially many other benefits) -- xref https://github.com/zarr-developers/zarr-specs/issues/83.
This PR is a proof of concept that this is feasible and in fact very easy.
There is a lot more to explore here, but I thought I would just through this up for discussion.
TODO:
- [x] Add unit tests and/or doctests in docstrings
- [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
- [ ] New/modified features documented in docs/tutorial.rst
- [ ] Changes documented in docs/release.rst
- [ ] GitHub Actions have all passed
- [ ] Test coverage is 100% (Codecov passes)
This experiment also suggests another interesting possibility: returning Arrow Arrays and Tables from a Zarr Array or Group. If the Zarr Arrays are all 1D, they can be represented as Arrow Arrays all the way through, and there are potentially opportunities to reduce memory copies. We could have ArrowBuffer / ArrowArrayBuffer types.
This sounds really interesting and potentially very powerful @rabernat!
Would you mind commenting on the implications of a pyarrow dependency?
Would you mind commenting on the implications of a pyarrow dependency?
I feel like it is becoming as ubiquitous as numpy in the ecosystem, so I don't consider this a major blocker. Or it could be an optional dependency if you want to read data encoded this way. But I'd be curious to hear opinions on that.
There's lots of feedback in https://github.com/pandas-dev/pandas/issues/54466 on pandas adopting pyarrow as a required dependency. The primary concern raised is the size of the package, especially in serverless contexts (though it seems like AWS Lambda has some built-in support to make this not so much of an issue?).
There's some work being done in pyarrow to make core pieces available without having to bring in everything.