Remove pyarrow as hard dependency
Is your feature request related to a problem? Please describe.
pyarrow is a massive, monolithic dependency. It can be hard to install in some environments, and it can't currently be installed in Pyodide. Getting it to work in Pyodide is certainly a monumental effort, but I think it would be valuable for lonboard to wean itself off pyarrow.
The core enabling factor here is the Arrow PyCapsule Interface. It allows Python Arrow libraries to exchange Arrow data at the C level with zero copies. This means that we can interoperate at no cost with any user who's already using pyarrow, without being required to use pyarrow ourselves. I've been promoting its use throughout the Python Arrow ecosystem (https://github.com/apache/arrow/issues/39195#issuecomment-2245718008), and I hope it grows into something as core to tabular data processing as the buffer protocol is to numpy.
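To make the idea concrete, here is a minimal sketch of how a library could recognize Arrow-compatible input by duck typing on the interface's dunder methods, without importing pyarrow at all. The method names (`__arrow_c_stream__`, `__arrow_c_array__`, `__arrow_c_schema__`) come from the PyCapsule Interface spec; the helper `arrow_export_kind` and the `FakeTable` class are hypothetical and exist only for illustration.

```python
from typing import Any, Optional


def arrow_export_kind(obj: Any) -> Optional[str]:
    """Report which Arrow PyCapsule export an object offers, if any.

    Hypothetical helper: it only checks for the dunder methods defined by
    the Arrow PyCapsule Interface and never imports an Arrow library.
    """
    if hasattr(obj, "__arrow_c_stream__"):
        return "stream"  # table-like or chunked data
    if hasattr(obj, "__arrow_c_array__"):
        return "array"  # a single contiguous array or record batch
    if hasattr(obj, "__arrow_c_schema__"):
        return "schema"  # a schema, field, or data type
    return None


class FakeTable:
    """Stand-in for any producer (pyarrow, arro3, ...) exposing a stream."""

    def __arrow_c_stream__(self, requested_schema=None):
        # A real producer would return a PyCapsule named "arrow_array_stream".
        raise NotImplementedError("illustration only")


print(arrow_export_kind(FakeTable()))
print(arrow_export_kind(object()))
```

A consumer written this way accepts a pyarrow table, an arro3 table, or anything else that implements the interface, which is why the producer's library never has to be a dependency of the consumer.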
As part of working to build the ecosystem, I created arro3, a new, very minimal Python Arrow implementation that wraps the Rust Arrow implementation.
I think that it should be possible to swap out pyarrow for arro3, which is about 1% of the normal pyarrow installation size.
It's also good for the ecosystem if Lonboard demonstrates the benefits of modular Arrow libraries in Python.
Describe the solution you'd like
We'll still require pyarrow for GeoPandas/Pandas interop. pyarrow implements pyarrow.Table.from_pandas, and that's not something I want to even think about replicating.
But aside from that, pretty much everything is doable in arro3 and geoarrow-rust.
- [ ] pa.Table.from_arrays
  - [ ] Construct a table from named columns
- [ ] Write Parquet with specified compression and compression level
- [x] Access column from table, positionally
- [x] Access Schema from table and field from schema, positionally
- [x] Access individual arrays from a chunked array
- [ ] arr.flatten() and to_numpy()
- [x] Access metadata on field
- [ ] Construct a FixedSizeListArray from numpy coords and a list size (this is a bit harder, but is also doing a geoarrow operation that I should be able to do in geoarrow-rs anyways)
- [x] Constructor for ChunkedArray from a Python iterable of array objects
- [x] Access field metadata
CLI only:
Other notes:
- Add numpy as direct dependency
Primarily closed by #582
This will be closed with https://github.com/developmentseed/lonboard/pull/598 and #601