Feature request: Support for FixedSizeList
Along with @stuartlynn, I've been working on https://github.com/kylebarron/geopolars to extend polars to add support for geospatial data, much like GeoPandas extends Pandas (see also polars issues #1830, #3208).
With Arrow, the whole ecosystem benefits when a common memory layout is used. There's been a lot of work in https://github.com/geopandas/geo-arrow-spec to define common ways to store vector geospatial data (points, lines, polygons, etc) in Arrow memory. Right now, two alternate layouts are defined in the spec:
- geoarrow.wkb: use a Binary column where geometries are stored in Well-Known Binary format. WKB is common in the geo-world, but this is a less performant storage format; coordinates can't be accessed with zero copy and parsing is O(n).
- An Arrow-native encoding using a combination of List and FixedSizeList (spec). This is more performant because geometry access to any coordinate is possible in O(1) time and zero-copy access is possible. For example:
  - 2D Points: geoarrow.point: FixedSizeList<f64>[2]
  - 2D Lines: geoarrow.linestring: List<FixedSizeList<f64>[2]>
  - 2D MultiPolygons: geoarrow.multipolygon: List<List<List<FixedSizeList<f64>[2]>>>
Therefore, to support the current version of the geo-arrow-spec, FixedSizeList would be a necessary data type.
Arrow2 supports FixedSizeList. Beyond that, I don't know the polars codebase well enough to know how much work it would be to add and support FixedSizeList. Would it be possible to reuse existing List support for FixedSizeList?
Thoughts? I would be open to submitting a PR for this as well.
Appendix
Current Behavior
When trying to load this Arrow file (cities-geoarrow.arrow.zip), with schema:
pyarrow.Table
name: string
geometry: fixed_size_list<xy: double not null>[2]
child 0, xy: double not null
into Polars using table = pyarrow.feather.read_table(path); polars.from_arrow(table), it errors with:
Cannot create polars series from FixedSizeList(Field { name: "xy", data_type: Float64, is_nullable: false, metadata: {} }, 2) type
Example files:
- cities-geoarrow.arrow.zip: dataset with Point geometries in a fixed_size_list<xy: double not null>[2] column
- nationalpark.arrow.zip: dataset with MultiPolygon geometries in a column:
  geometry: list<item: list<item: list<item: fixed_size_list<xy: double not null>[2]>>>
    child 0, item: list<item: list<item: fixed_size_list<xy: double not null>[2]>>
      child 0, item: list<item: fixed_size_list<xy: double not null>[2]>
        child 0, item: fixed_size_list<xy: double not null>[2]
          child 0, xy: double not null
I have been thinking about this, and it is something that might fit in the scope of polars eventually. It is a lot of work with currently not much benefit with regard to the default list type. Eventually I'd like geotypes under the polars umbrella, but I first want to mature the default use case and not fight a battle on two fronts.
Can Structs be used instead of FixedSizeLists? For 2-3 data points, I'm wondering if the list properties are relevant.
On our side that should work if we were to implement geo types.
My goal is to be compliant with the GeoArrow specification, which is still in development. At this point, the spec defines a nested list format where the inner array is a FixedSizeList. To the geo world, this is kind of the "best of both worlds" because the logical layout matches the coordinates array from GeoJSON while the physical Arrow layout is a flat array of coords.
implement geo types
My preference is to not use a polars Object type Series containing rust structs defined in the geo crate, because this has the usual non-Arrow drawbacks including serialization and deserialization costs every time data is loaded or shared with a program outside of polars.
Today, geopolars still has an extra copy from Arrow data into geo structs, but my long-term goal is to work with the geo crate to restructure their algorithms around geometry traits, so that geometry data in Arrow can be accessed zero-copy (see https://github.com/georust/geo/discussions/838, https://github.com/georust/geo/issues/67).
Can Structs be used instead of FixedSizeLists?
To clarify, are you referring to rust structs or Arrow structs? Early on in GeoArrow discussions, an Arrow Struct format was proposed, but this was decided against because it is nearly identical to the physical layout of the nested list approach, while lacking the easier logical API of the nested lists.
It is a lot of work with currently not much benefit with regard to the default list type
I'm sympathetic to the extra dev overhead of new data types. I wonder whether it would be possible to add some sort of minimal "container" data type that just wraps Arrow arrays but doesn't have full polars support otherwise. In the current approach of geopolars, we don't need or use any polars-specific methods on the geometry column (but the point is for users to access polars operations on all their non-geometry columns); we just access the underlying arrow data, pass it to an algorithm, and create a new series. So all we need is a way to store this more "custom" column data layout in a column alongside the rest of a polars DataFrame.
For 2-3 data points, I'm wondering if the list properties are relevant
Not sure I understand. For a geometry column of type Point, each row of the geometry column would contain only 2 or 3 numbers. But for a geometry column of type Polygon, each row could contain thousands of vertices. E.g. for a GeoDataFrame representing countries where each row is one country and the geometry column includes the country's boundary, a single unsimplified geometry could include tens of thousands of vertices.
There are more requests for fixedsizelist + extension types so that we can deal with tensor types. I want to add a fixedsizelist type as a minimal type: one that can be put into a DataFrame and supports minimal aggregations and take functionality. That should allow third parties to work with more of the arrow spec + polars.
As a heads up: the GeoArrow community is reconsidering using a struct type instead of FixedSizeList for the inner coordinate format. If the only reason to implement FixedSizeList was for the geo use case, it might be worth holding off for now until that discussion is resolved 🙂 (I understand you might want to implement FixedSizeList anyways to support tensors)
@ritchie46 is there any news on supporting FixedSizeList in Polars? What would be involved in adding support?
@ritchie46 is there any news on supporting FixedSizeList in Polars? What would be involved in adding support?
It would need a PR similar to this one. https://github.com/pola-rs/polars/pull/5122
I would accept such a PR. It's just a few hours of work.
There are more requests for fixedsizelist + extension types so that we can deal with tensor types
I'd absolutely love to be able to use tensor types within Polars! (I'm currently using xarray, which is awesome but uses Pandas + Dask).
Closing this given https://github.com/pola-rs/polars/pull/8943. Created https://github.com/pola-rs/polars/issues/9112 to track Arrow extension types if there's interest in that as well.