Feature request: Support for FixedSizeList

Open kylebarron opened this issue 2 years ago • 10 comments

Along with @stuartlynn, I've been working on https://github.com/kylebarron/geopolars to extend polars to add support for geospatial data, much like GeoPandas extends Pandas (see also polars issues #1830, #3208).

With Arrow, the whole ecosystem benefits when a common memory layout is used. There's been a lot of work in https://github.com/geopandas/geo-arrow-spec to define common ways to store vector geospatial data (points, lines, polygons, etc) in Arrow memory. Right now, two alternate layouts are defined in the spec:

  • geoarrow.wkb: use a Binary column where geometries are stored in Well-Known Binary format. WKB is common in the geo-world, but this is a less performant storage format; coordinates can't be accessed with zero copy and parsing is O(n).
  • An Arrow-native encoding using a combination of List and FixedSizeList (spec). This is more performant because any coordinate can be accessed in O(1) time and with zero copy. For example:
    • 2D Points: geoarrow.point: FixedSizeList<f64>[2]
    • 2D Lines: geoarrow.linestring: List<FixedSizeList<f64>[2]>
    • 2D MultiPolygons: geoarrow.multipolygon: List<List<List<FixedSizeList<f64>[2]>>>
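To make the contrast between the two layouts concrete, here is a minimal pure-Python sketch using only the standard library (it is not the pyarrow or polars API). It assumes little-endian WKB 2D points (one byte-order byte, a uint32 geometry type of 1, then two float64 values) and models a FixedSizeList<f64>[2] column as one flat array of doubles. The helper names `encode_wkb_point`, `decode_wkb_point`, and `point_at` are hypothetical and exist only for illustration.

```python
import struct
from array import array

# Assumed WKB layout for a little-endian 2D point:
# byte order (1 = little-endian), geometry type (1 = Point), x, y.
WKB_POINT = struct.Struct("<BIdd")

def encode_wkb_point(x, y):
    """Serialize one point to WKB bytes (hypothetical helper)."""
    return WKB_POINT.pack(1, 1, x, y)

def decode_wkb_point(buf):
    """Every read has to parse the header before reaching the coordinates."""
    byte_order, geom_type, x, y = WKB_POINT.unpack(buf)
    assert byte_order == 1 and geom_type == 1
    return (x, y)

points = [(4.9, 52.37), (2.35, 48.86), (13.4, 52.52)]

# geoarrow.wkb style: one opaque binary value per row.
wkb_column = [encode_wkb_point(x, y) for x, y in points]

# FixedSizeList<f64>[2] style: a single flat buffer x0, y0, x1, y1, ...
flat_coords = array("d", [c for p in points for c in p])

def point_at(coords, i, size=2):
    """Row i starts at index size * i; no parsing step is needed."""
    return tuple(coords[size * i : size * i + size])
```

With the flat buffer, locating row `i` is plain index arithmetic, which is where the O(1) access comes from. The WKB column has to be decoded value by value before any coordinate can be read.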

Therefore, to support the current version of the geo-arrow-spec, FixedSizeList would be a necessary data type.

Arrow2 supports FixedSizeList. Beyond that, I don't know the polars codebase well enough to know how much work it would be to add and support FixedSizeList. Would it be possible to reuse existing List support for FixedSizeList?

Thoughts? I would be open to submitting a PR for this as well.

Appendix

Current Behavior

When trying to load this Arrow file (cities-geoarrow.arrow.zip), with schema:

pyarrow.Table
name: string
geometry: fixed_size_list<xy: double not null>[2]
  child 0, xy: double not null

into Polars using table = pyarrow.feather.read_table(path); polars.from_arrow(table) it errors with:

Cannot create polars series from FixedSizeList(Field { name: "xy", data_type: Float64, is_nullable: false, metadata: {} }, 2) type

Example files:

  • cities-geoarrow.arrow.zip: dataset with Point geometries in a fixed_size_list<xy: double not null>[2] column
  • nationalpark.arrow.zip: dataset with MultiPolygon geometries in a column:
    geometry: list<item: list<item: list<item: fixed_size_list<xy: double not null>[2]>>>
      child 0, item: list<item: list<item: fixed_size_list<xy: double not null>[2]>>
          child 0, item: list<item: fixed_size_list<xy: double not null>[2]>
              child 0, item: fixed_size_list<xy: double not null>[2]
                  child 0, xy: double not null
    

kylebarron avatar Jul 14 '22 05:07 kylebarron

I have been thinking about this, and it is something that might fit in the scope of polars eventually. It is a lot of work with currently not much benefit with regard to the default list type. Eventually I'd like geotypes under the polars umbrella, but I first want to mature the default use case rather than fight a battle on two fronts.

ritchie46 avatar Jul 19 '22 06:07 ritchie46

Can Structs be used instead of FixedSizeLists? For 2-3 data points, I'm wondering if the list properties are relevant.

cjermain avatar Jul 20 '22 11:07 cjermain

On our side that should work if we were to implement geo types.

ritchie46 avatar Jul 20 '22 15:07 ritchie46

My goal is to be compliant with the GeoArrow specification in development. At this point, the spec defines a nested list format where the inner array is a FixedSizeList. To the geo world, this is kind of the "best of both worlds" because the logical layout matches the coordinates array from GeoJSON while the physical Arrow layout is a flat array of coords.
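A rough illustration of that logical-versus-physical split, in plain Python rather than any Arrow library: a List<FixedSizeList<f64>[2]> column is modeled here as one flat coordinate buffer plus a list of offsets counted in vertices. The function `linestring_at` and the sample values are assumptions made up for this sketch.

```python
from array import array

# Physical layout: one flat buffer of interleaved x, y values ...
coords = array("d", [
    0.0, 0.0, 1.0, 0.0, 1.0, 1.0,  # linestring 0: three vertices
    5.0, 5.0, 6.0, 7.0,            # linestring 1: two vertices
])
# ... plus list offsets counted in vertices, not in doubles.
offsets = [0, 3, 5]

def linestring_at(coords, offsets, i, size=2):
    """Rebuild the GeoJSON-like nested coordinates for row i (hypothetical helper)."""
    start, stop = offsets[i], offsets[i + 1]
    return [list(coords[size * v : size * v + size]) for v in range(start, stop)]
```

Calling `linestring_at(coords, offsets, 0)` yields a nested list shaped like a GeoJSON `coordinates` array, while the data itself stays in one contiguous buffer.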

implement geo types

My preference is to not use a polars Object type Series containing rust structs defined in the geo crate, because this has the usual non-Arrow drawbacks including serialization and deserialization costs every time data is loaded or shared with a program outside of polars.

Today, geopolars still has an extra copy from Arrow data into geo structs, but my long-term goal is to work with the geo crate to restructure their algorithms around geometry traits, so that geometry data in Arrow can be accessed zero-copy (see https://github.com/georust/geo/discussions/838, https://github.com/georust/geo/issues/67).

Can Structs be used instead of FixedSizeLists?

To clarify, are you referring to rust structs or Arrow structs? Early on in GeoArrow discussions, an Arrow Struct format was proposed, but this was decided against because it is nearly identical to the physical layout of the nested list approach, while lacking the easier logical API of the nested lists.

It is a lot of work with currently not much benefit with regard to the default list type

I'm sympathetic to the extra dev overhead of new data types. I wonder whether it would be possible to add some sort of minimal "container" data type that just wraps Arrow arrays but doesn't have full polars support otherwise. In the current approach of geopolars, we don't need or use any polars-specific methods on the geometry column (but the point is for users to access polars operations on all their non-geometry columns); we just access the underlying arrow data, pass it to an algorithm, and create a new series. So all we need is a way to store this more "custom" column data layout in a column alongside the rest of a polars DataFrame.

kylebarron avatar Jul 20 '22 16:07 kylebarron

For 2-3 data points, I'm wondering if the list properties are relevant

Not sure I understand. For a geometry column of type Point, each row of the geometry column would contain only 2 or 3 numbers. But for a geometry column of type Polygon, each row could contain thousands of vertices. E.g. for a GeoDataFrame representing countries where each row is one country and the geometry column includes the country's boundary, a single unsimplified geometry could include tens of thousands of vertices.

kylebarron avatar Jul 20 '22 16:07 kylebarron

There are more requests for fixedsizelist + extension types so that we can deal with tensor types. I want to add a fixedsizelist type as a minimal type: one that can be put into a DataFrame and supports minimal aggregations and take functionality. That should allow third parties to work with more of the arrow spec + polars.

ritchie46 avatar Aug 11 '22 19:08 ritchie46

As a heads up: the GeoArrow community is reconsidering using a struct type instead of FixedSizeList for the inner coordinate format. If the only reason to implement FixedSizeList was for the geo use case, it might be worth holding off for now until that discussion is resolved 🙂 (I understand you might want to implement FixedSizeList anyways to support tensors)

kylebarron avatar Sep 23 '22 11:09 kylebarron

@ritchie46 is there any news on supporting FixedSizeList in Polars? What would be involved in adding support?

jondo2010 avatar Oct 11 '22 10:10 jondo2010

@ritchie46 is there any news on supporting FixedSizeList in Polars? What would be involved in adding support?

It would need a PR similar to this one. https://github.com/pola-rs/polars/pull/5122

I would accept such a PR. It's just a few hours of work.

ritchie46 avatar Oct 11 '22 14:10 ritchie46

There are more requests for fixedsizelist + extension types so that we can deal with tensor types

I'd absolutely love to be able to use tensor types within Polars! (I'm currently using xarray, which is awesome but uses Pandas + Dask).

JackKelly avatar Jan 10 '23 19:01 JackKelly

Closing this given https://github.com/pola-rs/polars/pull/8943. Created https://github.com/pola-rs/polars/issues/9112 to track Arrow extension types if there's interest in that as well.

kylebarron avatar May 29 '23 21:05 kylebarron