h3ronpy icon indicating copy to clipboard operation
h3ronpy copied to clipboard

WIP: array classes

Open kylebarron opened this issue 1 year ago • 7 comments

Just putting up some scratch work for discussion re https://github.com/nmandery/h3ronpy/issues/55

Change list

  • Export classes like CellArray, DirectedEdgeArray, etc (names can be changed) that represent arrow arrays of h3 arrays.
    • This means that in theory we don't need to require pyarrow or arro3 or polars, but rather the user can use whatever python Arrow package they want.
  • Implement the PyCapsule Interface directly on these arrays, so you can pass a CellArray into arro3.core.Array or pyarrow.Array or polars.Series.

kylebarron avatar Oct 08 '24 18:10 kylebarron

@nmandery can you allow CI on this PR, or, preferably, just turn off CI approval in the settings? image

kylebarron avatar Oct 08 '24 20:10 kylebarron

It should now work without manual approval.

nmandery avatar Oct 09 '24 04:10 nmandery

Ok great now the rust side compiles, and we just have to fix up the python side

kylebarron avatar Oct 15 '24 15:10 kylebarron

The dependency updates were:

  • CI was failing because it was trying to install the wrong pip wheel. I figured better to do dependency updates than change how CI is laid out.
  • To restore abi3 wheels, I had to update to pyo3-arrow 0.5.1, which required numpy 0.22, which required pyo3 0.22. That then required ndarray 0.16 which required a bump of geo-interface and rasterh3

🙃

kylebarron avatar Oct 15 '24 16:10 kylebarron

@nmandery I'll need some feedback in terms of how open you are to breaking changes. In particular, you have separate APIs for h3ronpy.polars, h3ronpy.pandas, and h3ronpy.arrow. This is really complex and hard to work through.

Now that Polars supports the PyCapsule Interface, there's simple zero-copy conversion between the two, so you can pass any Arrow array-like or chunked array-like output into the polars.Series constructor: image

In my opinion this means that we should return Arrow output and let the user decide which library they want to use it with. We should document how easy it is to use in this way with polars/pyarrow/pandas etc, but only provide a single API for ease of maintenance.

kylebarron avatar Oct 15 '24 17:10 kylebarron

Feels free to break whatever you like. The whole concept of subpackages for the different dataframe libraries was born to provide a better integration. I suppose the PyCapsule-interface makes a lot of this unnecessary, while achieving a far better integration. Additionally I feel these different subpackages are not too easy to maintain. The only thing I would like to keep is the polars extensions to the Expr and Series namespaces.

I also would prefer sticking with arrow output. +1

nmandery avatar Oct 15 '24 17:10 nmandery

The only thing I would like to keep is the polars extensions to the Expr and Series namespaces.

👍 I definitely agree there

kylebarron avatar Oct 15 '24 20:10 kylebarron

CI failed on commit 5132e01 (#64) with:

error[E0275]: overflow evaluating the requirement `<impl GeometryTrait<T = f64> as geo_traits::geometry::GeometryTrait>::GeometryCollection<'_>: geo_traits::geometry_collection::GeometryCollectionTrait`
    |
    = help: consider increasing the recursion limit by adding a `#![recursion_limit = "256"]` attribute to your crate (`geoarrow`)
note: required for `GeometryCollectionIterator<'_, f64, <... as GeometryCollectionTrait>::ItemType<'_>, ...>` to implement `Iterator`
   --> /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/geoarrow-0.4.0-beta.1/src/geo_traits/iterator.rs:39:15
    |
37  |                   ItemType: 'a + $item_trait<T = T>,
    |                                              ----- unsatisfied trait bound introduced here
38  |                   G: $self_trait<T = T, ItemType<'a> = ItemType>,
39  |               > Iterator for $struct_name<'a, T, ItemType, G>
    |                 ^^^^^^^^     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
109 | / impl_iterator!(
110 | |     GeometryCollectionIterator,
111 | |     GeometryCollectionTrait,
112 | |     GeometryTrait,
113 | |     geometry_unchecked
114 | | );
    | |_- in this macro invocation
    = note: required for `GeometryCollectionIterator<'_, f64, <... as GeometryCollectionTrait>::ItemType<'_>, ...>` to implement `IntoIterator`
    = note: the full name for the type has been written to '/home/runner/work/h3ronpy/h3ronpy/target/x86_64-unknown-linux-gnu/release/deps/geoarrow-b79[580](https://github.com/nmandery/h3ronpy/actions/runs/11711779349/job/32621138289?pr=64#step:5:586)3b66daebd2.long-type-15103887714521775568.txt'
    = note: consider using `--verbose` to print the full type name to the console
    = note: the full name for the type has been written to '/home/runner/work/h3ronpy/h3ronpy/target/x86_64-unknown-linux-gnu/release/deps/geoarrow-b795803b66daebd2.long-type-15103887714521775568.txt'
    = note: consider using `--verbose` to print the full type name to the console
    = note: this error originates in the macro `impl_iterator` (in Nightly builds, run with -Z macro-backtrace for more info)
For more information about this error, try `rustc --explain E0275`.

which is a super annoying compiler bug (https://github.com/geoarrow/geoarrow-rs/issues/716).

Updating to the latest git of geoarrow-rs fixed compilation on my computer, even though I don't know exactly what changed to fix that bug. 🤷

kylebarron avatar Nov 06 '24 21:11 kylebarron

@nmandery is it ok with you if the raster -specific code still has a pyarrow dependency for now? It would be nice to come back to that in a follow up so that we can unblock this and minimize the size of the PR

kylebarron avatar Nov 06 '24 22:11 kylebarron

How do you want to handle these polars and pandas tests? Should we keep them and update them to use the current arrow-based APIs?

kylebarron avatar Nov 06 '24 22:11 kylebarron

Lets keep the pyarrow-dependency in the raster-part for now. It will be easier to port this later on once the main work of this PR is done.

The pandas and polars-specific tests can be removed. I only distributed the tests over the different dataframe libraries to have some test-coverage for these APIs as well. I would like to keep the polars extension tests, though.

nmandery avatar Nov 07 '24 11:11 nmandery

At least locally the tests are passing now. Hopefully CI will be green as well

kylebarron avatar Nov 12 '24 01:11 kylebarron

Great. I will look through your changes

nmandery avatar Nov 12 '24 13:11 nmandery

Thanks for this - I will try next to get around to fix the docs etc so we can release this. I suppose you are waiting to integrate this into lonboard.

nmandery avatar Nov 15 '24 11:11 nmandery

Actually @zacharydez is working on a project that I think is using h3ronpy in AWS Lambda, and this would benefit them because pyarrow is too big to put in Lambda (along with other geospatial dependencies).

I would like to integrate this into Lonboard in some way as well, but that's not urgent.

kylebarron avatar Nov 15 '24 14:11 kylebarron