pyogrio ENH: allow using open_arrow with PyCapsule protocol (without pyarrow dependency)

I think in one of the issues/PRs related to adding the Arrow read support we already touched on this topic (about returning a pointer instead of a pyarrow object, so you can also use this without a pyarrow dependency), but don't directly find it anymore. But now that Arrow has standardized the Arrow PyCapsule protocol for exposing those pointers in Python with capsules, I think it makes sense to reconsider this (https://github.com/apache/arrow/issues/39195, https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html)

cc @kylebarron

Some notes:

In the public open_arrow, I added a return_pyarrow keyword with the current default of True, which you can then set to False to allow using this without pyarrow
When enabling that, I let it return a small wrapper class _ArrowStream that just has the protocol's dunder method for now. In theory, we could make this a bit more full-featured class, to make it more similar as pyarrow's RecordBatchReader (eg making it iterable), but not sure who would actually use that
I am still returning the (meta, stream) tuple, although in this case it would be sufficient to return a single class (as we control the class and could attach the meta to the class). But maybe better to keep the return value consistent now?

Jan 25 '24 14:01 jorisvandenbossche

In https://github.com/geopandas/pyogrio/pull/206#issuecomment-1496886332 there was some discussion of whether to return a RecordBatchReader (or a subclass). I don't recall a discussion of separating it from pyarrow.

Overall this is exciting! I'm looking forward to using this from geoarrow-rs without a pyarrow dependency! (I'll make a helper that calls open_arrow with return_capsule=True under the hood)

On the topic of removing dependencies, with arrow it would be feasible to also make numpy an optional dependency, right? Though I would certainly understand if you don't want the maintenance overhead of that.

Jan 25 '24 15:01 kylebarron

On the topic of removing dependencies, with arrow it would be feasible to also make numpy an optional dependency, right? Though I would certainly understand if you don't want the maintenance overhead of that.

I think the main question is whether we use the numpy C API, because in that case it's probably more complicated to have it optional. We do have a cimport numpy as np in _io.pyx, but looking at the call sites of np.<something>, I am not sure there is actually any non-Python usage. If we only use the Python API, it might be quite straightforward to ensure we don't use it in the minimal subset of what open_arrow needs.

Jan 25 '24 15:01 jorisvandenbossche

I think the main question is whether we use the numpy C API, because in that case it's probably more complicated to have it optional.

I was thinking the opposite... is it true that importing the C API only requires numpy as a build dependency? Or does that also require it as a runtime dependency?

Numpy is only used in the non-arrow readers, right? In theory you could split the non-arrow and arrow-based implementations into two separate files where only the non-arrow implementation imports numpy?

Jan 25 '24 16:01 kylebarron

I would think that if you build with the numpy C API, then the shared library has numpy symbols it will not find when numpy is not available at runtime. I am not sure in C/C++ you can have "optional" dependencies at runtime, rather than at build time? (either your library is built with a certain dependency, or it is not?)

Jan 25 '24 16:01 jorisvandenbossche

My guess had been that those symbols would be looked up when code was called that needed, but I really don't know.

Jan 25 '24 18:01 kylebarron

It seems difficult to fully prove that the capsule works in the absence of pyarrow

It looks like you can use nanoarrow's python bindings for that, and the nanoarrow.c_array_stream() function.

Apr 09 '24 23:04 kylebarron

It looks like you can use nanoarrow's python bindings for that, and the nanoarrow.c_array_stream() function.

Indeed; that appears to work, thanks for the tip. I'm assuming we don't yet want to add a CI environment with nanoarrow but without pyarrow to test this, but as we reduce dependency on pyarrow that would be something to consider in the future.

Apr 10 '24 13:04 brendan-ward

For now you can probably just assert that the returned object is not an instance of any known pyarrow class and assert that it works with nanoarrow.c_array_stream()?

Apr 10 '24 14:04 kylebarron

OK, updated this, and changed return_pyarrow to use_pyarrow for the keyword name.

I think I have a slight preference to actually make use_pyarrow=False the default (i.e. a breaking change), so you don't have to specify a keyword when not having pyarrow installed (if we would like to have this as default eventually, better to do that now IMO)

Apr 12 '24 08:04 jorisvandenbossche

make use_pyarrow=False the default

How do you see that working out downstream, e.g., in GeoPandas? I'm assuming we'd still need to use pyarrow in read_arrow for now, right? So there wouldn't yet be a change to read_dataframe that needs to be handled from downstream.

Apr 12 '24 15:04 brendan-ward

How do you see that working out downstream, e.g., in GeoPandas?

Any pyarrow-based downstream library can convert the stream to a pyarrow by passing it to a RecordBatchReader I think

Apr 12 '24 15:04 kylebarron

Yes, it's like the docstring example I added:

with open_arrow(path, use_pyarrow=False) as source:
    meta, stream = source
    # this is the extra line you need with pyarrow=False
    reader = pa.RecordBatchReader.from_stream(stream)
    for table in reader:
        geometries = shapely.from_wkb(table[meta["geometry_name"] or "wkb_geometry"])

And of course, you can still set use_pyarrow=True as a convenience instead of writing that extra line (unless we would just do away with this keyword altogether, given it's only one line extra that users have to write to get the pyarrow object)

And indeed this is only about open_arrow, for read_arrow we continue to return a pyarrow.Table (because there we return the data as a consumed stream, so we need some Arrow library to consume it)

Apr 12 '24 16:04 jorisvandenbossche

Ok - use_pyarrow=False seems like a reasonable default and better to do sooner than later.

Apr 12 '24 16:04 brendan-ward

unless we would just do away with this keyword altogether, given it's only one line extra that users have to write to get the pyarrow object

I'm personally OK with this if we document that line of code. Presumably only relatively advanced users will be using open_arrow

No strong preference either way.

Apr 12 '24 17:04 kylebarron

Thanks @jorisvandenbossche !

Apr 17 '24 13:04 kylebarron