[Enh]: Construct DataFrame from Arrow PyCapsule object

Open jonmmease opened this issue 1 year ago • 4 comments

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

This request is aimed at using Narwhals to remove the pandas/pyarrow dependencies from VegaFusion 2.0.

Please describe the purpose of the new feature or describe the problem to solve.

The flow I'm aiming for with VegaFusion 2.0 is that I'd like to use Narwhals for basic column projection and schema inspection and then use the Arrow PyCapsule API to pass the result to Rust. Then in some cases, the Rust logic will return a new Arrow result in PyCapsule form, and it would be great to be able to use Narwhals to wrap this result using the same backend as the input.

Suggest a solution if possible.

I was picturing perhaps a constructor method in the same family as from_dict, accepting an arrow PyCapsule object.

nw.from_arrow_capsule(cap, native_namespace=nw.get_native_namespace(input_df))

cc @kylebarron for all things Arrow PyCapsule 😄

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

No response

jonmmease, Oct 09 '24 16:10

I think there can be a limited use case for passing around raw capsules, but the more general API would be to export an object from your Rust code with an __arrow_c_stream__ dunder method, which then could be imported into narwhals using its existing PyCapsule Interface support. This also reduces the user's reliance on narwhals, and lets them use any Arrow-compatible library of their choosing.

In my own libraries, when I control both sides of the connection, I sometimes do have a from_arrow_capsule method. This can be useful when I want to ensure the user only has one version of arro3.core in their environment, and when I'm using arro3 as the transmission to the user's desired choice of library.

kylebarron, Oct 09 '24 17:10
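
To illustrate the exporter approach kylebarron describes above, here is a minimal sketch of the duck-typed contract: any object with an `__arrow_c_stream__` method can be handed to an Arrow-compatible consumer. The names `ArrowStreamExportable`, `RustResult`, and `accepts_exporter` are hypothetical and not part of Narwhals or arro3, and the placeholder passed to `RustResult` stands in for the real PyCapsule that a Rust extension would produce.

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class ArrowStreamExportable(Protocol):
    """Structural type: anything exposing the Arrow C stream dunder method."""

    def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object: ...


class RustResult:
    """Hypothetical stand-in for an object returned from a Rust extension."""

    def __init__(self, capsule: object) -> None:
        self._capsule = capsule

    def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
        # A real exporter would return a PyCapsule wrapping an Arrow C stream;
        # this sketch simply hands back the placeholder it was given.
        return self._capsule


def accepts_exporter(obj: Any) -> bool:
    """Check only for the presence of the dunder method (duck typing)."""
    return isinstance(obj, ArrowStreamExportable)
```

Because the check is structural, the consumer never needs to import the producer's library, which is the decoupling the comment points to.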

> which then could be imported into narwhals using its existing PyCapsule Interface support

Is import already possible in Narwhals? I was under the impression that it was currently only supported on export.

A from_arrow method like you have in arro3 (with an additional namespace argument) would work. Since Narwhals already supports wrapping pyarrow, it seemed like it could be confusing for end users, but maybe this would be a powerful way to convert between libraries.

jonmmease, Oct 09 '24 17:10

Oh maybe it's only for export? I'm not up to date.

kylebarron, Oct 09 '24 18:10

This is definitely in scope, thanks for the request! I'll try to put something together soon-ish and we can figure out the details.

MarcoGorelli, Oct 12 '24 19:10

I've given this a go in https://github.com/narwhals-dev/narwhals/pull/1181, does it look alright / is it what you were looking for?

For libraries which don't (yet?) support the PyCapsule interface for import, I'm first going through PyArrow (if installed and at least version 14) and then converting from there (the currently-supported libraries all have a way of going directly from pyarrow tables).

MarcoGorelli, Oct 16 '24 07:10
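
The fallback MarcoGorelli describes above can be sketched roughly as follows. This is not the code from the PR: the function names and the route labels `"native"` and `"via-pyarrow"` are assumptions made for illustration, and only the "installed and at least version 14" condition comes from the comment.

```python
from importlib import metadata
from typing import Optional

MIN_PYARROW_MAJOR = 14  # threshold mentioned in the comment above


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '14.0.1'."""
    return int(version.split(".", 1)[0])


def installed_pyarrow_major() -> Optional[int]:
    """Major version of the installed pyarrow, or None if it is missing."""
    try:
        return major_version(metadata.version("pyarrow"))
    except metadata.PackageNotFoundError:
        return None


def import_route(backend_supports_capsules: bool, pyarrow_major: Optional[int]) -> str:
    """Pick how to ingest a PyCapsule exporter for a given backend."""
    if backend_supports_capsules:
        # The backend can consume the exporter directly.
        return "native"
    if pyarrow_major is not None and pyarrow_major >= MIN_PYARROW_MAJOR:
        # Build a pyarrow table first, then convert to the target backend.
        return "via-pyarrow"
    raise ModuleNotFoundError(
        f"pyarrow>={MIN_PYARROW_MAJOR} is required for backends without capsule import"
    )
```

A caller would evaluate `import_route(supports, installed_pyarrow_major())` once per conversion and branch on the returned label.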

Awesome, this looks great. Thanks so much for the quick implementation!

jonmmease, Oct 16 '24 10:10

@jonmmease are you looking for an API that takes a capsule or an API that takes a Python object that exports a capsule object? Your first example seemed to be the former, but #1181 implements the latter.

(I think the latter is more useful for more end users, while the former could be useful from native code specifically.) In arro3 I have both from_arrow and from_arrow_pycapsule.

kylebarron, Oct 16 '24 14:10
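
The distinction kylebarron draws above, between an API that takes a raw capsule and one that takes an object exporting a capsule, might look like this in a toy class. `ToyTable` and `Exporter` are hypothetical; the two constructor names merely echo the ones mentioned for arro3 and do not reproduce its implementation. As a labeled stand-in for an Arrow capsule, the sketch borrows CPython's `datetime.datetime_CAPI`, which is a genuine PyCapsule but carries no Arrow data.

```python
import datetime
from typing import Any


def is_pycapsule(obj: Any) -> bool:
    """True for raw CPython capsule objects, whose type is named 'PyCapsule'."""
    return type(obj).__name__ == "PyCapsule"


class ToyTable:
    """Hypothetical container; only the two constructor shapes matter here."""

    def __init__(self, capsule: Any) -> None:
        self.capsule = capsule

    @classmethod
    def from_arrow_pycapsule(cls, capsule: Any) -> "ToyTable":
        # Former style: the caller already holds the raw capsule.
        if not is_pycapsule(capsule):
            raise TypeError("expected a raw PyCapsule")
        return cls(capsule)

    @classmethod
    def from_arrow(cls, obj: Any) -> "ToyTable":
        # Latter style: the caller passes any object that can export a capsule.
        if not hasattr(obj, "__arrow_c_stream__"):
            raise TypeError("object does not implement __arrow_c_stream__")
        return cls.from_arrow_pycapsule(obj.__arrow_c_stream__())


class Exporter:
    """Hypothetical exporter, e.g. a table object returned from native code."""

    def __arrow_c_stream__(self, requested_schema: Any = None) -> Any:
        # Stand-in capsule (CPython only); a real exporter returns Arrow data.
        return datetime.datetime_CAPI
```

In this shape `from_arrow` is a thin wrapper over the capsule-level constructor, so a library can offer both without duplicating logic.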

Either is fine on my end. In practice I think I'll be returning an arro3.core.Table from Rust to Python. So I'd defer to you and @MarcoGorelli on the best API for building a Narwhals DataFrame from an arro3.core.Table instance.

jonmmease, Oct 16 '24 18:10

In that case, just the from_arrow constructor would be fine.

kylebarron, Oct 16 '24 19:10