
Use Arrow C Data Interface functions for Python interop

Open vyasr opened this issue 1 year ago • 17 comments

Description

This PR replaces the internals of from_arrow in pylibcudf with an implementation that uses the Arrow C Data Interface via the Python capsule interface. This partially decouples our Python builds from pyarrow Cython (we haven't replaced the to_arrow conversion yet) and also lets us support any other Python package that is a producer of the data interface.
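
For illustration, here is a minimal sketch (not the actual pylibcudf code) of what the consumer side of the capsule interface looks like: the producer hands over PyCapsules wrapping the C structs, which the Cython layer can then unpack without going through pyarrow.

def export_capsules(obj):
    # Minimal sketch, not the pylibcudf implementation: pull the PyCapsules
    # defined by the Arrow PyCapsule interface off an arbitrary producer.
    if hasattr(obj, "__arrow_c_stream__"):
        # A single capsule named "arrow_array_stream" wrapping an ArrowArrayStream.
        return obj.__arrow_c_stream__()
    if hasattr(obj, "__arrow_c_array__"):
        # A (schema, array) pair of capsules wrapping an ArrowSchema and an ArrowArray.
        return obj.__arrow_c_array__()
    raise TypeError("object does not implement the Arrow PyCapsule interface")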

To support the above functionality, the following additional changes were needed in this PR:

  • Added the ability to produce cudf tables from ArrowArrayStream objects since that is what pyarrow.Table produces. This function is a simple wrapper around the existing from_arrow(ArrowArray) API.
  • Added support for the large strings type: support for it has improved throughout cudf since the from_arrow_host API was added, and we now require a basic overload for tests to pass. I did not add corresponding support for from_arrow_device to avoid ballooning the scope of this PR, so that work can be done in a follow-up.
  • Proper handling of type_id::EMPTY in concatenate, because the most natural implementation of the ArrowArrayStream processing is to run from_arrow on each chunk and then concatenate the outputs, and from the Python side we can produce chunks of all-null arrays from arrow (see the sketch just after this list).
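
As a concrete illustration of the last bullet (a sketch, assuming a pyarrow version that implements __arrow_c_stream__), the Python side can hand over a stream whose chunks are all-null arrays; each such chunk maps to a column of type_id::EMPTY in libcudf, which is why concatenate needs to handle EMPTY:

import pyarrow as pa

# An all-None pyarrow array infers the Arrow "null" type.
table = pa.table({"a": pa.array([None, None, None, None])})

# Slicing into small batches yields a stream with several all-null chunks;
# the reader exports the ArrowArrayStream capsule that pylibcudf consumes.
reader = table.to_reader(max_chunksize=2)
capsule = reader.__arrow_c_stream__()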

Contributes to #14926

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

vyasr avatar Jun 01 '24 00:06 vyasr

@zeroshade @paleolimbot if you have a few moments free I'd appreciate a review here! I have a few specific questions that I wasn't sure about:

  1. There are a couple of TODOs I note regarding object lifetimes in the C++ test
  2. For the Python capsules, are we guaranteed that a Python object implementing the __arrow_c_array__ protocol is a single array, as opposed to a table contained within a struct array? I pushed to keep those more cleanly separated in our C++ APIs in prior PRs, but can we rely on Python libraries that implement these interfaces always using __arrow_c_stream__ for a table while using __arrow_c_array__ only for single arrays?
  3. Is there an expectation that a new version of the array stream will eventually exist that supports device arrays? Not important to address in this PR, just trying to get a sense of what future work there might be.

vyasr avatar Jun 04 '24 18:06 vyasr

ccing @jorisvandenbossche for some of the PyCapsule related questions as well 😄

kkraus14 avatar Jun 04 '24 20:06 kkraus14

Thanks Keith!

vyasr avatar Jun 05 '24 00:06 vyasr

can we rely on Python libraries that implement these interfaces always using __arrow_c_stream__ for a table while using __arrow_c_array__ only for single arrays?

It's definitely not guaranteed (although I'm the main offender here since I think I added all the exceptions 😬 ): a ChunkedArray implements __arrow_c_stream__ and a nanoarrow.Array implements both. So both a ChunkedArray of structs and a Table might look the same according to the PyCapsule protocol (although the table will typically be marked as a non-nullable struct, unless C# is doing the exporting).
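
A small demonstration of that ambiguity (an illustrative sketch, assuming a pyarrow recent enough to implement the PyCapsule protocol):

import pyarrow as pa

tab = pa.table({"x": [1, 2, 3]})
chunked = pa.chunked_array(
    [pa.StructArray.from_arrays([pa.array([1, 2, 3])], names=["x"])]
)

# Both objects export __arrow_c_stream__, and both streams describe a
# struct<x: int64>, so a consumer that only looks at the protocol cannot
# tell the table apart from the chunked array of structs.
print(hasattr(tab, "__arrow_c_stream__"), hasattr(chunked, "__arrow_c_stream__"))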

Is there an expectation that a new version of the array stream will eventually exist that supports device arrays?

I believe so! The C struct + specification already exists, and something like __arrow_c_device_stream__ will export a capsule to one (but @zeroshade and @jorisvandenbossche have been following this more closely).

paleolimbot avatar Jun 05 '24 01:06 paleolimbot

This isn't really about the capsules, it's more about the C interface I suppose. Clearly there's a bijection between struct arrays and tables, so the representation doesn't appear to be lossy, but as soon as you attempt operations it's clear that the two objects aren't isomorphic and some information is being lost. How are other libraries handling this, either at the C or the Python layer? IOW if a producer generates an arrow array of type struct, how does the consumer know whether that should be turned into a table or a struct array? If we treat pyarrow as the canonical consumer, how is pyarrow handling that?

vyasr avatar Jun 05 '24 17:06 vyasr

how does the consumer know whether that should be turned into a table or a struct array? If we treat pyarrow as the canonical consumer, how is pyarrow handling that?

For pyarrow, I believe the user always has to start with a constructor: there's no from_capsule()...it is more like pa.array() and pa.table() can accept capsules (so pyarrow always knows what type of object it's trying to construct before it chooses which __ methods to look for). If pyarrow accepted arbitrary objects in its compute functions it would have to deal with the same issue...off the top of my head I might suggest that it try pa.chunked_array(input_value) and convert to a Scalar if it has length 1. I am not sure the table/struct array distinction would matter there because pyarrow compute functions don't currently operate on tables anyway.

For nanoarrow (whose behaviour has flexibility to change), because it's built around the C data interface, it doesn't make any attempt to disambiguate: there are c_array_stream()s, c_array()s, and c_schema()s, and the higher-level Array as an entry point. In general, operations are built to accept a na.c_array_stream() as input (because anything can be an array stream but the converse is not true).
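
For example, a producer that only implements the protocol can still be consumed, but the caller picks the target type by choosing the constructor (an illustrative sketch, assuming a pyarrow with PyCapsule support; Producer is a made-up class):

import pyarrow as pa

class Producer:
    # A minimal third-party object that only speaks the PyCapsule protocol.
    def __init__(self, table):
        self._table = table

    def __arrow_c_stream__(self, requested_schema=None):
        return self._table.__arrow_c_stream__(requested_schema)

src = Producer(pa.table({"x": [1, 2, 3]}))
# pyarrow knows it is building a Table before it looks for the __ methods.
result = pa.table(src)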

(Apologies if I'm missing an important point here!)

paleolimbot avatar Jun 05 '24 18:06 paleolimbot

For pyarrow, I believe the user always has to start with a constructor: there's no from_capsule()...it is more like pa.array() and pa.table() can accept capsules (so pyarrow always knows what type of object it's trying to construct before it chooses which __ methods to look for). ... (Apologies if I'm missing an important point here!)

You're not missing anything, but this seems like a meaningful limitation, at least on the Python side. Consider two C++ libraries foo and bar and an application that uses them:

// app.cpp
foo::obj obj = ...;
ArrowArray x = foo::get_array(obj);
bar::array y = bar::array::FromArray(x);  // Option 1
bar::table y = bar::table::FromTable(x);  // Option 2

Is it always the responsibility of the application to orchestrate this correctly and know whether foo::get_array returns an array or a table? Your answer regarding pyarrow suggests that the answer here is yes. In C++ I suppose that's fine because ultimately C++ is strongly typed, so it's not really that restrictive: you don't need any more information to make the decision here than you already would to type your objects. Ultimately, since you're producing the array from obj, you must know whether that object is going to produce an array-like or table-like piece of data. If you added more layers of indirection between foo and bar above, fundamentally the same argument would still apply because each layer would have to forward along type information anyway. Similar arguments would apply to C, modulo the classes. It does mean that bar essentially has no way to error-check this and relies on the caller providing the right "kind" of array to the API, but it seems OK to push that responsibility onto the application.

That choice seems less desirable in Python due to the potential for dynamically dispatching based on the presence of the attribute. You could do what pyarrow does, which boils down to strongly typing the inputs and falling back to C-like behavior. However, if you wanted to write Pythonic code using protocols the way I would for other protocols like the array interface, I would have a code path that doesn't check an object's type, just whether it exports the protocol. In that scenario, the loss of information seen by the consumer feels more problematic than it does in the C example. WDYT?
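
To make the contrast concrete, a purely illustrative sketch (none of these names come from the PR) of the two dispatch styles:

import pyarrow as pa

def classify(obj):
    # Protocol-only view: a Table and a ChunkedArray of structs both answer
    # True for the stream check, so "table vs. array" is already lost here.
    by_protocol = (
        "stream" if hasattr(obj, "__arrow_c_stream__")
        else "array" if hasattr(obj, "__arrow_c_array__")
        else "unsupported"
    )
    # Type-based view: the missing information comes from the object's type
    # (or from the constructor the caller chose), not from the capsules.
    by_type = "table" if isinstance(obj, pa.Table) else "array-like"
    return by_protocol, by_type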

vyasr avatar Jun 05 '24 23:06 vyasr

I think you are right that it is helpful to know/dispatch based on attributes of the source object; however, I am not sure the mere presence/absence of those methods is ever going to be sufficient to capture what you want to know. @zeroshade has a mailing list discussion open about using the ArrowSchema to do some of this, there is another one about passing statistics around via schema metadata, and I believe you're active on the "I want to know if this is going to make a copy" one. All of these are along the same lines ("I want to know more about the source object").

I think of the presence of the __arrow_c_schema__, __arrow_c_array__, and __arrow_c_stream__ protocol methods as "this object has an unambiguous/lossless representation of itself as X" (where X is a schema, array, or stream), which is perhaps the safest assumption (at least for now).

paleolimbot avatar Jun 06 '24 01:06 paleolimbot

The mailing thread that Dewey mentioned is https://lists.apache.org/thread/95ovmdc1f6x02h2dlmrjjrgxw0gsgx9o

The idea in that thread is to add flags to indicate whether an ArrowArray / ArrowSchema represents a RecordBatch vs a StructArray. But indeed, at the moment there is no way to distinguish them. Generally, when pyarrow tries to consume an ArrowArray as a batch (and not as an array), we check that the input is a struct type and has no top-level validity bitmap (since a RecordBatch can never have one). But this indeed means that right now it is up to the user to ensure they pass a "correct" object. For example, if you expose a function that is expected to take a table-like object (like pyarrow.table(..)), then we currently assume the user knows to pass that kind of object (if they pass an array, and it happens to be a struct array, pyarrow will convert it to a table). Likewise, for pyarrow.array(..) we assume the user intends to pass an array-like object, and so will attempt to convert anything with a __arrow_c_array__ to a pyarrow Array (and if the user passed a table/batch-like object, pyarrow will convert it to a struct array).
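
That check could be sketched roughly as follows (not pyarrow's actual code; the null-count test only approximates "has no top-level validity bitmap"):

import pyarrow as pa

def looks_batch_like(arr: pa.Array) -> bool:
    # A RecordBatch exported as an ArrowArray is a struct with no top-level
    # validity bitmap; null_count == 0 is a close approximation of that.
    return pa.types.is_struct(arr.type) and arr.null_count == 0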

Generally that works right now for pyarrow, but it certainly means you can't distinguish them at the moment if you have some functionality that could take both an array or a table.

jorisvandenbossche avatar Jun 06 '24 22:06 jorisvandenbossche

@jorisvandenbossche I may have missed it in an earlier comment, but what is the current status of having a __arrow_c_device_array__ and __arrow_c_device_stream__ in pyarrow?

@vyasr I'm currently traveling for a conference, but I'll try to give this a look over the weekend or next week.

zeroshade avatar Jun 07 '24 20:06 zeroshade

Generally that works right now for pyarrow, but it certainly means you can't distinguish them at the moment if you have some functionality that could take both an array or a table.

OK cool, thanks for confirming. It sounds like for now even though we'll be using the protocols to get the data, we should still dispatch based on the type of the object rather than strictly by checking for the existence of the protocol methods.

vyasr avatar Jun 11 '24 00:06 vyasr

@vyasr I'm currently traveling for a conference, but I'll try to give this a look over the weekend or next week.

Sorry I know I've pinged you a few times over the past couple of weeks. Take your time and enjoy the conference, no rush here. Also I think Dewey and Joris have covered a lot of the bases here. If you do get around to looking, Joris's question regarding why we don't release the ArrowArray and whether that's a change we should be making in from_arrow_host is probably the main one that could use your input.

vyasr avatar Jun 11 '24 00:06 vyasr

I may have missed it in an earlier comment, but what is the current status of having a __arrow_c_device_array__ and __arrow_c_device_stream__ in pyarrow?

That discussion lives in https://github.com/apache/arrow/issues/38325. I think there is general agreement on the overall approach, but some discussion remains about the details of how to handle (future) keywords (I will try to revive it).

jorisvandenbossche avatar Jun 12 '24 11:06 jorisvandenbossche

I'll get this PR updated soon based on the discussions above.

vyasr avatar Jun 14 '24 01:06 vyasr

I'm back from my conference travels, so I'm slowly working through my backlog and should get to this within the next day or so to review and respond to the various discussions here. Sorry for the delays!

zeroshade avatar Jun 17 '24 14:06 zeroshade

I may have missed it in an earlier comment, but what is the current status of having a __arrow_c_device_array__ and __arrow_c_device_stream__ in pyarrow?

That discussion lives in https://github.com/apache/arrow/issues/38325. I think there is general agreement on the overall approach, but some discussion remains about the details of how to handle (future) keywords (I will try to revive it).

Thanks, let's revive that discussion and get to an agreement on this, as it would most likely be preferable to leverage a __arrow_c_device_array__ pycapsule where appropriate rather than any other option.

zeroshade avatar Jun 17 '24 22:06 zeroshade

I think this PR is ready to go now (pending reviews). Thanks to everyone for the fruitful discussions! There is plenty of follow-up work to be done that came out of it.

vyasr avatar Jun 26 '24 21:06 vyasr

Running

(rapids) coder _ ~/cudf $ GTEST_CUDF_RMM_MODE=cuda LIBCUDF_MEMCHECK_ENABLED=1 compute-sanitizer --tool memcheck ./cpp/build/latest/gtests/INTEROP_TEST 
========= ERROR SUMMARY: 0 errors

vyasr avatar Jul 01 '24 18:07 vyasr

/merge

vyasr avatar Jul 02 '24 07:07 vyasr