
Use of cudf Column internals

Open vyasr opened this issue 4 months ago • 5 comments

I'm opening this issue to follow up on https://github.com/scikit-hep/awkward/pull/3328 and https://github.com/rapidsai/cudf/issues/17483. I know that awkward array has relied on some of cuDF's internal Column APIs in the past. cuDF is now attempting to provide better low-level APIs via the pylibcudf library to allow direct interaction with our C++ primitives. We'll eventually be working on exposing a more easily pluggable middle layer (i.e. to be able to deal with objects in our pandas-like layer), but for now I'm hoping that the pylibcudf bindings are sufficient for a lot of what awkward needs. It is sufficient to call to_pylibcudf on cudf.Series/DataFrame objects to get the underlying pylibcudf types, and the reverse can be done with from_pylibcudf.

Could you please let us know if those types are providing the accessors and methods that you need to support your use cases? That includes however you were using Series._from_column in https://github.com/rapidsai/cudf/issues/17483, but also anything else that you might currently be reaching into our internals for. We'd love to provide suitable public APIs to help you all out now that we're far enough along in our internals refactor to support that.

vyasr (Aug 06 '25 19:08)

CC @martindurant in case you have thoughts here.

vyasr (Aug 06 '25 19:08)

When I previously looked at how the code was shaping up, it looked as though this will require very little change from our end, we'll be able to switch over to pylibcudf without issue. However, it still needs someone to put in the time to do the work and make sure everything still works.

What is the release plan for pylibcudf, and does the older Column stuff disappear now?

> anything else that you might currently be reaching into our internals for

It is all essentially a to/from_buffers operation of the type arrow does for the CPU; so for any structured series, it should be possible to get a number of buffer objects (which are cupy or can easily be made so) depending on the type of that column. Or, conversely, to come with a bunch of buffers (or cupy objects) and be able to assemble a series from them.
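To make the to/from_buffers idea concrete, here is a minimal host-side sketch for a list-of-int64 column, using only the Python standard library. The function names `to_buffers` and `from_buffers` are hypothetical and are not part of cudf, pylibcudf, or awkward; the layout (validity bitmap, offsets, flattened values) follows the Arrow columnar format, and on the GPU each buffer would be a device array rather than an `array.array`.

```python
from array import array


def to_buffers(rows):
    """Split a list-of-int64 column (None marks a null row) into Arrow-style buffers."""
    validity = bytearray((len(rows) + 7) // 8)  # one bit per row, least-significant bit first
    offsets = array("i", [0])                   # n + 1 offsets (32-bit on common platforms)
    values = array("q")                         # flattened 64-bit child values
    for i, row in enumerate(rows):
        if row is not None:
            validity[i // 8] |= 1 << (i % 8)    # mark row i as valid
            values.extend(row)
        offsets.append(len(values))             # a null row repeats the previous offset
    return validity, offsets, values


def from_buffers(validity, offsets, values):
    """Reassemble the rows from the three buffers."""
    rows = []
    for i in range(len(offsets) - 1):
        if (validity[i // 8] >> (i % 8)) & 1:
            rows.append(list(values[offsets[i]:offsets[i + 1]]))
        else:
            rows.append(None)
    return rows


rows = [[1, 2], None, [], [3]]
validity, offsets, values = to_buffers(rows)
roundtrip = from_buffers(validity, offsets, values)
```

The number and meaning of the buffers depends on the column type: a plain numeric column has only validity and data, while nested types add offsets and child columns.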

martindurant (Aug 07 '25 14:08)

> it looked as though this will require very little change from our end, we'll be able to switch over to pylibcudf without issue

I'd hope for the shift to be pretty painless since the underlying data schema is still Arrow and accessing buffers works in a similar way. You'll just need to switch to a couple of different API calls.

> What is the release plan for pylibcudf

pylibcudf has been publicly released since last August or so, and it is released on the same cadence as cudf itself.

> does the older Column stuff disappear now

We are in the process of reworking the older Column layer. That work is being tracked in https://github.com/rapidsai/cudf/issues/18726. I anticipate there always being some Column layer in cudf that sits above pylibcudf and provides the core pandas-like behaviors, but we are still working out exactly what that will look like.

> It is all essentially a to/from_buffers operation of the type arrow does for the CPU

pylibcudf now fully supports the Arrow capsule interfaces so all the buffers can be extracted directly that way. I would say that is probably the most robust and future-proof approach for getting the data so that you don't need to learn too much about pylibcudf's own APIs. That being said, you can also directly access data/offset buffers and children from pylibcudf Columns via accessors like this one.

vyasr (Aug 11 '25 17:08)

> pylibcudf now fully supports the Arrow capsule interfaces so all the buffers can be extracted directly that way

If that is what you recommend, how in practice can it be called?

martindurant (Aug 11 '25 18:08)

I was taking a peek at the code base here since I don't know it very well, and the answer depends a bit on exactly how you want to integrate. If you would prefer to integrate in the awkward Python package with pure Python code, then I think your best option is to access the data buffers via the accessors I linked in my previous comment. That will allow you to get raw integer pointers that you can pass around to whatever you need.
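As a rough illustration of what "a raw integer that you pass around" means, here is a host-memory stand-in using only the standard library. It is an analogy under stated assumptions, not pylibcudf code: the integer here is a CPU address taken from an `array.array`, whereas a pylibcudf accessor would hand back a device address that has to be wrapped by a GPU array library such as CuPy.

```python
import ctypes
from array import array

host = array("q", [1, 2, 3])           # stand-in for a column's int64 data buffer
address, length = host.buffer_info()   # raw integer address and element count

# Rebuild a zero-copy view from nothing but the integer address and the length.
view = (ctypes.c_int64 * length).from_address(address)
view[0] = 10                           # writes through to the original storage
result = list(host)

# The view does not own the memory: `host` must stay alive (and must not be
# resized) for as long as `view` is in use.
```

The same ownership caveat applies on the device: whatever object owns the buffer has to outlive every view built from its pointer.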

If you instead prefer to process the objects directly in C using Python C APIs, then you would use the Python capsule interface. It looks like awkward-cpp is using scikit-build-core with pybind11, so you would use the py::capsule type. The capsules above satisfy the Arrow PyCapsule specification, so any library that speaks that convention will know how to get the data out. For example, you could use nanoarrow to access the buffers:

❯ python
Python 3.13.5 | packaged by conda-forge | (main, Jun 16 2025, 08:27:50) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cudf
>>> s = cudf.Series([1, 2, 3])
>>> plc_col, metadata = s.to_pylibcudf()
>>> plc_col.__arrow_c_device_array__()
(<capsule object "arrow_schema" at 0x7fdfd14a91c0>, <capsule object "arrow_device_array" at 0x7fdfd14a9350>)
>>> idx_plc_col, idx_metadata = metadata["index"].to_pylibcudf()
>>> idx_plc_col.__arrow_c_device_array__()
(<capsule object "arrow_schema" at 0x7fdfd14a97b0>, <capsule object "arrow_device_array" at 0x7fdfd14a96c0>)
>>> import nanoarrow.device
>>> nanoarrow.device.c_device_array(plc_col)
<nanoarrow.device.CDeviceArray>
- device_type: CUDA <2>
- device_id: 0
- array: <nanoarrow.c_array.CArray int64>
  - length: 3
  - offset: 0
  - null_count: 0
  - buffers: (0, 140599103586304)
  - dictionary: NULL
  - children[0]:

You want device data views, but just FYI we also provide the corresponding host capsules (which do of course incur a D2H copy). So you could also do:

>>> import nanoarrow
>>> nanoarrow.c_array(plc_col)
<nanoarrow.c_array.CArray int64>
- length: 3
- offset: 0
- null_count: 0
- buffers: (0, 94750820544560)
- dictionary: NULL
- children[0]:

Using the capsules lets you process something that conforms to the Arrow C data interface spec using a relatively nice interface. The direct accessor approach is more bare-bones and requires that you know the details of the Arrow spec yourself and how to stitch together the different accessors from pylibcudf. I'm guessing that option wouldn't be too hard for you though since you're already doing that for cudf in awkward.
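On the consumer side, choosing between the device and host capsules can be sketched as a small duck-typed dispatch. The dunder method names come from the Arrow PyCapsule interface; `export_arrow` and `FakeColumn` are hypothetical names used only for illustration, and the fake producer returns the capsule names as plain strings where a real producer would return PyCapsule objects.

```python
def export_arrow(obj, prefer_device=True):
    """Return (kind, capsules) using whichever Arrow PyCapsule method obj offers."""
    if prefer_device and hasattr(obj, "__arrow_c_device_array__"):
        return "device", obj.__arrow_c_device_array__()
    if hasattr(obj, "__arrow_c_array__"):
        # Host export; for GPU-backed data this implies a device-to-host copy.
        return "host", obj.__arrow_c_array__()
    raise TypeError(f"{type(obj).__name__} does not export Arrow capsules")


class FakeColumn:
    """Hypothetical stand-in producer; returns strings instead of real capsules."""

    def __arrow_c_array__(self, requested_schema=None):
        return ("arrow_schema", "arrow_array")

    def __arrow_c_device_array__(self, requested_schema=None, **kwargs):
        return ("arrow_schema", "arrow_device_array")


kind, capsules = export_arrow(FakeColumn())
host_kind, host_capsules = export_arrow(FakeColumn(), prefer_device=False)
```

Preferring the device method keeps the data on the GPU; the host path is a fallback for consumers that can only read CPU memory.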

vyasr (Aug 14 '25 17:08)