cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Migrate expressions to pylibcudf

Open lithomas1 opened this issue 1 year ago • 1 comments

Description

xref #15162

Migrates expresions to use pylibcudf.

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

lithomas1 avatar Jun 18 '24 18:06 lithomas1

Draft pending resolution of Vyas's comments.

lithomas1 avatar Jun 26 '24 00:06 lithomas1

@vyasr Let's put this in as is for now?

lithomas1 avatar Jul 15 '24 17:07 lithomas1

Ah, I realise I had the same high-level issue as @vyasr, it seems like you'd decided to do that separately?

wence- avatar Jul 16 '24 14:07 wence-

Ah, I realise I had the same high-level issue as @vyasr, it seems like you'd decided to do that separately?

Yeah, I kinda thought about this for a bit, but it's a pretty involved fix. (but eventually, I would like this to take a pylibcudf Scalar)

Easy way out is: Just convert everything to a pyarrow scalar and call interop on it (this is what DeviceScalar does). This is probably not good for the goal of getting pyarrow out of the Cython.

I could also try to delete the datetime/timedelta handling code in this PR. (I don't think that gets used in cuDF Python)

The more correct way is to use the interchange mechanisms on the array object/scalar object we take in which I talked about with Vyas.

IMO, the best way to do this is to use DLPack, so we only have to support one thing. (Not sure if we need C++ changes here, though )

lithomas1 avatar Jul 16 '24 14:07 lithomas1

(but eventually, I would like this to take a pylibcudf Scalar)

Why is this one so hard? I think you just need to have a big switch statement in the constructor that takes the type-erased Scalar object and constructs (via casting) the correct numeric_scalar/timestamp_scalar/duration_scalar.

Effectively you would just replace the current switch statement that dispatches on numpy-like types with one that dispatches on scalar.type().id(), I think.

What am I missing?

wence- avatar Jul 16 '24 14:07 wence-

I could also try to delete the datetime/timedelta handling code in this PR. (I don't think that gets used in cuDF Python)

This is kind of critical for predicate pushdown in cudf-polars.

wence- avatar Jul 16 '24 14:07 wence-

(but eventually, I would like this to take a pylibcudf Scalar)

Why is this one so hard? I think you just need to have a big switch statement in the constructor that takes the type-erased Scalar object and constructs (via casting) the correct numeric_scalar/timestamp_scalar/duration_scalar.

I meant creating the Scalar itself. The Scalar constructor has to accept a lot of possible inputs (e.g. Python scalars, Numpy scalars, Pyarrow scalars, potentially cupy scalars)

I think your proposed approach would work after we fixed the Scalar constructor. (actually now that I think about it, maybe fixing this isn't a very immediate concern, I thought we'd need pyarrow in the Cython, but we can just force people to do plc.interop.from_arrow in their Python code outside the Cython)

EDIT: I will try the approach taking in a Scalar.

lithomas1 avatar Jul 16 '24 14:07 lithomas1

I think your proposed approach would work after we fixed the Scalar constructor.

Yeah, I think from the pylibcudf point of view in some sense, for the expression API, it is not (or should not be) the job of pylibcudf to help with the construction of a Scalar in the first place.

wence- avatar Jul 16 '24 16:07 wence-

/merge

lithomas1 avatar Jul 16 '24 20:07 lithomas1