Daft icon indicating copy to clipboard operation
Daft copied to clipboard

Accessor registration

Open martindurant opened this issue 1 year ago • 3 comments

Our project at https://github.com/intake/awkward-pandas is building integrations to dataframe libraries, allowing vectorised processing of nested, variable-length data structures via awkward, i.e., deeper data types than usually handles by dataframe APIs.

Awkward itself boasts, aside from vectorised kernels:

  • numpy-like API (e.g., slicing or aggregations on any index)
  • CPU and GPU kernels
  • numba integration
  • "behaviours" which give something like methods to records of known structures (think operations on vectors, IP addresses or polygons)

It's internal memory model is very close to arrow and interops nicely.

As the name suggests, we trialed awkward-pandas for pandas first, but now we have integrations of various completeness:

  • pandas
  • dask.dataframe
  • polars
  • cuDF (and will therefore be looking for a new name)

These integrations follow the accessor pattern: df.ak.* or series.ak.* gives you the awkward namespace and slicing, e.g.:

df = ...
df["col1"].ak.sum(axis=2) # performs awkward-style sum on inner axis, returning series
df.ak[..., -1] # select last item in inner most list of every column

So, integrating with daft would be the first chance to bring this style of processing to Ray. As with dask.dataframe, making calls that work row-wise or partition-wise is easy; aggregations require first intra-partition operations, followed by some tree reduction; and inter-partition operations are simply hard. (The dask-awkward project shows we are working on this too, from the other direction. This would not be where to start any cooperation!)

martindurant avatar May 08 '24 01:05 martindurant

Thanks @martindurant

I think exploring integrations on a per-row level for our List/FixedSizeList/Tensor types could be interesting. A slicing API for these types would be quite useful I think.

Does an integration require adoption of the awkward in-memory model though? That might be challenging because Daft has its own in-memory data representations as well.

jaychia avatar May 13 '24 18:05 jaychia

As long as you can pass and consume [py]arrow, all is good! That's exactly what the polars integration does, which also uses arrow as the internal representation. And this would be per-row or per-partition operations.

martindurant avatar May 13 '24 18:05 martindurant

I think exploring integrations on a per-row level for our List/FixedSizeList/Tensor

I should add, that the typical use case is for deeper nested things, containing lists and structs many levels down. You can already do some certain things with the existing list type. For example, a geometry object may be represented as a list of list of records array<[[x, y]]> or record of lists of lists array<[[x]], [[y]]>.

martindurant avatar May 13 '24 18:05 martindurant