cudf
cudf copied to clipboard
[FEA] Support Polars `top_k` expression (group_by)
The top_k expression can be used as a reduction on primitive types, but is often used alongside group_by and will return a nested type.
import polars as pl
from functools import partial
from cudf_polars.callback import execute_with_cudf
use_cudf = partial(execute_with_cudf, raise_on_fail=True) # for testing
df = pl.LazyFrame(
{
'a': [1,1,2,2,3,3],
'b':[1,2,3,4,5,6],
}
)
print(df.group_by("a").agg(pl.col("b").top_k(1)).collect())
print(df.group_by("a").agg(pl.col("b").top_k(1)).collect(post_opt_callback=use_cudf))
shape: (3, 2)
┌─────┬───────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 1 ┆ [2] │
│ 3 ┆ [6] │
│ 2 ┆ [4] │
└─────┴───────────┘
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
Cell In[31], line 15
7 df = pl.LazyFrame(
8 {
9 'a': [1,1,2,2,3,3],
10 'b':[1,2,3,4,5,6],
11 }
12 )
14 print(df.group_by("a").agg(pl.col("b").top_k(1)).collect())
---> 15 print(df.group_by("a").agg(pl.col("b").top_k(1)).collect(post_opt_callback=use_cudf))
File [/raid/nicholasb/miniconda3/envs/all_cuda-122_arch-x86_64/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942](http://10.117.23.184:8882/lab/tree/raid/nicholasb/raid/nicholasb/miniconda3/envs/all_cuda-122_arch-x86_64/lib/python3.11/site-packages/polars/lazyframe/frame.py#line=1941), in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
1939 # Only for testing purposes atm.
1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))
ComputeError: 'cuda' conversion failed: NotImplementedError: Unary function name='top_k'
WIP implementation in libcudf here: https://github.com/rapidsai/cudf/pull/15272
Progress on this is blocked by https://github.com/rapidsai/cudf/issues/15541. We need a cleaner way to return the output of a top_k aggregation from libcudf when that is mixed with other aggregations that are reductions producing a single output.
Also could use #19096