[FEA] Add a TopK API to cudf

Open davidwendt opened this issue 6 months ago • 1 comments

"Top K" identifies the largest, smallest, or most frequent items within a dataset, which makes it useful in many data analysis and machine learning applications. This issue proposes a topk (or similarly named) API for cudf that returns the largest (or smallest) values in a column, given a value of K as a parameter. This could be a useful accelerated function for many RAPIDS applications.

Two topk APIs are proposed for libcudf:

```cpp
enum topk_sort { LARGEST, SMALLEST };

std::unique_ptr<column> topk(column_view const& input, size_type k, topk_sort tks,
  cuda_stream_view stream, device_async_resource_ref mr);

std::unique_ptr<column> topk_ordered(column_view const& input, size_type k, topk_sort tks,
  cuda_stream_view stream, device_async_resource_ref mr);
```

The first would return the top-K values themselves, while the ordered one would return just the indices of those values, to be used in a subsequent gather call. The default topk_sort parameter would be LARGEST.
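To make the difference between the two proposed functions concrete, here is a minimal host-side sketch in plain C++. It is not libcudf code: the names `topk_ordered_host` and `topk_host` are hypothetical, `std::vector<int>` stands in for a column, and `size_type` is aliased to `std::int32_t` as an assumption. The sketch also assumes results come back best-first, which the proposal above does not specify.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

using size_type = std::int32_t;  // assumption: a signed 32-bit row count

enum topk_sort { LARGEST, SMALLEST };

// Hypothetical host stand-in for topk_ordered: returns the row indices of the
// k largest (or smallest) values, best first. k is clamped to the input size.
std::vector<size_type> topk_ordered_host(std::vector<int> const& input,
                                         size_type k,
                                         topk_sort tks = LARGEST)
{
  assert(k >= 0);
  auto const n = static_cast<size_type>(input.size());
  k = std::min(k, n);

  std::vector<size_type> idx(n);
  std::iota(idx.begin(), idx.end(), 0);  // 0, 1, ..., n-1

  // Only the first k positions need to be ordered, so partial_sort suffices.
  std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                    [&](size_type a, size_type b) {
                      return tks == LARGEST ? input[a] > input[b]
                                            : input[a] < input[b];
                    });
  idx.resize(k);
  return idx;
}

// Hypothetical host stand-in for topk: the values are simply a gather of the
// input using the indices computed above.
std::vector<int> topk_host(std::vector<int> const& input,
                           size_type k,
                           topk_sort tks = LARGEST)
{
  std::vector<int> out;
  for (auto i : topk_ordered_host(input, k, tks)) { out.push_back(input[i]); }
  return out;
}
```

For the input `{5, 1, 9, 3, 7}` with `k = 3`, the index variant yields `{2, 4, 0}` and the value variant yields `{9, 7, 5}`, so gathering with the indices reproduces the values.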

Implementation plan

There is a set of TopK functions in the process of being added to CUB which libcudf could use. Multiple columns (table) may be possible, though support may be somewhat limited. Also, the CUB implementation only supports fixed-width types, so if other types are required (e.g. STRING) then custom kernels may need to be written. Alternatively, top-K could be performed for these types using a 2-step sort and slice.
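The 2-step sort-and-slice fallback for non-fixed-width types can be illustrated on the host with strings. This is only a sketch of the idea in standard C++, not a libcudf implementation: `topk_sort_and_slice` is a hypothetical name, `std::vector<std::string>` stands in for a STRING column, and the `size_type` alias is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

using size_type = std::int32_t;  // assumption: a signed 32-bit row count

// Hypothetical fallback: sort the whole column, then keep the first k rows.
// The input is taken by value so the caller's data is left untouched.
std::vector<std::string> topk_sort_and_slice(std::vector<std::string> input,
                                             size_type k,
                                             bool largest = true)
{
  assert(k >= 0);

  // Step 1: full sort, descending for LARGEST and ascending for SMALLEST.
  if (largest) {
    std::sort(input.begin(), input.end(), std::greater<>{});
  } else {
    std::sort(input.begin(), input.end());
  }

  // Step 2: slice off everything after the first k rows.
  auto const keep = std::min<std::size_t>(static_cast<std::size_t>(k), input.size());
  input.resize(keep);
  return input;
}
```

This does more work than a dedicated top-K selection because every row is sorted, but it relies only on operations that already exist for variable-width data.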

Considerations for nulls apply as well. Since topK results should likely not include nulls, custom iterators (supported by CUB) may be required to exclude these rows.
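The intended null behavior can be sketched on the host as follows. Note that this swaps in a simpler technique than the one suggested above: instead of a custom iterator that skips null rows on the fly, it copies the valid rows into a temporary buffer first. The function name `topk_skip_nulls` and the use of a `std::vector<bool>` validity mask are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical host sketch: returns the k largest non-null values, best first.
// valid[i] == false marks row i as null, so its value is ignored.
std::vector<int> topk_skip_nulls(std::vector<int> const& values,
                                 std::vector<bool> const& valid,
                                 std::size_t k)
{
  assert(values.size() == valid.size());

  // Drop null rows before selecting, so they can never appear in the result.
  std::vector<int> kept;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (valid[i]) { kept.push_back(values[i]); }
  }

  // If fewer than k valid rows exist, return all of them.
  k = std::min(k, kept.size());
  std::partial_sort(kept.begin(),
                    kept.begin() + static_cast<std::ptrdiff_t>(k),
                    kept.end(),
                    std::greater<>{});
  kept.resize(k);
  return kept;
}
```

With values `{4, 100, 8, 6}` and validity `{true, false, true, true}`, asking for `k = 2` gives `{8, 6}`: the null row holding 100 is excluded even though its stored value is the largest.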

davidwendt, Jun 05 '25 15:06

I don't know if you intended it to be in scope for this issue, but beyond implementing this on a per-column basis we would also benefit from supporting it as a groupby aggregation. That would enable https://github.com/rapidsai/cudf/issues/16222.

vyasr, Jun 11 '25 00:06

AFAIK this is closed by #19303. CC @davidwendt

Matt711, Jul 29 '25 16:07