cudf
cudf copied to clipboard
[FEA] first and last as hash based aggregates
Is your feature request related to a problem? Please describe.
In a recent customer query I found that it was rather slow because they were doing a first aggregation along with a number of other small aggregations. First is translated into an NTH_ELEMENT aggregation, either with or without null handling. But NTH_ELEMENT is implemented as a sort based aggregation, not a hash based one. This slows down the entire query. If I replace first with max the entire query gets to be 18% faster.
I know in the general case NTH_ELEMENT cannot be a hash based aggregation, but for common SQL operations like first and last we should be able to make it a hash aggregate.
Describe the solution you'd like
I personally would love for some magic to happen behind the scenes with NTH_ELEMENT where it can become a hash aggregate if n is 0 or -1. But I would be happy to have some new FIRST and LAST aggregations instead, if that is simpler.
The algorithm I have been thinking about for first, is to do a min aggregation on a counting sequence starting at 0. Then we use the result of that as a gather map to pull in the first value from the original input column. If we don't want to include nulls, then instead of using the counting iterator directly we also pull in the null mask from the original input column, and we replace nulls in the gather map with -1 before doing the gather.
For last we would switch it over to doing max aggregation instead of a min.
Describe alternatives you've considered We could writ this ourselves, but the current Spark aggregation code does not make it simple to get access to the original input column after doing the aggregation. I can change that, but I didn't want to do that until I head from CUDF about how hard this might be. Also the CUDF version is going to be more efficient, because I might not have to materialize as much data depending on how much can use thrust iterators to manipulate the original input data.
Additional context Add any other context, code examples, or references to existing implementations about the feature request here.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
I still want this it would be a good performance boost for anyone in SQL doing a first or a last like operation.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
Still wanted.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
#11603 was an attempt to address this issue, but we ran into design challenges. Please refer to the discussion in this comment
Maybe static_reduction_map can help this: https://github.com/NVIDIA/cuCollections/pull/98.
Note that similar mechanism has already been implemented in cudf (https://github.com/rapidsai/cudf/pull/11052). We can adopt the same approach if need this with higher priority.