spark-rapids
Optimizing Expand+Aggregate in SQL queries with many count distinct measures [WIP]
Fixes https://github.com/NVIDIA/spark-rapids/issues/10799. This PR optimizes the Expand and Aggregate execs in the first stage of a SQL query with many count distinct measures.
The optimizations in this PR include:
- Avoid allocating and initializing a large number of null vectors when doing Expand
- Try to coalesce expanded column batches before sending them to Aggregate
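The second optimization can be illustrated with a minimal, self-contained sketch (not the actual spark-rapids code, which operates on GPU `ColumnarBatch`es): many small batches produced by Expand are merged into fewer, larger batches before they reach the aggregate step, so the per-batch overhead of aggregation is paid less often. The `targetRows` threshold is a hypothetical parameter for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Conceptual sketch of coalescing small expanded batches into larger
 *  ones before aggregation. A "batch" is just an int[] of rows here. */
public class CoalesceSketch {
    static List<int[]> coalesce(List<int[]> batches, int targetRows) {
        List<int[]> out = new ArrayList<>();
        List<Integer> pending = new ArrayList<>();
        for (int[] b : batches) {
            for (int v : b) pending.add(v);          // buffer incoming rows
            if (pending.size() >= targetRows) {      // emit once big enough
                out.add(pending.stream().mapToInt(Integer::intValue).toArray());
                pending.clear();
            }
        }
        if (!pending.isEmpty())                      // flush the remainder
            out.add(pending.stream().mapToInt(Integer::intValue).toArray());
        return out;
    }

    public static void main(String[] args) {
        List<int[]> small = List.of(new int[]{1, 2}, new int[]{3}, new int[]{4, 5, 6});
        List<int[]> merged = coalesce(small, 4);
        // Three small batches become one larger batch for Aggregate.
        System.out.println(merged.size()); // prints 1
    }
}
```

In the real exec the trade-off is GPU memory for the buffered batches versus fewer, more efficient aggregate kernel launches.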
build
build
@revans2 @abellina @winningsix can you please take a look at this PR? We're going to pack a debug build based on it.
Please retarget to 24.08
build
@wjxiz1992 query perf pass
Hi @revans2 @abellina, since we're getting often-contradictory conclusions from the customer side, we've decided to put this PR on hold until things are clearer. I'll come back to address your comments once we're confident that these optimizations are always beneficial.
@GaryShen2008, I suggest moving this PR to 24.10 for the reason quoted above.
Please retarget to the 24.10 branch.
Hi @revans2, I simplified the code so that we no longer need to worry about the side effects of globally caching null vectors. The cache reuse ratio will be lower than in the previous version, but it suffices for our customer's use case (a query with many count distincts). Please review again.
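The idea behind the null-vector caching mentioned above can be sketched as follows. This is a hypothetical illustration, not the spark-rapids implementation: Expand projects each input batch into many output batches where most columns are entirely null, so instead of allocating a fresh all-null column per projection, one cached column per (type, row count) is shared. The `NullColumn` type and `acquire` method are invented for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-operator cache that hands out a shared all-null
 *  column instead of allocating a new one for every Expand projection. */
public class NullVectorCache {
    static final class NullColumn {
        final String type; final int rows; int refs = 0;
        NullColumn(String type, int rows) { this.type = type; this.rows = rows; }
    }

    private final Map<String, NullColumn> cache = new HashMap<>();
    int allocations = 0;   // counts real allocations to show the saving

    NullColumn acquire(String type, int rows) {
        String key = type + "/" + rows;
        NullColumn c = cache.get(key);
        if (c == null) {                 // first request: allocate once
            c = new NullColumn(type, rows);
            cache.put(key, c);
            allocations++;
        }
        c.refs++;                        // later requests share the column
        return c;
    }

    public static void main(String[] args) {
        NullVectorCache cache = new NullVectorCache();
        // 100 Expand projections asking for the same all-null INT column:
        for (int i = 0; i < 100; i++) cache.acquire("INT", 1024);
        System.out.println(cache.allocations); // prints 1, not 100
    }
}
```

Scoping the cache to a single operator instance, as the comment above describes, sidesteps the lifetime and cross-task side effects that a global cache would raise, at the cost of a lower reuse ratio.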
build
build
build
close #10799