
Optimizing Expand+Aggregate in SQL queries with many count distincts [WIP]

Open binmahone opened this issue 1 year ago • 7 comments

Fixes https://github.com/NVIDIA/spark-rapids/issues/10799. This PR tries to optimize the Expand and Aggregate execs in the first stage of a SQL query with many count distinct measures.

The optimizations in this PR include:

  1. Avoid allocating and initializing a large number of null vectors when doing Expand
  2. Try to coalesce expanded column batches before sending them to Aggregate
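To make the two ideas concrete, here is a minimal pure-Python sketch. It is not the code in this PR: the names `NullColumnCache`, `expand`, and `coalesce` are hypothetical, plain lists stand in for GPU column vectors, and the cache is assumed to be scoped to a single operator and keyed only by row count (a real cache would also need to key on data type). For a query with N count distinct measures, Expand emits N projections per input batch, and each projection keeps one measure column and nulls out the others.

```python
from typing import Dict, List, Optional

Column = List[Optional[int]]  # toy column vector; None stands for SQL NULL
Batch = List[Column]          # toy columnar batch: equal-length columns


class NullColumnCache:
    """Hypothetical cache: build an all-null column once per row count."""

    def __init__(self) -> None:
        self._by_rows: Dict[int, Column] = {}
        self.allocations = 0  # counts how many null columns were really built

    def get(self, num_rows: int) -> Column:
        col = self._by_rows.get(num_rows)
        if col is None:
            col = [None] * num_rows
            self._by_rows[num_rows] = col
            self.allocations += 1
        return col


def expand(batch: Batch, cache: NullColumnCache) -> List[Batch]:
    """Emit one projection per measure column: keep that column, null the
    others (reusing the cached null column), and append a group-id column."""
    num_rows = len(batch[0])
    projections = []
    for gid in range(len(batch)):
        cols = [batch[i] if i == gid else cache.get(num_rows)
                for i in range(len(batch))]
        cols.append([gid] * num_rows)
        projections.append(cols)
    return projections


def _concat(batches: List[Batch]) -> Batch:
    """Concatenate batches column by column."""
    num_cols = len(batches[0])
    return [[v for b in batches for v in b[c]] for c in range(num_cols)]


def coalesce(batches: List[Batch], target_rows: int) -> List[Batch]:
    """Merge consecutive batches until each output has at least target_rows
    rows; the final output may be smaller."""
    result, pending, pending_rows = [], [], 0
    for b in batches:
        pending.append(b)
        pending_rows += len(b[0])
        if pending_rows >= target_rows:
            result.append(_concat(pending))
            pending, pending_rows = [], 0
    if pending:
        result.append(_concat(pending))
    return result
```

With a 3-column, 2-row input, `expand` produces three projections but builds only one null column instead of six, and `coalesce(projections, 4)` hands the aggregate two batches (4 rows and 2 rows) rather than three small ones. Sharing one null column across projections is only safe when columns are treated as immutable, which this sketch assumes.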

binmahone avatar May 13 '24 06:05 binmahone

build

binmahone avatar May 13 '24 06:05 binmahone

build

binmahone avatar May 14 '24 03:05 binmahone

@revans2 @abellina @winningsix can you please take a look at this PR? We're going to package a debug build based on it.

binmahone avatar May 14 '24 06:05 binmahone

Please retarget to 24.08

sameerz avatar May 29 '24 12:05 sameerz

build

binmahone avatar Jun 26 '24 02:06 binmahone

@wjxiz1992 query perf pass

binmahone avatar Jun 26 '24 06:06 binmahone

Hi @revans2 @abellina, since we're getting often-contradictory conclusions from the customer side, we've decided to hold this PR until things are clearer. I'll come back to address your comments once we're confident that these optimizations are always beneficial.

binmahone avatar Jul 02 '24 02:07 binmahone

Hi @revans2 @abellina, since we're getting often-contradictory conclusions from the customer side, we've decided to hold this PR until things are clearer. I'll come back to address your comments once we're confident that these optimizations are always beneficial.

@GaryShen2008, I suggest moving this PR to 24.10 for the reason quoted above.

binmahone avatar Jul 26 '24 01:07 binmahone

Please retarget to the 24.10 branch.

sameerz avatar Jul 29 '24 23:07 sameerz

Hi @revans2, I simplified the code so that we no longer need to worry about the side effects of globally caching null vectors. The cache reuse ratio will be lower than in the previous version, but it should suffice for our customer's use case (a query with a lot of count distincts). Please help review again.

binmahone avatar Sep 06 '24 05:09 binmahone

build

abellina avatar Sep 23 '24 15:09 abellina

build

revans2 avatar Sep 23 '24 15:09 revans2

build

binmahone avatar Sep 24 '24 06:09 binmahone

close #10799

binmahone avatar Sep 26 '24 01:09 binmahone