spark-rapids
Optimizing Expand+Aggregate in SQL queries with many count distinct measures [WIP]
Fixes https://github.com/NVIDIA/spark-rapids/issues/10799. This PR optimizes the Expand and Aggregate execs in the first stage of a SQL query with many count distinct measures.
The optimizations in this PR include:
- Avoid allocating and initializing a large number of null vectors when doing Expand
- Try to coalesce expanded column batches before sending them to Aggregate
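The second optimization can be illustrated with a minimal, self-contained sketch (not the actual spark-rapids code, which operates on GPU `ColumnarBatch`es): many small batches produced by Expand are merged into fewer, larger batches before they reach the aggregate step, so the per-batch overhead of aggregation is paid less often. The `targetRows` threshold is a hypothetical parameter for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Conceptual sketch of coalescing small expanded batches into larger
 *  ones before aggregation. A "batch" is just an int[] of rows here. */
public class CoalesceSketch {
    static List<int[]> coalesce(List<int[]> batches, int targetRows) {
        List<int[]> out = new ArrayList<>();
        List<Integer> pending = new ArrayList<>();
        for (int[] b : batches) {
            for (int v : b) pending.add(v);          // buffer incoming rows
            if (pending.size() >= targetRows) {      // emit once big enough
                out.add(pending.stream().mapToInt(Integer::intValue).toArray());
                pending.clear();
            }
        }
        if (!pending.isEmpty())                      // flush the remainder
            out.add(pending.stream().mapToInt(Integer::intValue).toArray());
        return out;
    }

    public static void main(String[] args) {
        List<int[]> small = List.of(new int[]{1, 2}, new int[]{3}, new int[]{4, 5, 6});
        List<int[]> merged = coalesce(small, 4);
        // Three small batches become one larger batch for Aggregate.
        System.out.println(merged.size()); // prints 1
    }
}
```

In the real exec the trade-off is GPU memory for the buffered batches versus fewer, more efficient aggregate kernel launches.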
build
build
@revans2 @abellina @winningsix can you please take a look at this PR? We're going to pack a debug build based on it.
Please retarget to 24.08
build
@wjxiz1992 query perf pass
Hi @revans2 @abellina, since we're getting often-contradictory conclusions from the customer side, we've decided to put this PR on hold until things are clearer. I'll come back to address your comments once we're confident that these optimizations are always beneficial.
@GaryShen2008, I suggest moving this PR to 24.10 for the reason quoted above.
Please retarget to the 24.10 branch.
Hi @revans2, I simplified the code so that we no longer need to worry about the side effects of globally caching null vectors. The cache reuse ratio will be lower than in the previous version, but it suffices for our customer's use case (a query with many count distincts). Please review again.
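The idea behind the null-vector caching mentioned above can be sketched as follows. This is a hypothetical illustration, not the spark-rapids implementation: Expand projects each input batch into many output batches where most columns are entirely null, so instead of allocating a fresh all-null column per projection, one cached column per (type, row count) is shared. The `NullColumn` type and `acquire` method are invented for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-operator cache that hands out a shared all-null
 *  column instead of allocating a new one for every Expand projection. */
public class NullVectorCache {
    static final class NullColumn {
        final String type; final int rows; int refs = 0;
        NullColumn(String type, int rows) { this.type = type; this.rows = rows; }
    }

    private final Map<String, NullColumn> cache = new HashMap<>();
    int allocations = 0;   // counts real allocations to show the saving

    NullColumn acquire(String type, int rows) {
        String key = type + "/" + rows;
        NullColumn c = cache.get(key);
        if (c == null) {                 // first request: allocate once
            c = new NullColumn(type, rows);
            cache.put(key, c);
            allocations++;
        }
        c.refs++;                        // later requests share the column
        return c;
    }

    public static void main(String[] args) {
        NullVectorCache cache = new NullVectorCache();
        // 100 Expand projections asking for the same all-null INT column:
        for (int i = 0; i < 100; i++) cache.acquire("INT", 1024);
        System.out.println(cache.allocations); // prints 1, not 100
    }
}
```

Scoping the cache to a single operator instance, as the comment above describes, sidesteps the lifetime and cross-task side effects that a global cache would raise, at the cost of a lower reuse ratio.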
build
build
build
close #10799