datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Generate hash aggregation output in smaller record batches

Open milenkovicm opened this issue 3 years ago • 5 comments

this change would prevent of cloning of whole state, doubling memory needed for aggregation.

relates to apache/arrow-datafusion#1570

Which issue does this PR close?

Closes #3460.

Rationale for this change

What changes are included in this PR?

update poll_next method to return multiple aggregation state batches rather than a single one.

Are there any user-facing changes?

No

milenkovicm avatar Sep 13 '22 10:09 milenkovicm

Thank you @milenkovicm -- I plan to review this more carefully tomorrow morning.

cc @Dandandan and @yjshen

alamb avatar Sep 14 '22 20:09 alamb

Note there is an almost similar copy of the code in https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/aggregates/hash.rs for non row format, which likely needs the same treatment (though we could do it as a follow on PR)

is it line 442 which is "unbounded" ? https://github.com/apache/arrow-datafusion/blob/84bee899958aaf70372ef84811c6787f53fa25eb/datafusion/core/src/physical_plan/aggregates/hash.rs#L442

milenkovicm avatar Sep 15 '22 08:09 milenkovicm

is it line 442 which is "unbounded" ?

Yes that looks correct

alamb avatar Sep 15 '22 12:09 alamb

is it line 442 which is "unbounded" ?

Yes that looks correct

may I suggest merging this one, and I'll try to patch that one in due course.

One question before hand, which will save me some time, which aggregation operators will end up using that hash?

milenkovicm avatar Sep 15 '22 12:09 milenkovicm

One question before hand, which will save me some time, which aggregation operators will end up using that hash?

I think it is based on the type of the aggregate and if it supports a special "row format" added by @yjshen

This ticket describes the reason (and the potential challenges) with having multiple hash aggregate operators: https://github.com/apache/arrow-datafusion/issues/2723

alamb avatar Sep 15 '22 13:09 alamb

Benchmark runs are scheduled for baseline = 011bcf4901718a8a467352275f4074647d2d8ba2 and contender = 0b90a8a5c2635e08e80995954271fd06a256ac96. 0b90a8a5c2635e08e80995954271fd06a256ac96 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2 [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q Buildkite builds: Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Oct 15 '22 11:10 ursabot

Thanks again @milenkovicm

alamb avatar Oct 15 '22 11:10 alamb