datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Row Hash loads whole aggregation state to memory before sending

Open milenkovicm opened this issue 3 years ago • 0 comments

Describe the bug

Row Hash aggregation, loads whole aggregation state to memory before sending a single batch downstream. The resulting record batch will have more rows than predefined batch size

problematic part of code https://github.com/milenkovicm/arrow-datafusion/blob/17f069df4227b837cf2741a545c39a8b68d5fd76/datafusion/core/src/physical_plan/aggregates/row_hash.rs#L438 where iterator without limits is crated, and whole state is cloned, which doubles memory needed for the aggregation state.

function poll_next creates single batch https://github.com/milenkovicm/arrow-datafusion/blob/17f069df4227b837cf2741a545c39a8b68d5fd76/datafusion/core/src/physical_plan/aggregates/row_hash.rs#L146

To Reproduce

Run an aggregation

Expected behavior

Resulting aggregation should be chunked according to the predefined batch size

Additional context

milenkovicm avatar Sep 13 '22 10:09 milenkovicm