Add MemoryReservation to batch splitting in joins

Open alamb opened this issue 1 year ago • 2 comments

Is your feature request related to a problem or challenge?

Follow on to https://github.com/apache/datafusion/pull/12969 and https://github.com/apache/datafusion/issues/12633

In https://github.com/apache/datafusion/issues/12633 @mhilton noted that joins sometimes generate giant record batches, which causes issues. @alihan-synnada fixed this in https://github.com/apache/datafusion/pull/12969, but internally the joins sometimes still generate giant output batches.

As @mhilton says in https://github.com/apache/datafusion/pull/12969#issuecomment-2418862655:

Unfortunately this doesn't address the actual problem with creating giant batches, which is they require a lot of memory and that memory isn't accounted for in any MemoryPool. Wiring a MemoryReservation into BatchSplitter would probably be enough to address this though.

Describe the solution you'd like

I would like the memory accounting to take the large output batches into account.

Describe alternatives you've considered

Wiring a MemoryReservation into BatchSplitter would probably be enough to address this.
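
To make the idea concrete, here is a rough, std-only sketch of what "wiring a reservation into a splitter" could look like. Everything in it is hypothetical: `ToyPool`, `ToyReservation`, and `AccountedSplitter` are simplified stand-ins written for illustration, not DataFusion's `MemoryPool`, `MemoryReservation`, or `BatchSplitter`, and a `Vec<u64>` stands in for a `RecordBatch`. Only the method names `try_grow` and `shrink` are chosen to mirror the real `MemoryReservation`. The sketch assumes the splitter registers the full size of the large batch when it takes ownership and releases it once the last slice has been emitted.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a memory pool with a fixed byte limit.
struct ToyPool {
    limit: usize,
    used: Cell<usize>,
}

/// Hypothetical stand-in for a reservation against `ToyPool`.
struct ToyReservation {
    pool: Rc<ToyPool>,
    size: usize,
}

impl ToyReservation {
    fn new(pool: Rc<ToyPool>) -> Self {
        Self { pool, size: 0 }
    }

    /// Try to account for `bytes` more; fail if the pool limit would be exceeded.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        let used = self.pool.used.get();
        if used + bytes > self.pool.limit {
            return Err(format!(
                "cannot reserve {} bytes: {} of {} bytes already used",
                bytes, used, self.pool.limit
            ));
        }
        self.pool.used.set(used + bytes);
        self.size += bytes;
        Ok(())
    }

    /// Return up to `bytes` of this reservation to the pool.
    fn shrink(&mut self, bytes: usize) {
        let bytes = bytes.min(self.size);
        self.size -= bytes;
        self.pool.used.set(self.pool.used.get() - bytes);
    }
}

impl Drop for ToyReservation {
    /// Give back anything still held when the reservation goes away.
    fn drop(&mut self) {
        let held = self.size;
        self.shrink(held);
    }
}

/// Hypothetical splitter that keeps the large batch accounted for while it
/// hands out slices of at most `batch_size` rows.
struct AccountedSplitter {
    rows: Vec<u64>, // stand-in for the giant output batch
    offset: usize,
    batch_size: usize,
    reservation: ToyReservation,
}

impl AccountedSplitter {
    fn try_new(
        rows: Vec<u64>,
        batch_size: usize,
        mut reservation: ToyReservation,
    ) -> Result<Self, String> {
        assert!(batch_size > 0, "batch_size must be non-zero");
        // Register the whole batch with the pool before splitting it.
        reservation.try_grow(rows.len() * std::mem::size_of::<u64>())?;
        Ok(Self { rows, offset: 0, batch_size, reservation })
    }
}

impl Iterator for AccountedSplitter {
    type Item = Vec<u64>;

    fn next(&mut self) -> Option<Vec<u64>> {
        if self.offset >= self.rows.len() {
            // All slices emitted: release the accounting for the large batch.
            let held = self.reservation.size;
            self.reservation.shrink(held);
            return None;
        }
        let end = (self.offset + self.batch_size).min(self.rows.len());
        let chunk = self.rows[self.offset..end].to_vec();
        self.offset = end;
        Some(chunk)
    }
}
```

Note that in this sketch the large batch already exists by the time `try_new` runs, so the reservation only makes the pool aware of memory that has been allocated; it does not prevent the allocation. A real change would also need to decide how the join should react when `try_grow` fails.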

Additional context

No response

alamb avatar Oct 18 '24 13:10 alamb

Can I work on this task, @alamb?

jatin510 avatar Oct 18 '24 18:10 jatin510

@jatin510 of course -- see the guide here https://datafusion.apache.org/contributor-guide/index.html#open-contribution-and-assigning-tickets !

alamb avatar Oct 20 '24 12:10 alamb

take

jatin510 avatar Oct 24 '24 14:10 jatin510