Yijie Shen comments

Results: 29 comments of Yijie Shen

Suspicious slow test in Ballista

@Ted-Jiang any ideas on this?

Suspicious slow test in Ballista

https://github.com/apache/arrow-datafusion/runs/5838876700?check_suite_focus=true: `test cpu_bound_executor::tests::executor_shutdown_while_task_running has been running for over 60 seconds`. Our CI encountered this as well.

Consolidate GroupByHash implementations `row_hash.rs` and `hash.rs` (remove duplication)

I agree we need `List` support in Row since it's used by `ApproxPercentileCont`, `ArrayAgg`, `Distinct*`, etc., as state fields. `Struct` is not currently used as a state for existing accumulators,...

Consolidate GroupByHash implementations `row_hash.rs` and `hash.rs` (remove duplication)

Note on `RowType::WordAligned`, which is used as the grouping state for hash aggregation: Since the varlena field would expand its width as new updates come in, fields after varlena should...

Further refine the Top K sort operator

TL;DR: The issue is caused by "double" memory accounting for sliced batches in AggExec and TopkExec. The primary cause of resource exhaustion is incorrect memory accounting for record batches...

Further refine the Top K sort operator

From DataFusion's memory management perspective, I found that `get_slice_memory_size`, introduced in https://github.com/apache/arrow-rs/pull/3501, better serves our requirements. I suggest we have `RecordBatch::get_effective_memory_size()` in DF and use `get_slice_memory_size` to account for memory...
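To make the over-reporting concrete, here is a minimal sketch of the two accounting strategies. It is a toy model, not the arrow-rs or DataFusion API: `Buffer`, `Slice`, `buffer_memory_size`, and `slice_memory_size` are hypothetical names, and every value is assumed to be a fixed 8 bytes wide. It only illustrates why summing the backing allocation per slice overcounts, while a slice-aware size in the spirit of `get_slice_memory_size` does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buffer:
    """Hypothetical stand-in for a shared, reference-counted buffer."""
    capacity: int  # bytes allocated for the whole buffer


@dataclass(frozen=True)
class Slice:
    """Zero-copy view over `length` fixed-width values of a shared buffer."""
    buffer: Buffer
    offset: int
    length: int
    width: int = 8  # assumed bytes per value

    def buffer_memory_size(self) -> int:
        # Reports the whole backing allocation, however small the view is.
        return self.buffer.capacity

    def slice_memory_size(self) -> int:
        # Reports only the bytes this view actually covers.
        return self.length * self.width


# One large output batch (1,000,000 values) cut into ten 1,000-row slices.
big = Buffer(capacity=8 * 1_000_000)
slices = [Slice(big, offset=i * 1_000, length=1_000) for i in range(10)]

per_buffer = sum(s.buffer_memory_size() for s in slices)
per_slice = sum(s.slice_memory_size() for s in slices)

print(per_buffer)  # 80000000: the same allocation is counted ten times
print(per_slice)   # 80000: only the bytes the slices reference
```

In this model the buffer-based total is ten times the real allocation of 8,000,000 bytes, which is the kind of inflated reservation that can trip a memory limit even though nothing extra was allocated.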

Further refine the Top K sort operator

I agree that the core problem in this issue is memory accounting, and that the most over-reported batch slices would come from AggExec's single output record batch. But I also believe...

Memory account not adding up in SortExec

> Through examining the current implementation of multi-column sort's spill-to-disk strategies, I find we are asking for more memory during spill, which I think is worth discussing: During the spill,...

Memory account not adding up in SortExec

Another point of code worth noticing is inside the current `sort_batch` implementation: https://github.com/apache/datafusion/blob/79fa6f9098be9a6e5b269cd3642694765b230ff1/datafusion/physical-plan/src/sorts/sort.rs#L601-L607 Performance-wise, I think it's beneficial to apply the row format comparison to all multi-column cases; however, while...
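For readers unfamiliar with row format comparison, the idea can be sketched as follows: each multi-column key is encoded into a single byte string whose plain byte order matches the lexicographic order of the columns, so a sort needs one byte comparison per pair instead of one comparison per column. This is a simplified illustration, not the encoding used by arrow-rs or by `sort_batch`: `encode_i64`, `encode_str`, and `encode_row` are hypothetical helpers, and strings are assumed to contain no NUL bytes and no nulls.

```python
import struct


def encode_i64(v: int) -> bytes:
    # Shift by 2**63 (equivalent to flipping the sign bit) so that
    # big-endian unsigned byte order matches signed integer order.
    return struct.pack(">Q", v + (1 << 63))


def encode_str(s: str) -> bytes:
    raw = s.encode("utf-8")
    if b"\x00" in raw:
        raise ValueError("this sketch assumes strings without NUL bytes")
    # The terminator makes a prefix ("a") sort before its extensions ("ab").
    return raw + b"\x00"


def encode_row(a: int, b: str) -> bytes:
    # Concatenate the order-preserving encodings, leading column first.
    return encode_i64(a) + encode_str(b)


rows = [(2, "b"), (-1, "zz"), (2, "ab"), (2, "a"), (-1, "z")]

by_columns = sorted(rows)                              # column-by-column compare
by_bytes = sorted(rows, key=lambda r: encode_row(*r))  # single byte compare

print(by_bytes == by_columns)  # True
```

The trade-off is that encoding costs time and memory up front, so whether it pays off depends on the number and types of sort columns.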