Report splits to DataFusion as partitions

Open gatesn opened this issue 1 year ago • 3 comments

Nov 28 '24 18:11 gatesn

@robert3005 do you think we want to use this + InlineExecutor?

Jan 23 '25 11:01 gatesn

I think if you do this you might still want a threaded runtime (my point of reference is that in spark parquet still spins up a forkjoinpool per task), just that each one of those partitions will have fewer row splits. Fwiw this is mostly useful in case of failures since I assume datafusion will retry only failing partitions.

Jan 23 '25 11:01 robert3005

The benefit of this is that DataFusion can choose to parallelise the rest of the execution plan, whereas in the current model, we do use all threads of the runtime to scan a file, but DataFusion sees a serialized sequence of record batches when in reality the ordering often doesn't matter at all.

Apr 14 '25 08:04 gatesn