datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Example for using a separate threadpool for CPU bound work (try 3)

Open alamb opened this issue 6 months ago • 4 comments

Note: This PR contains an example and supporting code. It has no changes to the core.

Which issue does this PR close?

  • Closes https://github.com/apache/datafusion/issues/12393

  • Note this is new version version of https://github.com/apache/datafusion/pull/14286

Rationale for this change

I have heard from multiple people multiple times over multiple years that the specifics of using multiple threadpools for separate CPU and IO work in DataFusion is confusing.

They are not wrong, and it is a key detail for building low latency, high performance engines which process data directly from remote storage, which I think is a key capability for DataFusion

My past attempts in https://github.com/apache/datafusion/pull/13424 and https://github.com/apache/datafusion/pull/14286 to make this example have been bogged down trying to get consensus on details of how to transfer results across streams, the wisdom of wrapping streams, and other details. Thankfully, thanks to @tustvold and @ion-elgreco there is now a much better solution in ObjectStore 0.12.1: https://github.com/apache/arrow-rs-object-store/pull/332

What changes are included in this PR?

  1. thread_pools.rs example
  2. Update documentation

Are these changes tested?

Yes the example is run as part of CI and there are tests

Are there any user-facing changes?

Not really

alamb avatar Jun 08 '25 16:06 alamb

FYI @pepijnve this is perhaps relevant to you as you have been working closely with tokio / tokio runtimes

alamb avatar Jun 17 '25 16:06 alamb

@alamb thanks for the pointer.

I've been digging through all the work that has already been done, looking particularly at the rayon experiments. Rayon's idea of "yield now" is particularly interesting. Rather than unwinding the stack, it will actually try to run other work immediately underneath the yield_now call. I wonder if this approach could lead to stack overflows when you have lots of yielding tasks.

Related to this work specifically, is there any guidance on using something like Tokio console with DataFusion? I was about to start experimenting with it myself, but if anyone else has written up some instructions or experiences that might save me some time.

pepijnve avatar Jun 17 '25 16:06 pepijnve

@alamb thanks for the pointer.

Related to this work specifically, is there any guidance on using something like Tokio console with DataFusion? I was about to start experimenting with it myself, but if anyone else has written up some instructions or experiences that might save me some time.

I know we have used it at INfluxData

I vaguely remember others reporting they have too (maybe @adriangb ) but I don't know of any writeups (it would be another great doc contribution if you figure it out ) :)

alamb avatar Jun 17 '25 16:06 alamb

Related to this work specifically, is there any guidance on using something like Tokio console with DataFusion? I was about to start experimenting with it myself, but if anyone else has written up some instructions or experiences that might save me some time.

I've used Tokio console in the past in my app to see if I could find the cause of some app slowdowns. It worked fine but didn't help me personally find the issue.

Omega359 avatar Jun 17 '25 16:06 Omega359

Thanks @adriangb and @Omega359 for the help with this one (and to @tustvold for / @ion-elgreco for the underlying feature)

It's taken a while but we have made it

alamb avatar Jun 23 '25 10:06 alamb