datafusion
Example for using a separate threadpool for CPU bound work (try 3)
Note: This PR contains an example and supporting code. It has no changes to the core.
Which issue does this PR close?
- Closes https://github.com/apache/datafusion/issues/12393

Note: this is a new version of https://github.com/apache/datafusion/pull/14286
Rationale for this change
I have heard from multiple people multiple times over multiple years that the specifics of using multiple threadpools for separate CPU and IO work in DataFusion are confusing.
They are not wrong, and it is a key detail for building low-latency, high-performance engines that process data directly from remote storage, which I think is a key capability for DataFusion.
My past attempts in https://github.com/apache/datafusion/pull/13424 and https://github.com/apache/datafusion/pull/14286 to make this example got bogged down trying to reach consensus on how to transfer results across streams, the wisdom of wrapping streams, and other details. Fortunately, thanks to @tustvold and @ion-elgreco, there is now a much better solution in ObjectStore 0.12.1: https://github.com/apache/arrow-rs-object-store/pull/332
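For readers new to the idea, here is a minimal sketch of what "a separate threadpool for CPU bound work" means. It uses only the standard library, so it is not the code in this PR (which uses Tokio runtimes) and not the ObjectStore mechanism linked above. `CpuPool`, `spawn`, and `sum_of_squares_on_pool` are hypothetical names made up for illustration.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// A boxed unit of CPU-bound work.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical dedicated pool for CPU-bound work (not a DataFusion type).
pub struct CpuPool {
    sender: Option<Sender<Job>>,
    workers: Vec<JoinHandle<()>>,
}

impl CpuPool {
    /// Start `num_threads` worker threads that pull jobs from a shared queue.
    pub fn new(num_threads: usize) -> Self {
        let (sender, receiver) = channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers = (0..num_threads)
            .map(|i| {
                let receiver = Arc::clone(&receiver);
                thread::Builder::new()
                    .name(format!("cpu-worker-{i}"))
                    .spawn(move || loop {
                        // The lock is held only while waiting for the next job,
                        // not while the job runs.
                        let job = receiver.lock().unwrap().recv();
                        match job {
                            Ok(job) => job(),
                            Err(_) => break, // queue closed: shut down
                        }
                    })
                    .expect("failed to spawn worker thread")
            })
            .collect();
        Self { sender: Some(sender), workers }
    }

    /// Run `f` on the pool. The returned receiver yields the result, so the
    /// calling (I/O) thread is free to do other work in the meantime.
    pub fn spawn<T, F>(&self, f: F) -> Receiver<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, rx) = channel();
        let job: Job = Box::new(move || {
            // The caller may have dropped the receiver; ignore that case.
            let _ = tx.send(f());
        });
        self.sender
            .as_ref()
            .expect("pool is shut down")
            .send(job)
            .expect("all workers have exited");
        rx
    }
}

impl Drop for CpuPool {
    fn drop(&mut self) {
        // Dropping the only sender closes the queue, which stops the workers.
        drop(self.sender.take());
        for worker in self.workers.drain(..) {
            let _ = worker.join();
        }
    }
}

/// Example of CPU-bound work handed off to the pool.
pub fn sum_of_squares_on_pool(pool: &CpuPool, n: u64) -> u64 {
    pool.spawn(move || (1..=n).map(|x| x * x).sum::<u64>())
        .recv()
        .expect("worker dropped the result")
}
```

The point of the sketch is only the shape: the heavy computation runs on threads named `cpu-worker-*`, and the thread that submitted it gets the result back over a channel instead of being blocked by the computation itself.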
What changes are included in this PR?
- thread_pools.rs example
- Update documentation
Are these changes tested?
Yes, the example is run as part of CI, and there are tests.
Are there any user-facing changes?
Not really
FYI @pepijnve this is perhaps relevant to you as you have been working closely with tokio / tokio runtimes
@alamb thanks for the pointer.
I've been digging through all the work that has already been done, looking particularly at the rayon experiments. Rayon's idea of "yield now" is particularly interesting. Rather than unwinding the stack, it will actually try to run other work immediately underneath the yield_now call. I wonder if this approach could lead to stack overflows when you have lots of yielding tasks.
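To illustrate the concern, here is a toy model of a "run other work underneath the yield" scheduler. It is a single-threaded sketch using only the standard library and is not Rayon's implementation; `toy_yield_now`, `run_task`, and `max_nesting_with_yielding_tasks` are made-up names. It only counts how deeply task frames end up nested on one stack.

```rust
use std::cell::{Cell, RefCell};
use std::collections::VecDeque;

type Task = Box<dyn FnOnce()>;

thread_local! {
    // Pending tasks for this (single) toy worker.
    static QUEUE: RefCell<VecDeque<Task>> = RefCell::new(VecDeque::new());
    // How many task frames are currently nested on the stack.
    static DEPTH: Cell<usize> = Cell::new(0);
    // The deepest nesting observed so far.
    static MAX_DEPTH: Cell<usize> = Cell::new(0);
}

/// Run one task while tracking how deeply task frames are nested.
fn run_task(task: Task) {
    let depth = DEPTH.with(|d| {
        d.set(d.get() + 1);
        d.get()
    });
    MAX_DEPTH.with(|m| m.set(m.get().max(depth)));
    task();
    DEPTH.with(|d| d.set(d.get() - 1));
}

/// Toy yield: instead of unwinding, run the next queued task right here,
/// on top of the caller's stack frame.
fn toy_yield_now() {
    // The queue borrow ends before the task runs, so tasks may yield too.
    let next = QUEUE.with(|q| q.borrow_mut().pop_front());
    if let Some(task) = next {
        run_task(task);
    }
}

/// Queue `num_tasks` tasks that each yield once, drain the queue, and
/// report the deepest nesting of task frames that was reached.
pub fn max_nesting_with_yielding_tasks(num_tasks: usize) -> usize {
    QUEUE.with(|q| q.borrow_mut().clear());
    DEPTH.with(|d| d.set(0));
    MAX_DEPTH.with(|m| m.set(0));

    for _ in 0..num_tasks {
        let task: Task = Box::new(|| toy_yield_now());
        QUEUE.with(|q| q.borrow_mut().push_back(task));
    }

    // Top-level worker loop: run tasks until the queue is empty.
    loop {
        let next = QUEUE.with(|q| q.borrow_mut().pop_front());
        match next {
            Some(task) => run_task(task),
            None => break,
        }
    }
    MAX_DEPTH.with(|m| m.get())
}
```

In this toy model the nesting depth grows linearly with the number of tasks that are all waiting in a yield at the same time. Whether that matters in practice for Rayon depends on details this sketch does not capture.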
Related to this work specifically, is there any guidance on using something like Tokio console with DataFusion? I was about to start experimenting with it myself, but if anyone else has written up some instructions or experiences that might save me some time.
> Related to this work specifically, is there any guidance on using something like Tokio console with DataFusion? I was about to start experimenting with it myself, but if anyone else has written up some instructions or experiences that might save me some time.
I know we have used it at InfluxData.
I vaguely remember others reporting they have too (maybe @adriangb), but I don't know of any writeups (it would be another great doc contribution if you figure it out) :)
> Related to this work specifically, is there any guidance on using something like Tokio console with DataFusion? I was about to start experimenting with it myself, but if anyone else has written up some instructions or experiences that might save me some time.
I've used Tokio console in the past in my app to see if I could find the cause of some app slowdowns. It worked fine but didn't help me personally find the issue.
Thanks @adriangb and @Omega359 for the help with this one (and to @tustvold and @ion-elgreco for the underlying feature).
It's taken a while, but we have made it.