bench spawn, spawn_blocking, unblock
I noticed we just use `tokio::spawn` for our `spawn_cpu` tasks in the builtin runtime. I threw together a simple microbench; I'm wondering if we should use `unblock` or `spawn_blocking` instead?
We use `spawn` because it follows DataFusion's CPU scheduling logic and makes the most sense.
`spawn_blocking` isn't really for CPU-heavy workloads. It's for I/O-bound but blocking workloads. In other words, it's OK to have 500 threads running these tasks.
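To make that distinction concrete, here's a hypothetical std-only sketch (not Vortex code) of why oversubscription is fine for blocking-but-idle work: threads parked in a wait cost almost no CPU, which is exactly the workload `spawn_blocking` targets.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Simulate n "blocking I/O" tasks, each parked in a wait rather than
// burning CPU -- the kind of task spawn_blocking is meant for.
fn run_blocking_waits(n: usize, wait: Duration) -> Duration {
    let start = Instant::now();
    let handles: Vec<_> = (0..n)
        .map(|_| thread::spawn(move || thread::sleep(wait)))
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    start.elapsed()
}

fn main() {
    // All 100 waits overlap: wall time is roughly one wait, not 100 * 50ms,
    // and the parked threads leave the CPUs free for real work.
    let elapsed = run_blocking_waits(100, Duration::from_millis(50));
    println!("elapsed: {elapsed:?}");
}
```

The same experiment with 100 CPU-bound loops instead of sleeps would behave very differently: the threads would fight over the cores rather than overlap for free.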
Codecov Report
:x: Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 85.04%. Comparing base (69ef61d) to head (35af782).
:warning: Report is 30 commits behind head on develop.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| vortex-io/src/runtime/tokio.rs | 50.00% | 1 Missing :warning: |
> `spawn_blocking` isn't really for CPU-heavy workloads. It's for I/O-bound but blocking workloads.
I thought it was explicitly for CPU work, and that's why the `spawn_blocking` interface takes a function and not a `Future`? If you spawn enough expensive work to block all the runtime threads, you're going to start crushing your tail latencies, won't you?
🤷 it's a bit of both?
From the Tokio docs for `spawn_blocking`:

> Tokio will spawn more blocking threads when they are requested through this function until the upper limit configured on the Builder is reached. This limit is very large by default, because spawn_blocking is often used for various kinds of IO operations that cannot be performed asynchronously. When you run CPU-bound code using spawn_blocking, you should keep this large upper limit in mind; to run your CPU-bound computations on only a few threads, you should use a separate thread pool such as rayon rather than configuring the number of blocking threads.
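A minimal std-only sketch of what those docs suggest: a fixed-size pool (rayon gives you this off the shelf) that caps CPU-bound work at a few threads instead of Tokio's large blocking-thread limit. `CpuPool` and its methods are hypothetical names for illustration, not a Vortex or Tokio API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

// Hypothetical fixed-size CPU pool: CPU-bound jobs run on exactly `size`
// threads, independent of how many blocking threads the runtime allows.
struct CpuPool {
    tx: mpsc::Sender<Job>,
    workers: Vec<thread::JoinHandle<()>>,
}

impl CpuPool {
    fn new(size: usize) -> Self {
        let (tx, rx) = mpsc::channel::<Job>();
        let rx = Arc::new(Mutex::new(rx));
        let workers = (0..size)
            .map(|_| {
                let rx = Arc::clone(&rx);
                thread::spawn(move || loop {
                    // The mutex guard is dropped before the job runs, so
                    // workers only contend on the channel, not on each
                    // other's jobs.
                    let job = match rx.lock().unwrap().recv() {
                        Ok(job) => job,
                        Err(_) => break, // sender dropped: shut down
                    };
                    job();
                })
            })
            .collect();
        CpuPool { tx, workers }
    }

    fn spawn_cpu(&self, job: impl FnOnce() + Send + 'static) {
        self.tx.send(Box::new(job)).unwrap();
    }

    fn shutdown(self) {
        drop(self.tx); // close the channel so workers exit their loops
        for w in self.workers {
            w.join().unwrap();
        }
    }
}

fn main() {
    let pool = CpuPool::new(2);
    let done = Arc::new(AtomicUsize::new(0));
    for _ in 0..8 {
        let done = Arc::clone(&done);
        pool.spawn_cpu(move || {
            done.fetch_add(1, Ordering::SeqCst); // stand-in for compressing a chunk
        });
    }
    pool.shutdown();
    println!("8 CPU jobs ran on 2 dedicated threads: {}", done.load(Ordering::SeqCst));
}
```

In an async context you'd complete the picture with a oneshot channel back to the awaiting task, which is essentially what `unblock` does internally.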
That's why I kept them separate in the Vortex Handle interface, so we could route them to different locations later. I wouldn't be against e.g. TokioRuntime holding an optional second handle to a different runtime / thread pool that we use for CPU tasks.
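As a sketch of that routing idea (the names here are hypothetical, not the actual `Handle` API): call sites only ever see `spawn_cpu`, so the executor behind it can change later without touching them.

```rust
use std::thread;

// Hypothetical handle: CPU tasks go through their own entry point, so the
// backing executor can be swapped later without changing call sites.
struct Runtime {
    // false => run CPU tasks where async tasks run (today's tokio::spawn);
    // true  => hand off to a dedicated pool (stood in for by a raw thread).
    dedicated_cpu: bool,
}

impl Runtime {
    fn spawn_cpu<T, F>(&self, f: F) -> T
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        if self.dedicated_cpu {
            // Stand-in for handing off to a separate CPU thread pool.
            thread::spawn(f).join().unwrap()
        } else {
            // Stand-in for running on the async worker threads.
            f()
        }
    }
}

fn main() {
    // Same call site, either routing policy.
    for dedicated_cpu in [false, true] {
        let rt = Runtime { dedicated_cpu };
        assert_eq!(rt.spawn_cpu(|| (0..100u64).sum::<u64>()), 4950);
    }
    println!("spawn_cpu routed both ways");
}
```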
I set up a simple test case, streaming hits_83.parquet from file through the writer into a Vortex file in tmpfs: https://github.com/vortex-data/vortex/pull/5187/files#diff-8d942868d0453754f6247ebafffd3ea2c08c5c051dbe7c9536619c026cd9548d
There were 3 knobs:
- File system: S3 (~100ms avg latency from my laptop) vs local FS
- Use of `spawn` vs `unblock` for executing the CPU tasks
- Number of Tokio worker threads (1, 2, 4, 8)
The results for the S3 test:
In chart form:

Results for the Local FS test:

In chart form:

For the smaller worker thread counts, the `unblock` version performs considerably better, which isn't too surprising: that's the situation where you're most likely to get worker-thread starvation from the CPU tasks.
With some println debugging, it appears that the CPU tasks spawned inside the CompressingStrategy can take > 5ms to compress a chunk, which in `spawn` mode is time spent blocking the executor. We spawn a task per chunk per column, so the number of blocking tasks grows with the width of the schema.
The performance difference goes away with larger thread counts, but that doesn't mean it isn't structurally present; it just means we haven't yet thrown a workload at it that the default number of Tokio worker threads can't absorb.