
Replace `future::executor::block_on` with tokio

Open · haixuanTao opened this issue 3 months ago · 1 comment

I was tinkering with the drain functionality and identified that `future::executor::block_on` spawns a new thread pool, which is very slow. By replacing it with the existing Tokio thread pool that we already use for telemetry, we can remove almost ~1 ms of latency per message, making dora able to keep up with nodes running at 1 kHz.

On top of that, it reduces the number of CPU threads, which lowers the CPU load of running dora and enables faster processing.

Comparing performance in Rust

cargo run --example benchmark --release
rust-sink: stdout    Latency:
rust-sink: stdout    size 0x0     : 361.003µs
rust-sink: stdout    size 0x8     : 244.002µs
rust-sink: stdout    size 0x40    : 217.002µs
rust-sink: stdout    size 0x200   : 136.002µs
rust-sink: stdout    size 0x800   : 242.003µs
rust-sink: stdout    size 0x1000  : 218.002µs
rust-sink: stdout    size 0x4000  : 168.004µs
rust-sink: stdout    size 0xa000  : 298.004µs
rust-sink: stdout    size 0x64000 : 178.002µs
rust-sink: stdout    size 0x3e8000: 259.001µs
rust-sink: stdout    Throughput:
rust-sink: stdout    size 0x0     : 29289 messages per second
rust-sink: stdout    size 0x8     : 63130 messages per second
rust-sink: stdout    size 0x40    : 49227 messages per second
rust-sink: stdout    size 0x200   : 22967 messages per second
rust-sink: stdout    size 0x800   : 34153 messages per second
rust-sink: stdout    size 0x1000  : 19395 messages per second
rust-sink: stdout    size 0x4000  : 22668 messages per second
rust-sink: stdout    size 0xa000  : 27095 messages per second
rust-sink: stdout    size 0x64000 : 7181 messages per second
rust-sink: stdout    size 0x3e8000: 1372 messages per second

Compared to `main`:

2025-11-04T09:37:15.123888Z  INFO dora_daemon::log:    Latency:
2025-11-04T09:37:15.142632Z  INFO dora_daemon::log:    size 0x0     : 698.001µs
2025-11-04T09:37:15.155173Z  INFO dora_daemon::log:    size 0x8     : 671.002µs
2025-11-04T09:37:15.167162Z  INFO dora_daemon::log:    size 0x40    : 641.002µs
2025-11-04T09:37:15.179327Z  INFO dora_daemon::log:    size 0x200   : 611.001µs
2025-11-04T09:37:15.191265Z  INFO dora_daemon::log:    size 0x800   : 672.003µs 
2025-11-04T09:37:15.203911Z  INFO dora_daemon::log:    size 0x1000  : 641.001µs
2025-11-04T09:37:15.216509Z  INFO dora_daemon::log:    size 0x4000  : 650.004µs
2025-11-04T09:37:15.229286Z  INFO dora_daemon::log:    size 0xa000  : 617.004µs
2025-11-04T09:37:15.243167Z  INFO dora_daemon::log:    size 0x64000 : 656.001µs
2025-11-04T09:37:17.257741Z  INFO dora_daemon::log:    size 0x3e8000: 762.001µs
2025-11-04T09:37:17.263585Z  INFO dora_daemon::log:    Throughput: 
2025-11-04T09:37:17.299255Z  INFO dora_daemon::log:    size 0x0     : 2406 messages per second 
2025-11-04T09:37:19.302567Z  INFO dora_daemon::log:    size 0x8     : 2493 messages per second
2025-11-04T09:37:21.307250Z  INFO dora_daemon::log:    size 0x40    : 2558 messages per second
2025-11-04T09:37:23.315167Z  INFO dora_daemon::log:    size 0x200   : 2498 messages per second 
2025-11-04T09:37:25.320683Z  INFO dora_daemon::log:    size 0x800   : 2478 messages per second 
2025-11-04T09:37:27.335201Z  INFO dora_daemon::log:    size 0x1000  : 2383 messages per second 
2025-11-04T09:37:29.343684Z  INFO dora_daemon::log:    size 0x4000  : 2431 messages per second 
2025-11-04T09:37:31.349740Z  INFO dora_daemon::log:    size 0xa000  : 2507 messages per second 
2025-11-04T09:37:33.367012Z  INFO dora_daemon::log:    size 0x64000 : 2462 messages per second 
2025-11-04T09:37:35.404791Z  INFO dora_daemon::log:    size 0x3e8000: 1379 messages per second

Comparing performance in Python on dora-benchmark

This branch:

Date,Language,Dora Version,Platform,Name,Size (bit),Latency (μs)
2025-11-04 17:10:05.452307,Python 3.12.8,0.3.13,macOS-15.3.2-arm64-arm-64bit,dora Node,64,880.2691700000001
2025-11-04 17:10:05.452307,Python 3.12.8,0.3.13,macOS-15.3.2-arm64-arm-64bit,dora Node,512,527.42712
2025-11-04 17:10:05.452307,Python 3.12.8,0.3.13,macOS-15.3.2-arm64-arm-64bit,dora Node,4096,665.76794
2025-11-04 17:10:05.452307,Python 3.12.8,0.3.13,macOS-15.3.2-arm64-arm-64bit,dora Node,40960,741.38922
2025-11-04 17:10:05.452307,Python 3.12.8,0.3.13,macOS-15.3.2-arm64-arm-64bit,dora Node,409600,701.74375
2025-11-04 17:10:05.452307,Python 3.12.8,0.3.13,macOS-15.3.2-arm64-arm-64bit,dora Node,4096000,828.26458
2025-11-04 17:10:05.452307,Python 3.12.8,0.3.13,macOS-15.3.2-arm64-arm-64bit,dora Node,40960000,2987.7891000000004

Main:

Date,Language,Dora Version,Platform,Name,Size (bit),Latency (μs)
2025-11-04 15:55:40.329917,Python 3.12.8,0.3.13,macOS-15.3.2-arm64-arm-64bit,dora Node,64,1545.89127
2025-11-04 15:55:40.329917,Python 3.12.8,0.3.13,macOS-15.3.2-arm64-arm-64bit,dora Node,512,1469.8870200000001
2025-11-04 15:55:40.329917,Python 3.12.8,0.3.13,macOS-15.3.2-arm64-arm-64bit,dora Node,4096,1696.42843
2025-11-04 15:55:40.329917,Python 3.12.8,0.3.13,macOS-15.3.2-arm64-arm-64bit,dora Node,40960,1393.2450200000003
2025-11-04 15:55:40.329917,Python 3.12.8,0.3.13,macOS-15.3.2-arm64-arm-64bit,dora Node,409600,1525.2416500000002
2025-11-04 15:55:40.329917,Python 3.12.8,0.3.13,macOS-15.3.2-arm64-arm-64bit,dora Node,4096000,1551.5992800000004
2025-11-04 15:55:40.329917,Python 3.12.8,0.3.13,macOS-15.3.2-arm64-arm-64bit,dora Node,40960000,3403.43585

haixuanTao avatar Nov 04 '25 09:11 haixuanTao

Interesting, thanks for looking into this!

identified that `future::executor::block_on` spawns a new thread pool, which is very slow

Are you sure that this is what happens? I just took a look at the source code and it seems to just run the given future on the current thread, without spawning any new threads:

  • block_on calls run_executor after pinning: https://docs.rs/futures-executor/0.3.31/src/futures_executor/local_pool.rs.html#314-317
  • run_executor sets an atomic flag through enter, then sets up a waker in a thread-local static to park/unpark the thread if needed: https://docs.rs/futures-executor/0.3.31/src/futures_executor/local_pool.rs.html#78-103
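As a minimal illustration of that park/unpark scheme, a simplified current-thread `block_on` can be written with std's `Wake` trait. This is a sketch of the mechanism only, not the actual `futures-executor` code (which additionally uses an `enter` guard and a thread-local waker):

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// A waker that unparks the thread blocked inside `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

// Simplified current-thread executor: poll the future in a loop and park
// the thread between polls until the waker fires. No threads are spawned.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(),
        }
    }
}

fn main() {
    // An already-ready future completes without ever parking.
    let value = block_on(async { 21 * 2 });
    println!("{value}"); // 42
}
```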

The docs for the block_on function also don't say that anything is spawned or that any thread pool is created:

Run a future to completion on the current thread.

This function will block the caller until the given future has completed.

So the open question is: why is the tokio `block_on` function so much faster? I think it would be good to understand this; it might give us some insight into how to improve performance further.

phil-opp avatar Nov 05 '25 10:11 phil-opp

This needs a rebase to resolve the conflicts and remove the commits of the drain PR.

phil-opp avatar Nov 20 '25 11:11 phil-opp