Performance regression in 1.45.0
Version: 1.45.0. I also tried 1.48.0 and the regression is still present; 1.44.x and 1.43.x are fine.
Platform
Linux latte-perf-debug-dev-loader-node-b2e23059-1 6.14.0-1017-aws #17~24.04.1-Ubuntu SMP Wed Nov 5 10:48:17 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Description
Tokio 1.45.0 and up introduces a significant performance regression, which we noticed using (our fork of) the Scylla/Cassandra benchmarking tool latte.
Results for latte 0.32.0 with tokio 1.44.2
ubuntu@latte-perf-debug-dev-loader-node-b2e23059-1:~/latte_karol$ ./latte_0.32.0-tokio1.44.2 run ~/latte/simple.rn 10.0.10.196 -P keyspace="\"lattetablets\"" -P tablets=true -q -P row_count=50100100 -d=30s --consistency=LOCAL_QUORUM --threads=60 --concurrency=64 --connections=1 --s>
info: Loading workload script /home/ubuntu/latte/simple.rn...
info: Connecting to ["10.0.10.196"]...
info: Connected to '' cluster running ScyllaDB 2025.3.3-0.20251029.0e6381f14db2 with build-id 05d55ab705a3d90b467774912f5eaa5ff495f1ee
info: Preparing...
info: Warming up...
info: Running benchmark...
CONFIG ═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Date Tue, 25 Nov 2025
Time 14:49:48 +0000
Cluster
Datacenter
Rack
DB version ScyllaDB 2025.3.3-0.20251029.0e6381f14db2 with build-id 05d55ab705a3d90b467774912f5eaa5ff495f1ee
Workload simple.rn
Function(s) write:1, read:4
Consistency LocalQuorum
Tags
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-P keyspace "lattetablets"
-P row_count 50100100
-P tablets true
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Threads 60
Connections 1
Concurrency [req] 64
Max rate [op/s]
Warmup [s]
└─ [op] 1
Run time [s] 30.0
└─ [op]
Sampling [s] 5.0
└─ [op]
Request timeout [s] 7
Retries
┌──────┴number 10
├─min interval [ms] 1000
└─max interval [ms] 5000
LOG ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Time Cycles Errors Thrpt. ────────────────────────────────── Latency [ms/op] ──────────────────────────────
[s] [op] [op] [op/s] Min 50 75 90 95 99 99.9 Max
5.002 7464708 0 1492498 0.216 2.232 3.049 3.916 4.608 8.593 34.898 65.831
10.000 7576099 0 1515262 0.235 2.265 3.056 3.863 4.456 7.115 30.654 79.430
15.002 7612968 0 1522633 0.225 2.241 3.031 3.836 4.424 7.234 32.784 77.529
20.007 7605142 0 1520996 0.205 2.243 3.025 3.811 4.383 7.205 34.111 73.400
25.000 7538566 0 1507792 0.217 2.245 3.049 3.883 4.510 7.578 33.047 81.920
30.003 7524431 0 1504956 0.220 2.267 3.074 3.906 4.522 7.352 32.244 72.417
30.003 280 0 124601 0.838 15.254 24.396 28.377 31.162 33.161 34.734 34.734
SUMMARY STATS ══════════════════════════════════════════════════════════════════════════════════════════════════════════════
Elapsed time [s] 30.029
CPU time [s] 1293.994
CPU utilisation [%] 67.3
Cycles [op] 45322194
Errors [op] 0
└─ [%] 0.0
Requests [req] 45322194
└─ [req/op] 1.0
Retries [ret] 0
└─ [ret/req] 0.00000
Rows [row] 0
└─ [row/req] 0.0
Concurrency [req] 59 ± 1
└─ [%] 92
Throughput [op/s] 1509287 ± 31023
├─ [req/s] 1509287 ± 31023
└─ [row/s] 0 ± 0
Cycle latency [ms] 2.541 ± 0.002
Request latency [ms] 2.388 ± 0.002
CYCLE LATENCY for write [ms] ══════════════════════════════════════════════════════════════════════════════════════════════
Min 0.225 ± 0.032
25 1.542 ± 0.002
50 2.165 ± 0.003
75 2.943 ± 0.003
90 3.770 ± 0.004
95 4.387 ± 0.007
98 5.718 ± 0.020
99 7.328 ± 0.050
99.9 31.261 ± 0.517
99.99 51.577 ± 1.054
Max 77.857 ± 1.137
CYCLE LATENCY for read [ms] ═══════════════════════════════════════════════════════════════════════════════════════════════
Min 0.205 ± 0.021
25 1.612 ± 0.002
50 2.271 ± 0.002
75 3.072 ± 0.002
90 3.891 ± 0.003
95 4.506 ± 0.005
98 5.890 ± 0.013
99 7.471 ± 0.028
99.9 33.473 ± 0.267
99.99 54.559 ± 0.694
Max 81.920 ± 0.932
Results for latte 0.32.0 with tokio 1.45.0
ubuntu@latte-perf-debug-dev-loader-node-b2e23059-1:~/latte_karol$ ./latte_0.32.0-tokio1.45.0 run ~/latte/simple.rn 10.0.10.196 -P keyspace="\"lattetablets\"" -P tablets=true -q -P row_count=50100100 -d=30s --consistency=LOCAL_QUORUM --threads=60 --concurrency=64 --connections=1 --s>
info: Loading workload script /home/ubuntu/latte/simple.rn...
info: Connecting to ["10.0.10.196"]...
info: Connected to '' cluster running ScyllaDB 2025.3.3-0.20251029.0e6381f14db2 with build-id 05d55ab705a3d90b467774912f5eaa5ff495f1ee
info: Preparing...
info: Warming up...
info: Running benchmark...
CONFIG ═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Date Tue, 25 Nov 2025
Time 14:45:38 +0000
Cluster
Datacenter
Rack
DB version ScyllaDB 2025.3.3-0.20251029.0e6381f14db2 with build-id 05d55ab705a3d90b467774912f5eaa5ff495f1ee
Workload simple.rn
Function(s) write:1, read:4
Consistency LocalQuorum
Tags
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-P keyspace "lattetablets"
-P row_count 50100100
-P tablets true
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Threads 60
Connections 1
Concurrency [req] 64
Max rate [op/s]
Warmup [s]
└─ [op] 1
Run time [s] 30.0
└─ [op]
Sampling [s] 5.0
└─ [op]
Request timeout [s] 7
Retries
┌──────┴number 10
├─min interval [ms] 1000
└─max interval [ms] 5000
LOG ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Time Cycles Errors Thrpt. ────────────────────────────────── Latency [ms/op] ──────────────────────────────
[s] [op] [op] [op/s] Min 50 75 90 95 99 99.9 Max
5.006 1830326 0 365656 0.195 6.795 13.697 23.593 31.097 48.792 72.155 116.523
10.001 1838507 0 367722 0.192 6.812 13.640 23.527 30.999 47.809 69.730 96.928
15.010 1838173 0 367577 0.220 6.705 13.476 23.396 31.179 50.102 79.626 145.228
20.009 1837602 0 367612 0.211 6.722 13.541 23.495 31.244 49.545 74.973 114.885
^C 21.611 588675 0 367244 0.240 6.857 13.615 23.314 30.720 48.267 71.369 88.015
SUMMARY STATS ══════════════════════════════════════════════════════════════════════════════════════════════════════════════
Elapsed time [s] 21.633
CPU time [s] 226.948
CPU utilisation [%] 16.4
Cycles [op] 7933283
Errors [op] 0
└─ [%] 0.0
Requests [req] 7933283
└─ [req/op] 1.0
Retries [ret] 0
└─ [ret/req] 0.00000
Rows [row] 0
└─ [row/req] 0.0
Concurrency [req] 59 ± 0
└─ [%] 91
Throughput [op/s] 366727 ± 3603
├─ [req/s] 366727 ± 3603
└─ [row/s] 0 ± 0
Cycle latency [ms] 10.452 ± 0.017
Request latency [ms] 9.753 ± 0.018
CYCLE LATENCY for write [ms] ══════════════════════════════════════════════════════════════════════════════════════════════
Min 0.192 ± 0.054
25 3.803 ± 0.012
50 6.730 ± 0.025
75 13.451 ± 0.051
90 23.249 ± 0.096
95 30.835 ± 0.169
98 41.255 ± 0.220
99 48.660 ± 0.301
99.9 73.269 ± 0.916
99.99 97.124 ± 3.592
Max 142.606 ± 2.922
CYCLE LATENCY for read [ms] ═══════════════════════════════════════════════════════════════════════════════════════════════
Min 0.195 ± 0.018
25 3.799 ± 0.008
50 6.775 ± 0.016
75 13.623 ± 0.035
90 23.544 ± 0.061
95 31.162 ± 0.089
98 41.484 ± 0.149
99 49.054 ± 0.213
99.9 73.990 ± 0.639
99.99 98.042 ± 1.931
Max 145.228 ± 3.017
The interesting part is the throughput. With tokio 1.44.2 we can do about 1.5M op/s; with 1.45.0 it is only about 367k op/s, a roughly 75% reduction in performance!
I generated flamegraphs for the good and the bad run. (Flamegraph images were attached here: first the bad run, then the good run.)
Basically the only difference is a larger share of samples in tokio::time::sleep: 20% in the bad run vs 11.5% in the good one.
The only sleep-related change I see in 1.45 is https://github.com/tokio-rs/tokio/pull/7226, but I don't know Tokio well enough to conclusively say this is the cause.
Is that a known issue? Can I somehow help with debugging / fixing it?
What you are seeing is probably #6504. You mentioned #7226, which is a revert of #6534 that attempted to improve timer performance. Unfortunately, it was reverted because it was hurting performance for others.
There is an alternative PR at #7467 that is currently WIP. You can help by providing benchmarks.
I added this section to latte's Cargo.toml:
[patch.crates-io]
tokio = { git = "https://github.com/ADD-SP/tokio", branch = "add_sp/time-local-wheel" }
then executed cargo update tokio.
I verified that it worked:
cargo tree | grep tokio
│ │ ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c)
│ │ │ └── tokio-macros v2.6.0 (proc-macro) (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c)
│ ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│ ├── tokio-openssl v0.6.5
│ │ └── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
├── tokio-stream v0.1.17
│ └── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│ │ ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c)
│ │ ├── tokio-util v0.7.15
│ │ │ └── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│ │ ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│ │ ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│ │ └── tokio-native-tls v0.3.1
│ │ └── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│ ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│ ├── tokio-native-tls v0.3.1 (*)
└── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
I built latte and tested it. Unfortunately it did not help - I got about 340k req/s.
I believe @ADD-SP already implemented https://github.com/tokio-rs/tokio/pull/7467#issuecomment-3563928880, so you probably have to explicitly ask for the different timer implementation to use it.
Ah, I see. With tokio_unstable and .enable_alt_timer() I see the difference.
Stats from a few runs:
5.002 6389507 0 1277492 0.215 2.900 3.609 4.317 4.825 6.672 19.333 73.466
10.001 6886487 0 1377408 0.238 2.689 3.414 4.125 4.624 6.611 30.130 73.466
15.002 7779478 0 1555738 0.224 2.206 2.931 3.672 4.256 10.559 51.151 85.066
20.004 6958590 0 1391062 0.232 2.703 3.420 4.086 4.526 6.242 24.969 88.801
24.999 7673983 0 1535602 0.228 2.357 3.084 3.797 4.321 6.996 42.238 88.539
29.999 6979597 0 1396735 0.229 2.605 3.400 4.116 4.588 6.619 55.050 87.818
5.002 6359799 0 1271603 0.220 2.902 3.635 4.366 4.903 6.730 18.891 72.024
10.003 6867477 0 1373502 0.216 2.710 3.387 4.069 4.571 6.210 9.126 22.495
15.003 7182083 0 1436200 0.235 2.568 3.285 3.977 4.461 6.758 29.917 69.272
20.001 7532050 0 1506569 0.215 2.390 3.062 3.760 4.276 7.377 43.614 84.673
25.002 7936488 0 1587539 0.232 2.206 2.945 3.676 4.227 8.397 52.986 74.711
29.998 7493779 0 1499664 0.244 2.472 3.158 3.830 4.305 6.160 34.341 68.813
5.002 6526652 0 1304939 0.201 2.826 3.545 4.260 4.772 6.545 21.856 66.421
10.001 6651023 0 1330196 0.217 2.810 3.500 4.182 4.657 6.169 8.864 22.282
15.001 6762817 0 1352596 0.243 2.779 3.432 4.069 4.502 5.939 8.692 21.758
20.002 7089545 0 1417832 0.231 2.626 3.287 3.934 4.387 6.066 12.378 66.847
25.000 7204327 0 1440944 0.240 2.576 3.217 3.865 4.338 6.021 10.445 25.018
30.000 7337415 0 1468018 0.239 2.523 3.162 3.801 4.260 5.997 12.599 33.915
5.002 6674579 0 1334492 0.234 2.746 3.478 4.188 4.690 6.500 26.591 75.891
10.001 6775553 0 1355112 0.232 2.742 3.426 4.112 4.604 6.312 10.011 23.871
15.003 6955734 0 1391044 0.248 2.671 3.332 3.994 4.477 6.234 10.625 33.980
20.001 7098290 0 1419698 0.244 2.615 3.273 3.928 4.403 6.177 11.239 47.776
25.001 7204740 0 1440925 0.232 2.585 3.213 3.832 4.268 5.943 10.961 36.340
30.003 7226808 0 1445671 0.225 2.568 3.209 3.848 4.313 6.025 11.698 36.962
Each row represents stats from a 5-second fragment of a run. So it looks like performance is:
- much better than 1.45-1.48, so the regression is mostly gone
- slightly worse than 1.44 on average
- much more volatile than 1.44.2: with <=1.44.2, each 5-second fragment gave very similar results (fragments differed by 1-2k op/s), but now throughput varies from 1277k to 1535k op/s within a single execution.
This variation might be caused by the extra heap allocation of the alternative timer implementation.