Performance regression in 1.45.0
Version: 1.45.0. I also tried 1.48.0 and the regression is still present; 1.44.x and 1.43.x are fine.
Platform
Linux latte-perf-debug-dev-loader-node-b2e23059-1 6.14.0-1017-aws #17~24.04.1-Ubuntu SMP Wed Nov 5 10:48:17 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Description
Tokio 1.45.0 and up introduces a significant performance regression, which we noticed using (our fork of) the Scylla/Cassandra benchmarking tool latte.
Results for latte 0.32.0 with tokio 1.44.2
ubuntu@latte-perf-debug-dev-loader-node-b2e23059-1:~/latte_karol$ ./latte_0.32.0-tokio1.44.2 run ~/latte/simple.rn 10.0.10.196 -P keyspace="\"lattetablets\"" -P tablets=true -q -P row_count=50100100 -d=30s --consistency=LOCAL_QUORUM --threads=60 --concurrency=64 --connections=1 --s>
info: Loading workload script /home/ubuntu/latte/simple.rn...
info: Connecting to ["10.0.10.196"]...
info: Connected to '' cluster running ScyllaDB 2025.3.3-0.20251029.0e6381f14db2 with build-id 05d55ab705a3d90b467774912f5eaa5ff495f1ee
info: Preparing...
info: Warming up...
info: Running benchmark...
CONFIG ═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Date Tue, 25 Nov 2025
Time 14:49:48 +0000
Cluster
Datacenter
Rack
DB version ScyllaDB 2025.3.3-0.20251029.0e6381f14db2 with build-id 05d55ab705a3d90b467774912f5eaa5ff495f1ee
Workload simple.rn
Function(s) write:1, read:4
Consistency LocalQuorum
Tags
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-P keyspace "lattetablets"
-P row_count 50100100
-P tablets true
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Threads 60
Connections 1
Concurrency [req] 64
Max rate [op/s]
Warmup [s]
└─ [op] 1
Run time [s] 30.0
└─ [op]
Sampling [s] 5.0
└─ [op]
Request timeout [s] 7
Retries
┌──────┴number 10
├─min interval [ms] 1000
└─max interval [ms] 5000
LOG ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Time Cycles Errors Thrpt. ────────────────────────────────── Latency [ms/op] ──────────────────────────────
[s] [op] [op] [op/s] Min 50 75 90 95 99 99.9 Max
5.002 7464708 0 1492498 0.216 2.232 3.049 3.916 4.608 8.593 34.898 65.831
10.000 7576099 0 1515262 0.235 2.265 3.056 3.863 4.456 7.115 30.654 79.430
15.002 7612968 0 1522633 0.225 2.241 3.031 3.836 4.424 7.234 32.784 77.529
20.007 7605142 0 1520996 0.205 2.243 3.025 3.811 4.383 7.205 34.111 73.400
25.000 7538566 0 1507792 0.217 2.245 3.049 3.883 4.510 7.578 33.047 81.920
30.003 7524431 0 1504956 0.220 2.267 3.074 3.906 4.522 7.352 32.244 72.417
30.003 280 0 124601 0.838 15.254 24.396 28.377 31.162 33.161 34.734 34.734
SUMMARY STATS ══════════════════════════════════════════════════════════════════════════════════════════════════════════════
Elapsed time [s] 30.029
CPU time [s] 1293.994
CPU utilisation [%] 67.3
Cycles [op] 45322194
Errors [op] 0
└─ [%] 0.0
Requests [req] 45322194
└─ [req/op] 1.0
Retries [ret] 0
└─ [ret/req] 0.00000
Rows [row] 0
└─ [row/req] 0.0
Concurrency [req] 59 ± 1
└─ [%] 92
Throughput [op/s] 1509287 ± 31023
├─ [req/s] 1509287 ± 31023
└─ [row/s] 0 ± 0
Cycle latency [ms] 2.541 ± 0.002
Request latency [ms] 2.388 ± 0.002
CYCLE LATENCY for write [ms] ══════════════════════════════════════════════════════════════════════════════════════════════
Min 0.225 ± 0.032
25 1.542 ± 0.002
50 2.165 ± 0.003
75 2.943 ± 0.003
90 3.770 ± 0.004
95 4.387 ± 0.007
98 5.718 ± 0.020
99 7.328 ± 0.050
99.9 31.261 ± 0.517
99.99 51.577 ± 1.054
Max 77.857 ± 1.137
CYCLE LATENCY for read [ms] ═══════════════════════════════════════════════════════════════════════════════════════════════
Min 0.205 ± 0.021
25 1.612 ± 0.002
50 2.271 ± 0.002
75 3.072 ± 0.002
90 3.891 ± 0.003
95 4.506 ± 0.005
98 5.890 ± 0.013
99 7.471 ± 0.028
99.9 33.473 ± 0.267
99.99 54.559 ± 0.694
Max 81.920 ± 0.932
Results for latte 0.32.0 with tokio 1.45.0
ubuntu@latte-perf-debug-dev-loader-node-b2e23059-1:~/latte_karol$ ./latte_0.32.0-tokio1.45.0 run ~/latte/simple.rn 10.0.10.196 -P keyspace="\"lattetablets\"" -P tablets=true -q -P row_count=50100100 -d=30s --consistency=LOCAL_QUORUM --threads=60 --concurrency=64 --connections=1 --s>
info: Loading workload script /home/ubuntu/latte/simple.rn...
info: Connecting to ["10.0.10.196"]...
info: Connected to '' cluster running ScyllaDB 2025.3.3-0.20251029.0e6381f14db2 with build-id 05d55ab705a3d90b467774912f5eaa5ff495f1ee
info: Preparing...
info: Warming up...
info: Running benchmark...
CONFIG ═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Date Tue, 25 Nov 2025
Time 14:45:38 +0000
Cluster
Datacenter
Rack
DB version ScyllaDB 2025.3.3-0.20251029.0e6381f14db2 with build-id 05d55ab705a3d90b467774912f5eaa5ff495f1ee
Workload simple.rn
Function(s) write:1, read:4
Consistency LocalQuorum
Tags
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-P keyspace "lattetablets"
-P row_count 50100100
-P tablets true
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Threads 60
Connections 1
Concurrency [req] 64
Max rate [op/s]
Warmup [s]
└─ [op] 1
Run time [s] 30.0
└─ [op]
Sampling [s] 5.0
└─ [op]
Request timeout [s] 7
Retries
┌──────┴number 10
├─min interval [ms] 1000
└─max interval [ms] 5000
LOG ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Time Cycles Errors Thrpt. ────────────────────────────────── Latency [ms/op] ──────────────────────────────
[s] [op] [op] [op/s] Min 50 75 90 95 99 99.9 Max
5.006 1830326 0 365656 0.195 6.795 13.697 23.593 31.097 48.792 72.155 116.523
10.001 1838507 0 367722 0.192 6.812 13.640 23.527 30.999 47.809 69.730 96.928
15.010 1838173 0 367577 0.220 6.705 13.476 23.396 31.179 50.102 79.626 145.228
20.009 1837602 0 367612 0.211 6.722 13.541 23.495 31.244 49.545 74.973 114.885
^C 21.611 588675 0 367244 0.240 6.857 13.615 23.314 30.720 48.267 71.369 88.015
SUMMARY STATS ══════════════════════════════════════════════════════════════════════════════════════════════════════════════
Elapsed time [s] 21.633
CPU time [s] 226.948
CPU utilisation [%] 16.4
Cycles [op] 7933283
Errors [op] 0
└─ [%] 0.0
Requests [req] 7933283
└─ [req/op] 1.0
Retries [ret] 0
└─ [ret/req] 0.00000
Rows [row] 0
└─ [row/req] 0.0
Concurrency [req] 59 ± 0
└─ [%] 91
Throughput [op/s] 366727 ± 3603
├─ [req/s] 366727 ± 3603
└─ [row/s] 0 ± 0
Cycle latency [ms] 10.452 ± 0.017
Request latency [ms] 9.753 ± 0.018
CYCLE LATENCY for write [ms] ══════════════════════════════════════════════════════════════════════════════════════════════
Min 0.192 ± 0.054
25 3.803 ± 0.012
50 6.730 ± 0.025
75 13.451 ± 0.051
90 23.249 ± 0.096
95 30.835 ± 0.169
98 41.255 ± 0.220
99 48.660 ± 0.301
99.9 73.269 ± 0.916
99.99 97.124 ± 3.592
Max 142.606 ± 2.922
CYCLE LATENCY for read [ms] ═══════════════════════════════════════════════════════════════════════════════════════════════
Min 0.195 ± 0.018
25 3.799 ± 0.008
50 6.775 ± 0.016
75 13.623 ± 0.035
90 23.544 ± 0.061
95 31.162 ± 0.089
98 41.484 ± 0.149
99 49.054 ± 0.213
99.9 73.990 ± 0.639
99.99 98.042 ± 1.931
Max 145.228 ± 3.017
The interesting part is the throughput. With tokio 1.44.2 we can do about 1.5M op/s; with 1.45.0 it is only about 367k op/s, a roughly 75% reduction in performance!
I generated flamegraphs for the good and the bad run. (Flamegraph images were attached here: first the bad run, then the good run.)
Basically the only difference is a larger share of samples in tokio::time::sleep: 20% in the bad run vs 11.5% in the good one.
The only sleep-related change I see in 1.45 is https://github.com/tokio-rs/tokio/pull/7226, but I don't know Tokio well enough to conclusively say this is the cause.
Is that a known issue? Can I somehow help with debugging / fixing it?
What you are seeing is probably #6504. You mentioned #7226, which is a revert of #6534 that attempted to improve timer performance. Unfortunately, it was reverted because it was hurting performance for others.
There is an alternative PR at #7467 that is currently WIP. You can help by providing benchmarks.
I added this section to latte's Cargo.toml:
[patch.crates-io]
tokio = { git = "https://github.com/ADD-SP/tokio", branch = "add_sp/time-local-wheel" }
then executed cargo update tokio.
I verified that it worked:
cargo tree | grep tokio
│ │ ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c)
│ │ │ └── tokio-macros v2.6.0 (proc-macro) (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c)
│ ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│ ├── tokio-openssl v0.6.5
│ │ └── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
├── tokio-stream v0.1.17
│ └── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│ │ ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c)
│ │ ├── tokio-util v0.7.15
│ │ │ └── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│ │ ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│ │ ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│ │ └── tokio-native-tls v0.3.1
│ │ └── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│ ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│ ├── tokio-native-tls v0.3.1 (*)
└── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
I built latte and tested it. Unfortunately it did not help - I got about 340k req/s.
I believe @ADD-SP already implemented https://github.com/tokio-rs/tokio/pull/7467#issuecomment-3563928880, so you probably have to explicitly ask for the different timer implementation to use it.
Ah, I see. With tokio_unstable and .enable_alt_timer() I see the difference.
Stats from a few runs:
5.002 6389507 0 1277492 0.215 2.900 3.609 4.317 4.825 6.672 19.333 73.466
10.001 6886487 0 1377408 0.238 2.689 3.414 4.125 4.624 6.611 30.130 73.466
15.002 7779478 0 1555738 0.224 2.206 2.931 3.672 4.256 10.559 51.151 85.066
20.004 6958590 0 1391062 0.232 2.703 3.420 4.086 4.526 6.242 24.969 88.801
24.999 7673983 0 1535602 0.228 2.357 3.084 3.797 4.321 6.996 42.238 88.539
29.999 6979597 0 1396735 0.229 2.605 3.400 4.116 4.588 6.619 55.050 87.818
5.002 6359799 0 1271603 0.220 2.902 3.635 4.366 4.903 6.730 18.891 72.024
10.003 6867477 0 1373502 0.216 2.710 3.387 4.069 4.571 6.210 9.126 22.495
15.003 7182083 0 1436200 0.235 2.568 3.285 3.977 4.461 6.758 29.917 69.272
20.001 7532050 0 1506569 0.215 2.390 3.062 3.760 4.276 7.377 43.614 84.673
25.002 7936488 0 1587539 0.232 2.206 2.945 3.676 4.227 8.397 52.986 74.711
29.998 7493779 0 1499664 0.244 2.472 3.158 3.830 4.305 6.160 34.341 68.813
5.002 6526652 0 1304939 0.201 2.826 3.545 4.260 4.772 6.545 21.856 66.421
10.001 6651023 0 1330196 0.217 2.810 3.500 4.182 4.657 6.169 8.864 22.282
15.001 6762817 0 1352596 0.243 2.779 3.432 4.069 4.502 5.939 8.692 21.758
20.002 7089545 0 1417832 0.231 2.626 3.287 3.934 4.387 6.066 12.378 66.847
25.000 7204327 0 1440944 0.240 2.576 3.217 3.865 4.338 6.021 10.445 25.018
30.000 7337415 0 1468018 0.239 2.523 3.162 3.801 4.260 5.997 12.599 33.915
5.002 6674579 0 1334492 0.234 2.746 3.478 4.188 4.690 6.500 26.591 75.891
10.001 6775553 0 1355112 0.232 2.742 3.426 4.112 4.604 6.312 10.011 23.871
15.003 6955734 0 1391044 0.248 2.671 3.332 3.994 4.477 6.234 10.625 33.980
20.001 7098290 0 1419698 0.244 2.615 3.273 3.928 4.403 6.177 11.239 47.776
25.001 7204740 0 1440925 0.232 2.585 3.213 3.832 4.268 5.943 10.961 36.340
30.003 7226808 0 1445671 0.225 2.568 3.209 3.848 4.313 6.025 11.698 36.962
Each row represents stats from a 5-second fragment of a run. So it looks like performance is:
- much better than 1.45-1.48, so the regression is mostly gone
- slightly worse than 1.44 on average
- much more volatile than 1.44.2: with <=1.44.2, each 5-second fragment gave very similar results (fragments differed by 1-2k op/s), but now throughput varies from 1277k to 1535k op/s within a single execution.
This variation might be caused by the extra heap allocation of the alternative timer implementation.