tokio icon indicating copy to clipboard operation
tokio copied to clipboard

Performance regression in 1.45.0

Open Lorak-mmk opened this issue 5 months ago • 5 comments

Version 1.45.0. I also tried with 1.48.0, the regression is still present. 1.44.x is fine, 1.43.x as well.

Platform Linux latte-perf-debug-dev-loader-node-b2e23059-1 6.14.0-1017-aws #17~24.04.1-Ubuntu SMP Wed Nov 5 10:48:17 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Description

Tokio 1.45.0 and up introduced a significant performance regression, which we noticed using (our fork of) Scylla/Cassandra benchmarking tool latte.

Results for latte 0.32.0 with tokio 1.44.2
ubuntu@latte-perf-debug-dev-loader-node-b2e23059-1:~/latte_karol$ ./latte_0.32.0-tokio1.44.2 run ~/latte/simple.rn 10.0.10.196 -P keyspace="\"lattetablets\"" -P tablets=true -q -P row_count=50100100 -d=30s --consistency=LOCAL_QUORUM --threads=60 --concurrency=64 --connections=1 --s>
info: Loading workload script /home/ubuntu/latte/simple.rn...
info: Connecting to ["10.0.10.196"]... 
info: Connected to '' cluster running ScyllaDB 2025.3.3-0.20251029.0e6381f14db2 with build-id 05d55ab705a3d90b467774912f5eaa5ff495f1ee
info: Preparing...
info: Warming up...
info: Running benchmark...
CONFIG ═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
            Date            Tue, 25 Nov 2025                                                          
            Time            14:49:48 +0000                                                            
         Cluster                                                                                      
      Datacenter                                                                                      
            Rack                                                                                      
      DB version            ScyllaDB 2025.3.3-0.20251029.0e6381f14db2 with build-id 05d55ab705a3d90b467774912f5eaa5ff495f1ee                                            
        Workload            simple.rn                                                                 
     Function(s)            write:1, read:4                                                           
     Consistency            LocalQuorum                                                               
            Tags                                                                                      
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     -P keyspace            "lattetablets"                                                            
    -P row_count            50100100                                                                  
      -P tablets            true                                                                      
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
         Threads                   60                                                                 
     Connections                    1                                                                 
     Concurrency     [req]         64                                                                 
        Max rate    [op/s]                                                                            
          Warmup       [s]                                                                            
              └─      [op]          1                                                                 
        Run time       [s]       30.0                                                                 
              └─      [op]                                                                            
        Sampling       [s]        5.0                                                                 
              └─      [op]                                                                            
 Request timeout       [s]          7                                                                 
         Retries                                                                                      
  ┌──────┴number                   10                                                                 
  ├─min interval      [ms]       1000                                                                 
  └─max interval      [ms]       5000                                                                 

LOG ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
    Time    Cycles    Errors    Thrpt.     ────────────────────────────────── Latency [ms/op] ──────────────────────────────
     [s]      [op]      [op]    [op/s]             Min        50        75        90        95        99      99.9       Max
   5.002   7464708         0   1492498           0.216     2.232     3.049     3.916     4.608     8.593    34.898    65.831
  10.000   7576099         0   1515262           0.235     2.265     3.056     3.863     4.456     7.115    30.654    79.430
  15.002   7612968         0   1522633           0.225     2.241     3.031     3.836     4.424     7.234    32.784    77.529
  20.007   7605142         0   1520996           0.205     2.243     3.025     3.811     4.383     7.205    34.111    73.400
  25.000   7538566         0   1507792           0.217     2.245     3.049     3.883     4.510     7.578    33.047    81.920
  30.003   7524431         0   1504956           0.220     2.267     3.074     3.906     4.522     7.352    32.244    72.417
  30.003       280         0    124601           0.838    15.254    24.396    28.377    31.162    33.161    34.734    34.734

SUMMARY STATS ══════════════════════════════════════════════════════════════════════════════════════════════════════════════
    Elapsed time       [s]     30.029                                                                 
        CPU time       [s]   1293.994                                                                 
 CPU utilisation       [%]       67.3                                                                 
          Cycles      [op]   45322194                                                                 
          Errors      [op]          0                                                                 
              └─       [%]        0.0                                                                 
        Requests     [req]   45322194                                                                 
              └─  [req/op]        1.0                                                                 
         Retries     [ret]          0                                                                 
              └─ [ret/req]    0.00000                                                                 
            Rows     [row]          0                                                                 
              └─ [row/req]        0.0                                                                 
     Concurrency     [req]         59 ± 1                                                             
              └─       [%]         92                                                                 
      Throughput    [op/s]    1509287 ± 31023                                                         
              ├─   [req/s]    1509287 ± 31023                                                         
              └─   [row/s]          0 ± 0                                                             
   Cycle latency      [ms]      2.541 ± 0.002                                                         
 Request latency      [ms]      2.388 ± 0.002                                                         

CYCLE LATENCY for write [ms]  ══════════════════════════════════════════════════════════════════════════════════════════════
          Min                   0.225 ± 0.032                                                         
           25                   1.542 ± 0.002                                                         
           50                   2.165 ± 0.003                                                         
           75                   2.943 ± 0.003                                                         
           90                   3.770 ± 0.004                                                         
           95                   4.387 ± 0.007                                                         
           98                   5.718 ± 0.020                                                         
           99                   7.328 ± 0.050                                                         
           99.9                31.261 ± 0.517                                                         
           99.99               51.577 ± 1.054                                                         
          Max                  77.857 ± 1.137                                                         

CYCLE LATENCY for read [ms]  ═══════════════════════════════════════════════════════════════════════════════════════════════
          Min                   0.205 ± 0.021                                                         
           25                   1.612 ± 0.002                                                         
           50                   2.271 ± 0.002                                                         
           75                   3.072 ± 0.002                                                         
           90                   3.891 ± 0.003                                                         
           95                   4.506 ± 0.005                                                         
           98                   5.890 ± 0.013                                                         
           99                   7.471 ± 0.028                                                         
           99.9                33.473 ± 0.267                                                         
           99.99               54.559 ± 0.694                                                         
          Max                  81.920 ± 0.932                                                         
Results for latte 0.32.0 with tokio 1.45.0
ubuntu@latte-perf-debug-dev-loader-node-b2e23059-1:~/latte_karol$ ./latte_0.32.0-tokio1.45.0 run ~/latte/simple.rn 10.0.10.196 -P keyspace="\"lattetablets\"" -P tablets=true -q -P row_count=50100100 -d=30s --consistency=LOCAL_QUORUM --threads=60 --concurrency=64 --connections=1 --s>
info: Loading workload script /home/ubuntu/latte/simple.rn...
info: Connecting to ["10.0.10.196"]... 
info: Connected to '' cluster running ScyllaDB 2025.3.3-0.20251029.0e6381f14db2 with build-id 05d55ab705a3d90b467774912f5eaa5ff495f1ee
info: Preparing...
info: Warming up...
info: Running benchmark...
CONFIG ═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
            Date            Tue, 25 Nov 2025                                                          
            Time            14:45:38 +0000                                                            
         Cluster                                                                                      
      Datacenter                                                                                      
            Rack                                                                                      
      DB version            ScyllaDB 2025.3.3-0.20251029.0e6381f14db2 with build-id 05d55ab705a3d90b467774912f5eaa5ff495f1ee                                            
        Workload            simple.rn                                                                 
     Function(s)            write:1, read:4                                                           
     Consistency            LocalQuorum                                                               
            Tags                                                                                      
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     -P keyspace            "lattetablets"                                                            
    -P row_count            50100100                                                                  
      -P tablets            true                                                                      
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
         Threads                   60                                                                 
     Connections                    1                                                                 
     Concurrency     [req]         64                                                                 
        Max rate    [op/s]                                                                            
          Warmup       [s]                                                                            
              └─      [op]          1                                                                 
        Run time       [s]       30.0                                                                 
              └─      [op]                                                                            
        Sampling       [s]        5.0                                                                 
              └─      [op]                                                                            
 Request timeout       [s]          7                                                                 
         Retries                                                                                      
  ┌──────┴number                   10                                                                 
  ├─min interval      [ms]       1000                                                                 
  └─max interval      [ms]       5000                                                                 

LOG ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
    Time    Cycles    Errors    Thrpt.     ────────────────────────────────── Latency [ms/op] ──────────────────────────────
     [s]      [op]      [op]    [op/s]             Min        50        75        90        95        99      99.9       Max
   5.006   1830326         0    365656           0.195     6.795    13.697    23.593    31.097    48.792    72.155   116.523
  10.001   1838507         0    367722           0.192     6.812    13.640    23.527    30.999    47.809    69.730    96.928
  15.010   1838173         0    367577           0.220     6.705    13.476    23.396    31.179    50.102    79.626   145.228
  20.009   1837602         0    367612           0.211     6.722    13.541    23.495    31.244    49.545    74.973   114.885
^C  21.611    588675         0    367244           0.240     6.857    13.615    23.314    30.720    48.267    71.369    88.015

SUMMARY STATS ══════════════════════════════════════════════════════════════════════════════════════════════════════════════
    Elapsed time       [s]     21.633                                                                 
        CPU time       [s]    226.948                                                                 
 CPU utilisation       [%]       16.4                                                                 
          Cycles      [op]    7933283                                                                 
          Errors      [op]          0                                                                 
              └─       [%]        0.0                                                                 
        Requests     [req]    7933283                                                                 
              └─  [req/op]        1.0                                                                 
         Retries     [ret]          0                                                                 
              └─ [ret/req]    0.00000                                                                 
            Rows     [row]          0                                                                 
              └─ [row/req]        0.0                                                                 
     Concurrency     [req]         59 ± 0                                                             
              └─       [%]         91                                                                 
      Throughput    [op/s]     366727 ± 3603                                                          
              ├─   [req/s]     366727 ± 3603                                                          
              └─   [row/s]          0 ± 0                                                             
   Cycle latency      [ms]     10.452 ± 0.017                                                         
 Request latency      [ms]      9.753 ± 0.018                                                         

CYCLE LATENCY for write [ms]  ══════════════════════════════════════════════════════════════════════════════════════════════
          Min                   0.192 ± 0.054                                                         
           25                   3.803 ± 0.012                                                         
           50                   6.730 ± 0.025                                                         
           75                  13.451 ± 0.051                                                         
           90                  23.249 ± 0.096                                                         
           95                  30.835 ± 0.169                                                         
           98                  41.255 ± 0.220                                                         
           99                  48.660 ± 0.301                                                         
           99.9                73.269 ± 0.916                                                         
           99.99               97.124 ± 3.592                                                         
          Max                 142.606 ± 2.922                                                         

CYCLE LATENCY for read [ms]  ═══════════════════════════════════════════════════════════════════════════════════════════════
          Min                   0.195 ± 0.018                                                         
           25                   3.799 ± 0.008                                                         
           50                   6.775 ± 0.016                                                         
           75                  13.623 ± 0.035                                                         
           90                  23.544 ± 0.061                                                         
           95                  31.162 ± 0.089                                                         
           98                  41.484 ± 0.149                                                         
           99                  49.054 ± 0.213                                                         
           99.9                73.990 ± 0.639                                                         
           99.99               98.042 ± 1.931                                                         
          Max                 145.228 ± 3.017                                       

The interesting part is the throughput. For tokio 1.44.2, we can do about 1.5M ops/s. For 1.45.0, it is only about 367k ops/s. 75% reduction in performance!

I generated flamegraphs for good and bad run. Bad:

Image

Good:

Image

Basically the only difference is a bigger presence of tokio::time::sleep - 20% vs 11.5%. The only sleep-related change I see in 1.45 is https://github.com/tokio-rs/tokio/pull/7226 , but I don't know Tokio well enough to conclusively say this is the cause.

Is that a known issue? Can I somehow help with debugging / fixing it?

Lorak-mmk avatar Nov 25 '25 15:11 Lorak-mmk

What you are seeing is probably #6504. You mentioned #7226, which is a revert of #6534 that attempted to improve timer performance. Unfortunately, it was reverted because it was hurting performance for others.

There is an alternative PR at #7467 that is currently WIP. You can help by providing benchmarks.

Darksonn avatar Nov 25 '25 15:11 Darksonn

I added this section to Cargo.toml of latte:

[patch.crates-io]
tokio = { git = "https://github.com/ADD-SP/tokio", branch = "add_sp/time-local-wheel"

then executed cargo update tokio. I verified that it worked:

cargo tree | grep tokio
│   │   ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c)
│   │   │   └── tokio-macros v2.6.0 (proc-macro) (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c)
│   ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│   ├── tokio-openssl v0.6.5
│   │   └── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
├── tokio-stream v0.1.17
│   └── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│   │   ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c)
│   │   ├── tokio-util v0.7.15
│   │   │   └── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│   │   ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│   │   ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│   │   └── tokio-native-tls v0.3.1
│   │       └── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│   ├── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)
│   ├── tokio-native-tls v0.3.1 (*)
└── tokio v1.48.0 (https://github.com/ADD-SP/tokio?branch=add_sp%2Ftime-local-wheel#804f799c) (*)

I built latte and tested it. Unfortunately it did not help - I got about 340k req/s.

Lorak-mmk avatar Nov 25 '25 16:11 Lorak-mmk

I believe @ADD-SP already implemented https://github.com/tokio-rs/tokio/pull/7467#issuecomment-3563928880, so you probably have to explicitly ask for the different timer implementation to use it.

Darksonn avatar Nov 25 '25 16:11 Darksonn

Ah, I see. With tokio-unstable and .enable_alt_timer() I see the difference. Stats from few runs:

   5.002   6389507         0   1277492           0.215     2.900     3.609     4.317     4.825     6.672    19.333    73.466
  10.001   6886487         0   1377408           0.238     2.689     3.414     4.125     4.624     6.611    30.130    73.466
  15.002   7779478         0   1555738           0.224     2.206     2.931     3.672     4.256    10.559    51.151    85.066
  20.004   6958590         0   1391062           0.232     2.703     3.420     4.086     4.526     6.242    24.969    88.801
  24.999   7673983         0   1535602           0.228     2.357     3.084     3.797     4.321     6.996    42.238    88.539
  29.999   6979597         0   1396735           0.229     2.605     3.400     4.116     4.588     6.619    55.050    87.818


   5.002   6359799         0   1271603           0.220     2.902     3.635     4.366     4.903     6.730    18.891    72.024
  10.003   6867477         0   1373502           0.216     2.710     3.387     4.069     4.571     6.210     9.126    22.495
  15.003   7182083         0   1436200           0.235     2.568     3.285     3.977     4.461     6.758    29.917    69.272
  20.001   7532050         0   1506569           0.215     2.390     3.062     3.760     4.276     7.377    43.614    84.673
  25.002   7936488         0   1587539           0.232     2.206     2.945     3.676     4.227     8.397    52.986    74.711
  29.998   7493779         0   1499664           0.244     2.472     3.158     3.830     4.305     6.160    34.341    68.813


   5.002   6526652         0   1304939           0.201     2.826     3.545     4.260     4.772     6.545    21.856    66.421
  10.001   6651023         0   1330196           0.217     2.810     3.500     4.182     4.657     6.169     8.864    22.282
  15.001   6762817         0   1352596           0.243     2.779     3.432     4.069     4.502     5.939     8.692    21.758
  20.002   7089545         0   1417832           0.231     2.626     3.287     3.934     4.387     6.066    12.378    66.847
  25.000   7204327         0   1440944           0.240     2.576     3.217     3.865     4.338     6.021    10.445    25.018
  30.000   7337415         0   1468018           0.239     2.523     3.162     3.801     4.260     5.997    12.599    33.915


   5.002   6674579         0   1334492           0.234     2.746     3.478     4.188     4.690     6.500    26.591    75.891
  10.001   6775553         0   1355112           0.232     2.742     3.426     4.112     4.604     6.312    10.011    23.871
  15.003   6955734         0   1391044           0.248     2.671     3.332     3.994     4.477     6.234    10.625    33.980
  20.001   7098290         0   1419698           0.244     2.615     3.273     3.928     4.403     6.177    11.239    47.776
  25.001   7204740         0   1440925           0.232     2.585     3.213     3.832     4.268     5.943    10.961    36.340
  30.003   7226808         0   1445671           0.225     2.568     3.209     3.848     4.313     6.025    11.698    36.962


Each row represents stats from a 5-second fragment of a run., So it looks like performance is:

  • much better than 1.45-1.48, so the regression is mostly gone
  • slightly worse that 1.44 on average
  • much more volatile than 1.44.2. In <=1.44.2, each 5-second fragment would have very similar results (fragments would differ by 1-2 k ops/s), but now it can vary from 1277k to 1535k in single execution.

Lorak-mmk avatar Nov 25 '25 17:11 Lorak-mmk

much more volatile than 1.44.2. In <=1.44.2, each 5-second fragment would have very similar results (fragments would differ by 1-2 k ops/s), but now it can vary from 1277k to 1535k in single execution.

This variation might be caused by the extra heap allocation of the alternative timer implementation.

ADD-SP avatar Nov 26 '25 11:11 ADD-SP