
feat: Reduce reallocations in RxStreamOrderer

Open larseggert opened this issue 3 months ago • 9 comments

Let's see if this helps performance.

larseggert avatar Sep 22 '25 07:09 larseggert

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 93.36%. Comparing base (b3d8f0d) to head (cc5c529).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3003      +/-   ##
==========================================
- Coverage   95.66%   93.36%   -2.31%     
==========================================
  Files         123      123              
  Lines       35702    35712      +10     
  Branches    35702    35712      +10     
==========================================
- Hits        34156    33342     -814     
- Misses       1506     1528      +22     
- Partials       40      842     +802     
| Components | Coverage Δ |
|---|---|
| neqo-common | 97.31% <ø> (-0.88%) :arrow_down: |
| neqo-crypto | 83.31% <ø> (-7.17%) :arrow_down: |
| neqo-http3 | 93.32% <ø> (-1.81%) :arrow_down: |
| neqo-qpack | 94.14% <ø> (-2.09%) :arrow_down: |
| neqo-transport | 94.44% <100.00%> (-2.14%) :arrow_down: |
| neqo-udp | 80.48% <ø> (-10.74%) :arrow_down: |
| mtu | 85.76% <ø> (-1.74%) :arrow_down: |

codecov[bot] avatar Sep 22 '25 07:09 codecov[bot]

🐰 Bencher Report

Branch: feat-inbound_frame-prealloc
Testbed: On-prem

| Benchmark | Latency result, ms (Δ%) | Baseline, ms | Upper boundary, ms (limit %) |
|---|---|---|---|
| google vs. neqo (cubic, paced) | 278.12 (-0.08%) | 278.34 | 282.73 (98.37%) |
| msquic vs. neqo (cubic, paced) | 224.72 (+12.76%) | 199.30 | 236.94 (94.84%) |
| neqo vs. google (cubic, paced) | 756.26 (-0.45%) | 759.69 | 774.82 (97.61%) |
| neqo vs. msquic (cubic, paced) | 156.46 (-0.83%) | 157.78 | 160.59 (97.43%) |
| neqo vs. neqo (cubic) | 94.69 (+3.42%) | 91.56 | 96.88 (97.74%) |
| neqo vs. neqo (cubic, paced) | 94.16 (+1.35%) | 92.90 | 98.09 (95.99%) |
| neqo vs. neqo (reno) | 93.24 (+1.86%) | 91.54 | 96.70 (96.43%) |
| neqo vs. neqo (reno, paced) | 95.04 (+2.42%) | 92.79 | 97.78 (97.19%) |
| neqo vs. quiche (cubic, paced) | 191.75 (-0.97%) | 193.64 | 196.97 (97.35%) |
| neqo vs. s2n (cubic, paced) | 221.72 (+0.26%) | 221.14 | 224.10 (98.94%) |
| quiche vs. neqo (cubic, paced) | 157.33 (+2.74%) | 153.14 | 158.50 (99.26%) |
| s2n vs. neqo (cubic, paced) | 173.28 (-0.28%) | 173.77 | 178.00 (97.35%) |
🐰 View full continuous benchmarking report in Bencher

github-actions[bot] avatar Sep 22 '25 08:09 github-actions[bot]

Benchmark results

Performance differences relative to b3d8f0d21db5657ebafb14f9b60c8892d2eb3aa9.

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: No change in performance detected.
       time:   [200.11 ms 200.47 ms 200.96 ms]
       thrpt:  [497.61 MiB/s 498.82 MiB/s 499.72 MiB/s]
change:
       time:   [−0.0850% +0.1485% +0.4403%] (p = 0.28 > 0.05)
       thrpt:  [−0.4384% −0.1483% +0.0851%]

Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe

1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: No change in performance detected.
       time:   [299.66 ms 301.36 ms 303.08 ms]
       thrpt:  [32.994 Kelem/s 33.183 Kelem/s 33.371 Kelem/s]
change:
       time:   [−0.3116% +0.4839% +1.2104%] (p = 0.21 > 0.05)
       thrpt:  [−1.1959% −0.4816% +0.3126%]

Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) low mild 1 (1.00%) high mild

1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: No change in performance detected.
       time:   [28.416 ms 28.512 ms 28.630 ms]
       thrpt:  [34.928   B/s 35.073   B/s 35.191   B/s]
change:
       time:   [−0.3750% +0.0906% +0.5737%] (p = 0.71 > 0.05)
       thrpt:  [−0.5704% −0.0905% +0.3764%]

Found 23 outliers among 100 measurements (23.00%) 11 (11.00%) low severe 1 (1.00%) high mild 11 (11.00%) high severe

1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: :green_heart: Performance has improved.
       time:   [202.29 ms 202.62 ms 203.01 ms]
       thrpt:  [492.58 MiB/s 493.52 MiB/s 494.33 MiB/s]
change:
       time:   [−3.7342% −3.4882% −3.2606%] (p = 0.00 < 0.05)
       thrpt:  [… +3.6142% +3.8791%]

Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high severe

decode 4096 bytes, mask ff: No change in performance detected.
       time:   [11.613 µs 11.651 µs 11.694 µs]
       change: [−0.8027% −0.1804% +0.3326%] (p = 0.57 > 0.05)

Found 18 outliers among 100 measurements (18.00%) 2 (2.00%) low severe 6 (6.00%) low mild 3 (3.00%) high mild 7 (7.00%) high severe

decode 1048576 bytes, mask ff: No change in performance detected.
       time:   [3.0185 ms 3.0278 ms 3.0387 ms]
       change: [−0.8118% −0.1895% +0.3544%] (p = 0.54 > 0.05)

Found 8 outliers among 100 measurements (8.00%) 8 (8.00%) high severe

decode 4096 bytes, mask 7f: No change in performance detected.
       time:   [19.948 µs 19.998 µs 20.056 µs]
       change: [−0.3160% +0.1413% +0.5881%] (p = 0.57 > 0.05)

Found 17 outliers among 100 measurements (17.00%) 1 (1.00%) low severe 1 (1.00%) high mild 15 (15.00%) high severe

decode 1048576 bytes, mask 7f: No change in performance detected.
       time:   [5.0328 ms 5.0426 ms 5.0540 ms]
       change: [−1.2024% −0.5035% +0.0438%] (p = 0.12 > 0.05)

Found 11 outliers among 100 measurements (11.00%) 1 (1.00%) low mild 10 (10.00%) high severe

decode 4096 bytes, mask 3f: No change in performance detected.
       time:   [8.2789 µs 8.3159 µs 8.3580 µs]
       change: [+0.0483% +0.5635% +1.1728%] (p = 0.06 > 0.05)

Found 25 outliers among 100 measurements (25.00%) 2 (2.00%) low severe 8 (8.00%) low mild 3 (3.00%) high mild 12 (12.00%) high severe

decode 1048576 bytes, mask 3f: No change in performance detected.
       time:   [1.5881 ms 1.5949 ms 1.6035 ms]
       change: [−2.0388% −0.4004% +0.7757%] (p = 0.67 > 0.05)

Found 11 outliers among 100 measurements (11.00%) 3 (3.00%) high mild 8 (8.00%) high severe

1-streams/each-1000-bytes/wallclock-time: Change within noise threshold.
       time:   [589.89 µs 591.70 µs 593.79 µs]
       change: [−1.1506% −0.6516% −0.1492%] (p = 0.01 < 0.05)

Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) high severe

1-streams/each-1000-bytes/simulated-time: No change in performance detected.
       time:   [118.79 ms 118.98 ms 119.17 ms]
       thrpt:  [8.1944 KiB/s 8.2076 KiB/s 8.2211 KiB/s]
change:
       time:   [−0.2819% −0.0113% +0.2549%] (p = 0.93 > 0.05)
       thrpt:  [−0.2543% +0.0113% +0.2827%]

Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) low mild

1000-streams/each-1-bytes/wallclock-time: Change within noise threshold.
       time:   [14.032 ms 14.061 ms 14.092 ms]
       change: [−0.9199% −0.6459% −0.3527%] (p = 0.00 < 0.05)

Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild

1000-streams/each-1-bytes/simulated-time: No change in performance detected.
       time:   [14.985 s 15.000 s 15.014 s]
       thrpt:  [66.605   B/s 66.668   B/s 66.732   B/s]
change:
       time:   [−0.1109% +0.0187% +0.1448%] (p = 0.78 > 0.05)
       thrpt:  [−0.1446% −0.0187% +0.1110%]

Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) low mild 1 (1.00%) high mild

1000-streams/each-1000-bytes/wallclock-time: No change in performance detected.
       time:   [50.776 ms 50.939 ms 51.101 ms]
       change: [−0.1426% +0.5306% +1.1027%] (p = 0.10 > 0.05)
1000-streams/each-1000-bytes/simulated-time: No change in performance detected.
       time:   [18.741 s 18.912 s 19.085 s]
       thrpt:  [51.170 KiB/s 51.638 KiB/s 52.110 KiB/s]
change:
       time:   [−0.9411% +0.3276% +1.5536%] (p = 0.62 > 0.05)
       thrpt:  [−1.5298% −0.3265% +0.9501%]

Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild

coalesce_acked_from_zero 1+1 entries: No change in performance detected.
       time:   [88.191 ns 88.538 ns 88.895 ns]
       change: [−0.1959% +0.3684% +1.1270%] (p = 0.28 > 0.05)

Found 10 outliers among 100 measurements (10.00%) 6 (6.00%) high mild 4 (4.00%) high severe

coalesce_acked_from_zero 3+1 entries: No change in performance detected.
       time:   [106.03 ns 106.55 ns 107.30 ns]
       change: [−0.2849% +0.3210% +1.0704%] (p = 0.39 > 0.05)

Found 9 outliers among 100 measurements (9.00%) 9 (9.00%) high severe

coalesce_acked_from_zero 10+1 entries: No change in performance detected.
       time:   [105.58 ns 106.07 ns 106.64 ns]
       change: [−0.4911% −0.0224% +0.4220%] (p = 0.92 > 0.05)

Found 15 outliers among 100 measurements (15.00%) 4 (4.00%) low severe 3 (3.00%) low mild 1 (1.00%) high mild 7 (7.00%) high severe

coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
       time:   [88.723 ns 91.556 ns 98.100 ns]
       change: [−1.4101% +2.7728% +10.065%] (p = 0.60 > 0.05)

Found 14 outliers among 100 measurements (14.00%) 6 (6.00%) high mild 8 (8.00%) high severe

RxStreamOrderer::inbound_frame(): :green_heart: Performance has improved.
       time:   [102.63 ms 102.79 ms 103.07 ms]
       change: [−7.6514% −7.3735% −7.0581%] (p = 0.00 < 0.05)

Found 10 outliers among 100 measurements (10.00%) 6 (6.00%) low mild 2 (2.00%) high mild 2 (2.00%) high severe

sent::Packets::take_ranges: No change in performance detected.
       time:   [4.5298 µs 4.6645 µs 4.8068 µs]
       change: [−2.0069% +1.3391% +5.2122%] (p = 0.52 > 0.05)

Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe

transfer/pacing-false/varying-seeds/wallclock-time/run: Change within noise threshold.
       time:   [26.951 ms 27.000 ms 27.052 ms]
       change: [+0.9397% +1.2158% +1.4901%] (p = 0.00 < 0.05)

Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) low mild 2 (2.00%) high mild 1 (1.00%) high severe

transfer/pacing-false/varying-seeds/simulated-time/run: No change in performance detected.
       time:   [25.129 s 25.166 s 25.204 s]
       thrpt:  [162.52 KiB/s 162.76 KiB/s 163.00 KiB/s]
change:
       time:   [−0.2319% −0.0321% +0.1692%] (p = 0.76 > 0.05)
       thrpt:  [−0.1689% +0.0321% +0.2325%]

Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) low mild 2 (2.00%) high mild

transfer/pacing-true/varying-seeds/wallclock-time/run: Change within noise threshold.
       time:   [27.309 ms 27.379 ms 27.452 ms]
       change: [+0.4884% +0.8607% +1.2416%] (p = 0.00 < 0.05)

Found 4 outliers among 100 measurements (4.00%) 4 (4.00%) high mild

transfer/pacing-true/varying-seeds/simulated-time/run: Change within noise threshold.
       time:   [24.987 s 25.033 s 25.079 s]
       thrpt:  [163.32 KiB/s 163.62 KiB/s 163.92 KiB/s]
change:
       time:   [+0.0800% +0.3075% +0.5500%] (p = 0.01 < 0.05)
       thrpt:  [… −0.3065% −0.0799%]
transfer/pacing-false/same-seed/wallclock-time/run: Change within noise threshold.
       time:   [26.419 ms 26.434 ms 26.449 ms]
       change: [+0.9941% +1.1013% +1.2043%] (p = 0.00 < 0.05)

Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild

transfer/pacing-false/same-seed/simulated-time/run: No change in performance detected.
       time:   [25.152 s 25.152 s 25.152 s]
       thrpt:  [162.85 KiB/s 162.85 KiB/s 162.85 KiB/s]
change:
       time:   [+0.0000% +0.0000% +0.0000%] (p = NaN > 0.05)
       thrpt:  [+0.0000% +0.0000% +0.0000%]
transfer/pacing-true/same-seed/wallclock-time/run: Change within noise threshold.
       time:   [28.142 ms 28.160 ms 28.179 ms]
       change: [+0.0679% +0.1823% +0.2954%] (p = 0.00 < 0.05)

Found 4 outliers among 100 measurements (4.00%) 4 (4.00%) high mild

transfer/pacing-true/same-seed/simulated-time/run: No change in performance detected.
       time:   [25.588 s 25.588 s 25.588 s]
       thrpt:  [160.07 KiB/s 160.07 KiB/s 160.07 KiB/s]
change:
       time:   [+0.0000% +0.0000% +0.0000%] (p = NaN > 0.05)
       thrpt:  [+0.0000% +0.0000% +0.0000%]

Download data for profiler.firefox.com or download performance comparison data.

github-actions[bot] avatar Sep 22 '25 08:09 github-actions[bot]

Hm. Transfer test regression, but one bench shows an improvement. Time to look at flamegraphs...

larseggert avatar Sep 22 '25 08:09 larseggert

Hm. Simplifying inbound_frame drastically doesn't seem to make things slower.

larseggert avatar Sep 23 '25 14:09 larseggert

So what we save in memcpy we spend on malloc now :-)
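
For readers unfamiliar with the cost being traded here, the following is a generic sketch of how preallocating a buffer avoids repeated reallocation while data is appended. It is not neqo's `RxStreamOrderer` code; the helper `count_capacity_changes` and the 64 chunks of 1200 bytes are made up for illustration, and a capacity change is used only as a proxy for an allocator call.

```rust
/// Appends every chunk to `buf` and counts how often the buffer's capacity
/// changes. Each change means the Vec had to allocate (or reallocate and
/// copy its contents). Hypothetical helper, not part of neqo.
fn count_capacity_changes(mut buf: Vec<u8>, chunks: &[&[u8]]) -> usize {
    let mut changes = 0;
    let mut cap = buf.capacity();
    for chunk in chunks {
        buf.extend_from_slice(chunk);
        if buf.capacity() != cap {
            changes += 1;
            cap = buf.capacity();
        }
    }
    changes
}

fn main() {
    // Assumed workload: 64 chunks of 1200 bytes, roughly packet-sized.
    let chunk = [0u8; 1200];
    let chunks: Vec<&[u8]> = (0..64).map(|_| &chunk[..]).collect();
    let total: usize = chunks.iter().map(|c| c.len()).sum();

    // Growing from empty: the Vec allocates at least once and typically
    // several times; the exact count depends on its growth strategy.
    let grown = count_capacity_changes(Vec::new(), &chunks);

    // Reserving the full size up front: no capacity change while appending.
    let preallocated = count_capacity_changes(Vec::with_capacity(total), &chunks);

    println!("grown: {grown} capacity changes, preallocated: {preallocated}");
}
```

The sketch only counts allocator events. It does not measure the up-front cost of the larger allocation, which is the other side of the trade-off the comment above points at.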

larseggert avatar Sep 23 '25 14:09 larseggert

CodSpeed Performance Report

Merging #3003 will improve performance by 17.7%

Comparing larseggert:feat-inbound_frame-prealloc (102a412) with main (b9c32c7)

Summary

⚡ 1 improvement
✅ 22 untouched

Benchmarks breakdown

| Mode | Benchmark | BASE | HEAD | Change |
|---|---|---|---|---|
| Simulation | client | 852.3 ms | 724.1 ms | +17.7% |
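
The +17.7% figure is numerically consistent with BASE / HEAD − 1 rather than with the fraction of time saved. That formula is inferred from the two numbers in the row above, not taken from CodSpeed's documentation, and the function names below are hypothetical. A small sketch of both readings:

```rust
/// Relative speedup in percent, assuming the report uses base / head - 1.
fn speedup_percent(base_ms: f64, head_ms: f64) -> f64 {
    (base_ms / head_ms - 1.0) * 100.0
}

/// Share of the baseline time that was saved, in percent.
fn time_saved_percent(base_ms: f64, head_ms: f64) -> f64 {
    (base_ms - head_ms) / base_ms * 100.0
}

fn main() {
    let (base, head) = (852.3, 724.1);
    println!("speedup:    {:.1}%", speedup_percent(base, head)); // prints 17.7%
    println!("time saved: {:.1}%", time_saved_percent(base, head)); // prints 15.0%
}
```

Under that reading, the simulated client run takes about 15% less time, which corresponds to being about 17.7% faster.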

codspeed-hq[bot] avatar Nov 12 '25 07:11 codspeed-hq[bot]

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to b9c32c70e273cd89f25d7f0561e01a083b8bdf03.

neqo-latest as client

neqo-latest as server

All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

github-actions[bot] avatar Nov 12 '25 07:11 github-actions[bot]

Client/server transfer results

Performance differences relative to b9c32c70e273cd89f25d7f0561e01a083b8bdf03.

Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.

| Client vs. server (params) | Mean ± σ | Min | Max | MiB/s ± σ | Δ main | Δ main |
|---|---|---|---|---|---|---|
| google vs. google | 455.4 ± 4.4 | 450.0 | 466.7 | 70.3 ± 7.3 | | |
| google vs. neqo (cubic, paced) | 278.1 ± 4.5 | 268.8 | 286.8 | 115.1 ± 7.1 | 1.2 | 0.4% |
| msquic vs. msquic | 187.8 ± 63.8 | 143.9 | 407.4 | 170.4 ± 0.5 | | |
| msquic vs. neqo (cubic, paced) | 224.7 ± 60.4 | 159.9 | 394.8 | 142.4 ± 0.5 | 10.1 | 4.7% |
| neqo vs. google (cubic, paced) | 756.3 ± 4.2 | 750.2 | 770.4 | 42.3 ± 7.6 | -0.3 | -0.0% |
| neqo vs. msquic (cubic, paced) | 156.5 ± 4.7 | 149.4 | 173.0 | 204.5 ± 6.8 | -0.9 | -0.6% |
| neqo vs. neqo (cubic) | 94.7 ± 4.8 | 85.2 | 107.6 | 337.9 ± 6.7 | 1.3 | 1.4% |
| neqo vs. neqo (cubic, paced) | 94.2 ± 4.3 | 85.9 | 103.4 | 339.9 ± 7.4 | 0.1 | 0.1% |
| neqo vs. neqo (reno) | 93.2 ± 4.6 | 85.7 | 102.7 | 343.2 ± 7.0 | -0.9 | -1.0% |
| neqo vs. neqo (reno, paced) | 95.0 ± 4.3 | 88.2 | 105.4 | 336.7 ± 7.4 | 0.1 | 0.1% |
| neqo vs. quiche (cubic, paced) | 191.7 ± 4.2 | 186.1 | 203.0 | 166.9 ± 7.6 | :green_heart: -3.1 | -1.6% |
| neqo vs. s2n (cubic, paced) | 221.7 ± 4.6 | 213.5 | 234.4 | 144.3 ± 7.0 | :broken_heart: 1.4 | 0.6% |
| quiche vs. neqo (cubic, paced) | 157.3 ± 4.8 | 145.7 | 170.4 | 203.4 ± 6.7 | 0.3 | 0.2% |
| quiche vs. quiche | 145.3 ± 4.3 | 138.9 | 159.7 | 220.2 ± 7.4 | | |
| s2n vs. neqo (cubic, paced) | 173.3 ± 4.8 | 162.7 | 180.7 | 184.7 ± 6.7 | 0.8 | 0.4% |
| s2n vs. s2n | 251.9 ± 28.5 | 232.0 | 350.0 | 127.0 ± 1.1 | | |
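
The MiB/s column above appears to follow directly from the stated transfer size (33554432 bytes, i.e. 32 MiB) and the mean time in milliseconds. The sketch below checks that against two rows; the function name `mib_per_second` is hypothetical, and the relationship is inferred from the numbers, not from the benchmark script.

```rust
/// Throughput in MiB/s for `bytes` transferred in `mean_ms` milliseconds.
fn mib_per_second(bytes: u64, mean_ms: f64) -> f64 {
    let mib = bytes as f64 / (1024.0 * 1024.0);
    mib / (mean_ms / 1000.0)
}

fn main() {
    const TRANSFER_BYTES: u64 = 33_554_432; // 32 MiB, as stated above the table

    // neqo vs. neqo (cubic): mean 94.7 ms
    println!("{:.1} MiB/s", mib_per_second(TRANSFER_BYTES, 94.7)); // prints 337.9 MiB/s

    // google vs. neqo (cubic, paced): mean 278.1 ms
    println!("{:.1} MiB/s", mib_per_second(TRANSFER_BYTES, 278.1)); // prints 115.1 MiB/s
}
```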

Download data for profiler.firefox.com or download performance comparison data.

github-actions[bot] avatar Nov 12 '25 09:11 github-actions[bot]