
perf: use GSO (attempt 3)

Open mxinden opened this issue 8 months ago • 4 comments

Attempt 1: https://github.com/mozilla/neqo/commit/f25b0b77ff579b56a4ea882a3ca70404b3c38b03
Attempt 2: https://github.com/mozilla/neqo/pull/2532/

Compared to attempt 2:

  • implements the datagram batching in neqo-transport instead of neqo-bin
  • does not copy each datagram into the larger GSO buffer after the fact, but instead writes each datagram into the GSO buffer right away (see the sketch below).
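
A minimal sketch of the write-in-place idea, assuming a hypothetical `GsoBatch` type rather than neqo's actual internals: each datagram is encoded straight into the tail of one contiguous buffer, and the batch records the segment size so the whole buffer can later be handed to the OS as a single GSO send.

```rust
/// Illustrative only; field and method names are hypothetical, not neqo's API.
struct GsoBatch {
    buf: Vec<u8>,        // all datagrams, back to back
    segment_size: usize, // size of every segment except possibly the last
    segments: usize,
}

impl GsoBatch {
    fn new(segment_size: usize, max_segments: usize) -> Self {
        Self {
            buf: Vec::with_capacity(segment_size * max_segments),
            segment_size,
            segments: 0,
        }
    }

    /// Let the caller encode the next datagram directly into the tail of the
    /// shared buffer, avoiding a per-datagram allocation and copy. Returns
    /// `true` if another segment may follow: GSO requires all segments except
    /// the last to be exactly `segment_size` bytes long.
    fn write_next(&mut self, encode: impl FnOnce(&mut Vec<u8>)) -> bool {
        let start = self.buf.len();
        encode(&mut self.buf);
        let written = self.buf.len() - start;
        debug_assert!(written <= self.segment_size);
        self.segments += 1;
        written == self.segment_size
    }
}
```

When the batch is flushed, `buf` plus `segment_size` describe the entire send to the UDP layer; a datagram shorter than `segment_size` simply terminates the batch.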

mxinden avatar Apr 18 '25 13:04 mxinden

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to 66be2e61c9e899e91a1c9b27b053de15c125731e.

neqo-latest as client

neqo-latest as server

All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

  • neqo-latest vs. aioquic: H DC LR :rocket:~~C20 M S~~ R :rocket:~~3~~ B :warning:U A :warning:L1 C1 :warning:C2 6 V2 :warning:BA :rocket:~~BP~~
  • neqo-latest vs. go-x-net: :rocket:~~H DC~~ LR :warning:U L2 6 :rocket:~~M B A C2~~
  • neqo-latest vs. lsquic: :warning:H DC :warning:C20 M S R Z 3 B :warning:A :rocket:~~U~~ L2 :warning:6 :rocket:~~C2~~ V2 :warning:BP BA :warning:CM
  • neqo-latest vs. mvfst: H :warning:DC LR M R :warning:Z 3 B :rocket:~~U~~ L2 :warning:BP BA :rocket:~~C2 6~~
  • neqo-latest vs. neqo-latest: :warning:H DC LR C20 :warning:M S 3 B U E :warning:L1 :rocket:~~L2~~ C2 :warning:BP CM :rocket:~~6~~
  • neqo-latest vs. nginx: :warning:H :rocket:~~LR~~ C20 :warning:M S :warning:R Z :rocket:~~3~~ B :warning:U :rocket:~~A L1~~ L2 C1 C2 6
  • neqo-latest vs. ngtcp2: H DC :warning:C20 M :warning:S R :warning:3 U E A :warning:L1 L2 C1 C2 6 V2 BP :warning:BA
  • neqo-latest vs. picoquic: :warning:H DC LR C20 M :rocket:~~S~~ R 3 U E :warning:L1 L2 C2 V2 BP BA
  • neqo-latest vs. quic-go: H DC C20 :warning:M R :rocket:~~S~~ Z :warning:3 B U L1 :rocket:~~L2~~ C1 :rocket:~~C2~~ 6 BP BA
  • neqo-latest vs. quiche: :warning:H DC :warning:LR C20 M :rocket:~~S~~ R Z :warning:3 B :warning:U A :rocket:~~L2~~ C1 C2 :warning:6
  • neqo-latest vs. quinn: H :rocket:~~DC~~ LR C20 :warning:M S R Z :rocket:~~3~~ B U :warning:E A L1 C1 :warning:C2 6 BP BA
  • neqo-latest vs. s2n-quic: H LR C20 M S :rocket:~~R~~ 3 B U E A L1 :rocket:~~L2 C1 C2~~ 6
  • neqo-latest vs. tquic: :rocket:~~DC~~ LR :warning:C20 M Z 3 :warning:U :rocket:~~B~~ L1 L2 C1 C2 6

neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

github-actions[bot] avatar Apr 18 '25 13:04 github-actions[bot]

Benchmark results

Performance differences relative to 95f9bedb40bc852f5f62b611ad6b2fd22c636843.

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: :green_heart: Performance has improved.
       time:   [202.13 ms 202.48 ms 202.85 ms]
       thrpt:  [492.97 MiB/s 493.86 MiB/s 494.74 MiB/s]
change:
       time:   [−69.170% −69.106% −69.043%] (p = 0.00 < 0.05)
       thrpt:  [… +223.69% +224.36%]

Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild

1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: Change within noise threshold.
       time:   [304.31 ms 305.83 ms 307.35 ms]
       thrpt:  [32.536 Kelem/s 32.698 Kelem/s 32.862 Kelem/s]
change:
       time:   [+0.6959% +1.3800% +2.0501%] (p = 0.00 < 0.05)
       thrpt:  [… −1.3612% −0.6911%]
1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: :broken_heart: Performance has regressed.
       time:   [27.525 ms 27.597 ms 27.673 ms]
       thrpt:  [36.136  elem/s 36.236  elem/s 36.331  elem/s]
change:
       time:   [+1.1068% +1.8128% +2.4725%] (p = 0.00 < 0.05)
       thrpt:  [… −1.7805% −1.0947%]

Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild

1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: :green_heart: Performance has improved.
       time:   [648.95 ms 654.10 ms 659.22 ms]
       thrpt:  [151.69 MiB/s 152.88 MiB/s 154.10 MiB/s]
change:
       time:   [−28.772% −27.945% −27.096%] (p = 0.00 < 0.05)
       thrpt:  [… +38.782% +40.394%]

Found 10 outliers among 100 measurements (10.00%) 4 (4.00%) low severe 4 (4.00%) low mild 2 (2.00%) high severe

decode 4096 bytes, mask ff: No change in performance detected.
       time:   [11.792 µs 11.818 µs 11.851 µs]
       change: [−0.7711% −0.1718% +0.3634%] (p = 0.57 > 0.05)

Found 15 outliers among 100 measurements (15.00%) 3 (3.00%) low severe 2 (2.00%) low mild 3 (3.00%) high mild 7 (7.00%) high severe

decode 1048576 bytes, mask ff: No change in performance detected.
       time:   [3.0229 ms 3.0323 ms 3.0435 ms]
       change: [−0.2404% +0.1966% +0.6361%] (p = 0.39 > 0.05)

Found 9 outliers among 100 measurements (9.00%) 9 (9.00%) high severe

decode 4096 bytes, mask 7f: No change in performance detected.
       time:   [19.968 µs 20.023 µs 20.082 µs]
       change: [−0.7959% −0.1788% +0.3899%] (p = 0.57 > 0.05)

Found 21 outliers among 100 measurements (21.00%) 1 (1.00%) low severe 4 (4.00%) low mild 16 (16.00%) high severe

decode 1048576 bytes, mask 7f: No change in performance detected.
       time:   [5.0371 ms 5.0487 ms 5.0618 ms]
       change: [−0.5165% −0.1114% +0.2906%] (p = 0.59 > 0.05)

Found 14 outliers among 100 measurements (14.00%) 14 (14.00%) high severe

decode 4096 bytes, mask 3f: No change in performance detected.
       time:   [8.2722 µs 8.3105 µs 8.3530 µs]
       change: [−0.2019% +0.2442% +0.7830%] (p = 0.32 > 0.05)

Found 19 outliers among 100 measurements (19.00%) 6 (6.00%) low mild 2 (2.00%) high mild 11 (11.00%) high severe

decode 1048576 bytes, mask 3f: No change in performance detected.
       time:   [1.5850 ms 1.5902 ms 1.5962 ms]
       change: [−0.6684% −0.1223% +0.4020%] (p = 0.66 > 0.05)

Found 9 outliers among 100 measurements (9.00%) 3 (3.00%) high mild 6 (6.00%) high severe

1000 streams of 1 bytes/multistream: No change in performance detected.
       time:   [33.203 ns 39.555 ns 51.878 ns]
       change: [+10.728% +32.674% +75.025%] (p = 0.06 > 0.05)

Found 3 outliers among 500 measurements (0.60%) 1 (0.20%) high mild 2 (0.40%) high severe

1000 streams of 1000 bytes/multistream: :broken_heart: Performance has regressed.
       time:   [34.055 ns 34.490 ns 34.929 ns]
       change: [+12.649% +14.534% +16.408%] (p = 0.00 < 0.05)

Found 1 outliers among 500 measurements (0.20%) 1 (0.20%) high severe

coalesce_acked_from_zero 1+1 entries: No change in performance detected.
       time:   [88.115 ns 88.448 ns 88.789 ns]
       change: [−0.4490% +0.5174% +1.7914%] (p = 0.49 > 0.05)

Found 11 outliers among 100 measurements (11.00%) 7 (7.00%) high mild 4 (4.00%) high severe

coalesce_acked_from_zero 3+1 entries: No change in performance detected.
       time:   [105.48 ns 105.73 ns 105.99 ns]
       change: [−0.8906% −0.3531% +0.1143%] (p = 0.18 > 0.05)

Found 8 outliers among 100 measurements (8.00%) 1 (1.00%) low mild 2 (2.00%) high mild 5 (5.00%) high severe

coalesce_acked_from_zero 10+1 entries: No change in performance detected.
       time:   [105.05 ns 105.38 ns 105.80 ns]
       change: [−0.2958% +0.2637% +0.8589%] (p = 0.38 > 0.05)

Found 21 outliers among 100 measurements (21.00%) 4 (4.00%) low severe 6 (6.00%) low mild 3 (3.00%) high mild 8 (8.00%) high severe

coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
       time:   [88.820 ns 88.971 ns 89.126 ns]
       change: [−0.7026% +0.2281% +1.1404%] (p = 0.65 > 0.05)

Found 8 outliers among 100 measurements (8.00%) 3 (3.00%) high mild 5 (5.00%) high severe

RxStreamOrderer::inbound_frame(): No change in performance detected.
       time:   [107.81 ms 107.97 ms 108.23 ms]
       change: [−0.4699% −0.1075% +0.2218%] (p = 0.59 > 0.05)

Found 10 outliers among 100 measurements (10.00%) 7 (7.00%) low mild 2 (2.00%) high mild 1 (1.00%) high severe

sent::Packets::take_ranges: No change in performance detected.
       time:   [8.0612 µs 8.2610 µs 8.4441 µs]
       change: [−0.7407% +5.8984% +17.161%] (p = 0.24 > 0.05)

Found 20 outliers among 100 measurements (20.00%) 4 (4.00%) low severe 11 (11.00%) low mild 4 (4.00%) high mild 1 (1.00%) high severe

transfer/pacing-false/varying-seeds: :broken_heart: Performance has regressed.
       time:   [37.072 ms 37.169 ms 37.279 ms]
       change: [+4.5101% +4.8981% +5.2577%] (p = 0.00 < 0.05)

Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe

transfer/pacing-true/varying-seeds: :broken_heart: Performance has regressed.
       time:   [37.692 ms 37.808 ms 37.931 ms]
       change: [+5.0829% +5.4940% +5.9561%] (p = 0.00 < 0.05)

Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe

transfer/pacing-false/same-seed: :broken_heart: Performance has regressed.
       time:   [36.999 ms 37.067 ms 37.140 ms]
       change: [+4.6033% +4.8770% +5.1647%] (p = 0.00 < 0.05)

Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe

transfer/pacing-true/same-seed: :broken_heart: Performance has regressed.
       time:   [38.372 ms 38.472 ms 38.576 ms]
       change: [+4.1365% +4.4851% +4.8031%] (p = 0.00 < 0.05)

Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) low mild 1 (1.00%) high mild 1 (1.00%) high severe

Client/server transfer results

Performance differences relative to 95f9bedb40bc852f5f62b611ad6b2fd22c636843.

Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.

| Client vs. server (params) | Mean ± σ | Min | Max | MiB/s ± σ | Δ main (ms) | Δ main (%) |
| --- | --- | --- | --- | --- | --- | --- |
| google vs. google | 451.8 ± 4.7 | 444.9 | 461.2 | 70.8 ± 6.8 | | |
| google vs. neqo (cubic, paced) | 268.6 ± 4.5 | 261.4 | 283.8 | 119.2 ± 7.1 | :green_heart: -49.9 | -15.7% |
| msquic vs. msquic | 133.0 ± 34.2 | 100.8 | 374.4 | 240.6 ± 0.9 | | |
| msquic vs. neqo (cubic, paced) | 145.8 ± 16.7 | 121.6 | 225.1 | 219.5 ± 1.9 | :green_heart: -125.9 | -46.3% |
| neqo vs. google (cubic, paced) | 751.4 ± 4.5 | 743.5 | 769.3 | 42.6 ± 7.1 | -0.5 | -0.1% |
| neqo vs. msquic (cubic, paced) | 155.6 ± 5.0 | 147.3 | 176.0 | 205.6 ± 6.4 | -0.6 | -0.4% |
| neqo vs. neqo (cubic) | 90.0 ± 4.7 | 78.9 | 105.0 | 355.7 ± 6.8 | :green_heart: -121.0 | -57.4% |
| neqo vs. neqo (cubic, paced) | 90.2 ± 4.0 | 82.7 | 99.1 | 354.7 ± 8.0 | :green_heart: -121.0 | -57.3% |
| neqo vs. neqo (reno) | 90.8 ± 5.2 | 80.3 | 108.5 | 352.5 ± 6.2 | :green_heart: -118.3 | -56.6% |
| neqo vs. neqo (reno, paced) | 93.2 ± 5.3 | 82.0 | 113.0 | 343.2 ± 6.0 | :green_heart: -116.8 | -55.6% |
| neqo vs. quiche (cubic, paced) | 191.7 ± 4.2 | 185.4 | 202.1 | 167.0 ± 7.6 | :broken_heart: 2.3 | 1.2% |
| neqo vs. s2n (cubic, paced) | 217.8 ± 4.6 | 210.3 | 225.9 | 146.9 ± 7.0 | 1.1 | 0.5% |
| quiche vs. neqo (cubic, paced) | 157.6 ± 5.8 | 146.1 | 183.5 | 203.1 ± 5.5 | :green_heart: -590.4 | -78.9% |
| quiche vs. quiche | 147.0 ± 4.9 | 137.7 | 164.8 | 217.6 ± 6.5 | | |
| s2n vs. neqo (cubic, paced) | 172.1 ± 5.0 | 161.3 | 183.3 | 186.0 ± 6.4 | :green_heart: -126.3 | -42.3% |
| s2n vs. s2n | 248.2 ± 27.7 | 230.3 | 345.1 | 128.9 ± 1.2 | | |

Download data for profiler.firefox.com or download performance comparison data.

github-actions[bot] avatar Apr 18 '25 13:04 github-actions[bot]

Optimized Upload only thus far.

1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.

   time:   [1.2891 s 1.2983 s 1.3077 s]
   thrpt:  [76.469 MiB/s 77.023 MiB/s 77.571 MiB/s]

change: time: [-32.828% -31.716% -30.597%] (p = 0.00 < 0.05) thrpt: [+44.086% +46.447% +48.872%]

Found 4 outliers among 100 measurements (4.00%) 4 (4.00%) high mild

:tada: matches https://github.com/mozilla/neqo/pull/2532#issuecomment-2758283036.

mxinden avatar Apr 18 '25 14:04 mxinden

Introduced the same optimizations to neqo-server. In addition, I removed the memory copy: each datagram of a GSO train is now written into a single contiguous Vec right away. The result looks promising.

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💚 Performance has improved.

   time:   [245.19 ms 245.65 ms 246.12 ms]
   thrpt:  [406.30 MiB/s 407.09 MiB/s 407.85 MiB/s]

change: time: [-66.225% -66.008% -65.788%] (p = 0.00 < 0.05) thrpt: [+192.30% +194.19% +196.08%]

mxinden avatar Apr 21 '25 16:04 mxinden

Why do we see a massive benefit in the client/server tests, but not in the transfer benches?

larseggert avatar Jun 13 '25 05:06 larseggert

@larseggert the neqo-transport/benches/transfer.rs benchmarks use the test-fixture/src/sim Simulator, and the Simulator only processes a single datagram at a time.

https://github.com/mozilla/neqo/blob/37c3aeebb79aef9f9649c54c3bbfae84fee523b3/test-fixture/src/sim/mod.rs#L206

Let me see whether I can change that as part of this pull request. After all, our benchmarks and tests should mirror how we run Neqo in Firefox as closely as possible.
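
For illustration only, a per-datagram consumer like the Simulator could be fed from a GSO batch by slicing the contiguous buffer back into segments. This assumes the batch is just a flat byte buffer plus a segment size; it is not the actual fixture code.

```rust
// Sketch only: split a contiguous GSO batch back into individual datagrams so
// a single-datagram consumer (such as the simulator) can process them one at
// a time. Every chunk has `segment_size` bytes except possibly the last.
fn deliver_per_datagram(batch: &[u8], segment_size: usize, mut handle: impl FnMut(&[u8])) {
    assert!(segment_size > 0, "segment size must be non-zero");
    for datagram in batch.chunks(segment_size) {
        handle(datagram);
    }
}
```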

mxinden avatar Jun 13 '25 07:06 mxinden

Codecov Report

Attention: Patch coverage is 95.93810% with 21 lines in your changes missing coverage. Please review.

Project coverage is 95.56%. Comparing base (d16866a) to head (79a3266).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2593      +/-   ##
==========================================
- Coverage   95.59%   95.56%   -0.03%     
==========================================
  Files         115      115              
  Lines       37712    37956     +244     
  Branches    37712    37956     +244     
==========================================
+ Hits        36049    36271     +222     
- Misses       1657     1680      +23     
+ Partials        6        5       -1     
| Components | Coverage | Δ |
| --- | --- | --- |
| neqo-common | 97.42% <92.72%> | (-0.38%) :arrow_down: |
| neqo-crypto | 90.49% <ø> | (ø) |
| neqo-http3 | 94.50% <100.00%> | (+0.01%) :arrow_up: |
| neqo-qpack | 96.28% <ø> | (ø) |
| neqo-transport | 96.53% <97.97%> | (-0.03%) :arrow_down: |
| neqo-udp | 90.53% <82.85%> | (-1.42%) :arrow_down: |

codecov[bot] avatar Jun 19 '25 07:06 codecov[bot]

This pull request is ready for review. Note that benchmark results are inaccurate due to https://github.com/mozilla/neqo/pull/2743.

mxinden avatar Jun 19 '25 11:06 mxinden

> This pull request is ready for review. Note that benchmark results are inaccurate due to #2743.

I'm merging #2743 now, so we can get fresh/correct bench data.

larseggert avatar Jun 19 '25 12:06 larseggert

Thanks for the quick review!

For the record, the macOS failure is due to different EMSGSIZE handling in quinn-udp (https://github.com/quinn-rs/quinn/pull/2199). Will push a simple patch.

mxinden avatar Jun 19 '25 14:06 mxinden

🐰 Bencher Report

Branch: gso-v3
Testbed: t-linux64-ms-280

🚨 6 Alerts

| Benchmark | Measure (Units) | Benchmark Result (Result Δ%) | Upper Boundary (Limit %) |
| --- | --- | --- | --- |
| coalesce_acked_from_zero 10+1 entries | Latency (ns) | 107.63 ns (+1.59%), Baseline: 105.94 ns | 106.97 ns (100.62%) |
| coalesce_acked_from_zero 1000+1 entries | Latency (ns) | 95.73 ns (+7.14%), Baseline: 89.35 ns | 92.15 ns (103.88%) |
| transfer/pacing-false/same-seed | Latency (ms) | 36.89 ms (+5.37%), Baseline: 35.01 ms | 36.81 ms (100.21%) |
| transfer/pacing-false/varying-seeds | Latency (ms) | 37.28 ms (+6.05%), Baseline: 35.15 ms | 36.99 ms (100.78%) |
| transfer/pacing-true/same-seed | Latency (ms) | 38.66 ms (+5.61%), Baseline: 36.61 ms | 38.32 ms (100.88%) |
| transfer/pacing-true/varying-seeds | Latency (ms) | 38.16 ms (+6.04%), Baseline: 35.98 ms | 37.73 ms (101.12%) |

Click to view all benchmark results
| Benchmark | Latency (Result Δ%) | Upper Boundary (Limit %) |
| --- | --- | --- |
| 1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client | 640,320,000.00 ns (-3.54%), Baseline: 663,820,375.00 ns | 728,934,998.14 ns (87.84%) |
| 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client | 200,420,000.00 ns (-68.10%), Baseline: 628,188,375.00 ns | 852,579,930.08 ns (23.51%) |
| 1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client | 27,471,000.00 ns (+1.07%), Baseline: 27,180,837.50 ns | 27,655,811.64 ns (99.33%) |
| 1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client | 309,930,000.00 ns (+1.66%), Baseline: 304,879,875.00 ns | 315,750,492.21 ns (98.16%) |
| 1000 streams of 1 bytes/multistream | 38.32 ns (+3.24%), Baseline: 37.11 ns | 53.66 ns (71.41%) |
| 1000 streams of 1000 bytes/multistream | 38.63 ns (+5.26%), Baseline: 36.70 ns | 53.25 ns (72.55%) |
| RxStreamOrderer::inbound_frame() | 109,970,000.00 ns (-0.47%), Baseline: 110,483,762.50 ns | 114,400,196.45 ns (96.13%) |
| coalesce_acked_from_zero 1+1 entries | 88.51 ns (-0.19%), Baseline: 88.68 ns | 89.29 ns (99.13%) |
| coalesce_acked_from_zero 10+1 entries 🚨 | 107.63 ns (+1.59%), Baseline: 105.94 ns | 106.97 ns (100.62%) |
| coalesce_acked_from_zero 1000+1 entries 🚨 | 95.73 ns (+7.14%), Baseline: 89.35 ns | 92.15 ns (103.88%) |
| coalesce_acked_from_zero 3+1 entries | 106.31 ns (-0.19%), Baseline: 106.51 ns | 107.36 ns (99.02%) |
| decode 1048576 bytes, mask 3f | 1,596,600.00 ns (-1.11%), Baseline: 1,614,482.50 ns | 1,757,188.61 ns (90.86%) |
| decode 1048576 bytes, mask 7f | 5,060,900.00 ns (-0.05%), Baseline: 5,063,310.00 ns | 5,089,343.37 ns (99.44%) |
| decode 1048576 bytes, mask ff | 3,031,600.00 ns (-0.12%), Baseline: 3,035,103.75 ns | 3,066,146.08 ns (98.87%) |
| decode 4096 bytes, mask 3f | 8,273.50 ns (+4.04%), Baseline: 7,952.18 ns | 10,113.99 ns (81.80%) |
| decode 4096 bytes, mask 7f | 20,017.00 ns (+0.46%), Baseline: 19,924.92 ns | 20,394.59 ns (98.15%) |
| decode 4096 bytes, mask ff | 11,841.00 ns (+0.20%), Baseline: 11,817.16 ns | 11,970.83 ns (98.92%) |
| sent::Packets::take_ranges | 8,447.90 ns (+0.24%), Baseline: 8,427.93 ns | 8,481.15 ns (99.61%) |
| transfer/pacing-false/same-seed 🚨 | 36,886,000.00 ns (+5.37%), Baseline: 35,006,825.00 ns | 36,809,827.13 ns (100.21%) |
| transfer/pacing-false/varying-seeds 🚨 | 37,278,000.00 ns (+6.05%), Baseline: 35,150,025.00 ns | 36,990,430.22 ns (100.78%) |
| transfer/pacing-true/same-seed 🚨 | 38,661,000.00 ns (+5.61%), Baseline: 36,608,950.00 ns | 38,323,633.10 ns (100.88%) |
| transfer/pacing-true/varying-seeds 🚨 | 38,156,000.00 ns (+6.04%), Baseline: 35,982,850.00 ns | 37,732,380.19 ns (101.12%) |

🐰 View full continuous benchmarking report in Bencher

github-actions[bot] avatar Jun 19 '25 16:06 github-actions[bot]

> For the record, the macOS failure is due to different EMSGSIZE handling in quinn-udp (quinn-rs/quinn#2199). Will push a simple patch.

Will be fixed in https://github.com/mozilla/neqo/pull/2746.

mxinden avatar Jun 19 '25 18:06 mxinden

🐰 Bencher Report

Branch: gso-v3
Testbed: t-linux64-ms-280
Click to view all benchmark results
| Benchmark | Latency (Result Δ%) | Upper Boundary (Limit %) |
| --- | --- | --- |
| s2n vs. neqo (cubic, paced) | 215.73 ms (-30.65%), Baseline: 311.08 ms | 349.46 ms (61.73%) |
🐰 View full continuous benchmarking report in Bencher

github-actions[bot] avatar Jun 19 '25 20:06 github-actions[bot]

@mxinden I fixed up the doc comments, but there are still Windows test failures.

larseggert avatar Jun 25 '25 08:06 larseggert

@mxinden tests::send_ignore_emsgsize still failing on Windows.

larseggert avatar Jun 27 '25 09:06 larseggert

🐰 Bencher Report

Branch: gso-v3
Testbed: t-linux64-ms-279
Click to view all benchmark results
| Benchmark | Latency (ns) |
| --- | --- |
| 1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client | 646,670,000.00 |
| 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client | 201,340,000.00 |
| 1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client | 27,380,000.00 |
| 1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client | 307,020,000.00 |
| 1000 streams of 1 bytes/multistream | 34.99 |
| 1000 streams of 1000 bytes/multistream | 35.03 |
| RxStreamOrderer::inbound_frame() | 110,960,000.00 |
| coalesce_acked_from_zero 1+1 entries | 88.31 |
| coalesce_acked_from_zero 10+1 entries | 105.52 |
| coalesce_acked_from_zero 1000+1 entries | 90.91 |
| coalesce_acked_from_zero 3+1 entries | 105.85 |
| decode 1048576 bytes, mask 3f | 1,590,700.00 |
| decode 1048576 bytes, mask 7f | 5,047,400.00 |
| decode 1048576 bytes, mask ff | 3,031,800.00 |
| decode 4096 bytes, mask 3f | 8,308.50 |
| decode 4096 bytes, mask 7f | 20,011.00 |
| decode 4096 bytes, mask ff | 11,832.00 |
| sent::Packets::take_ranges | 5,182.40 |
| transfer/pacing-false/same-seed | 36,846,000.00 |
| transfer/pacing-false/varying-seeds | 37,089,000.00 |
| transfer/pacing-true/same-seed | 38,620,000.00 |
| transfer/pacing-true/varying-seeds | 38,194,000.00 |
🐰 View full continuous benchmarking report in Bencher

github-actions[bot] avatar Jun 27 '25 09:06 github-actions[bot]

🐰 Bencher Report

Branch: gso-v3
Testbed: t-linux64-ms-278
Click to view all benchmark results
| Benchmark | Latency (ns) |
| --- | --- |
| 1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client | 654,100,000.00 |
| 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client | 202,480,000.00 |
| 1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client | 27,597,000.00 |
| 1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client | 305,830,000.00 |
| 1000 streams of 1 bytes/multistream | 39.55 |
| 1000 streams of 1000 bytes/multistream | 34.49 |
| RxStreamOrderer::inbound_frame() | 107,970,000.00 |
| coalesce_acked_from_zero 1+1 entries | 88.45 |
| coalesce_acked_from_zero 10+1 entries | 105.38 |
| coalesce_acked_from_zero 1000+1 entries | 88.97 |
| coalesce_acked_from_zero 3+1 entries | 105.73 |
| decode 1048576 bytes, mask 3f | 1,590,200.00 |
| decode 1048576 bytes, mask 7f | 5,048,700.00 |
| decode 1048576 bytes, mask ff | 3,032,300.00 |
| decode 4096 bytes, mask 3f | 8,310.50 |
| decode 4096 bytes, mask 7f | 20,023.00 |
| decode 4096 bytes, mask ff | 11,818.00 |
| sent::Packets::take_ranges | 8,261.00 |
| transfer/pacing-false/same-seed | 37,067,000.00 |
| transfer/pacing-false/varying-seeds | 37,169,000.00 |
| transfer/pacing-true/same-seed | 38,472,000.00 |
| transfer/pacing-true/varying-seeds | 37,808,000.00 |
🐰 View full continuous benchmarking report in Bencher

github-actions[bot] avatar Jun 30 '25 12:06 github-actions[bot]

@mxinden is this ready to merge?

larseggert avatar Jun 30 '25 13:06 larseggert

Yes, ready to merge from my end. We have a couple of benchmark regressions; here is an explainer for each:

1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.

   time:   [650.03 ms 655.28 ms 660.94 ms]
   thrpt:  [151.30 MiB/s 152.61 MiB/s 153.84 MiB/s]

change: time: [−27.566% −26.708% −25.736%] (p = 0.00 < 0.05) thrpt: [+34.655% +36.441% +38.056%]

This will improve even further with https://github.com/mozilla/neqo/pull/2734.

1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: 💔 Performance has regressed.

   time:   [27.404 ms 27.500 ms 27.616 ms]
   thrpt:  [36.211  elem/s 36.363  elem/s 36.491  elem/s]

change: time: [+1.5705% +2.1502% +2.7519%] (p = 0.00 < 0.05) thrpt: [−2.6782% −2.1049% −1.5463%]

This is expected. We pay a slight cost in latency when sending in batches.

1000 streams of 1000 bytes/multistream: 💔 Performance has regressed.

   time:   [36.454 ns 36.834 ns 37.215 ns]
   change: [+25.596% +27.533% +29.527%] (p = 0.00 < 0.05)

This should be due to neqo-http3/benches/streams.rs not using the batched IO paths. Instead of altering the IO handling in the benchmark, I suggest we do https://github.com/mozilla/neqo/issues/2728. Given that the benchmark measures stream performance and not UDP IO performance, I suggest doing this in a follow-up.

transfer/pacing-false/varying-seeds: 💔 Performance has regressed.

   time:   [36.886 ms 36.956 ms 37.027 ms]
   change: [+4.0332% +4.3753% +4.6740%] (p = 0.00 < 0.05)

Again, a slight regression, as the Simulator does not use the batched IO paths. The non-batched IO path (i.e. process) no longer pre-allocates, as we don't know the datagram size ahead of time. Once https://github.com/mozilla/neqo/pull/2747 is merged, this overhead should be reduced, as we would write datagrams into a long-lived buffer.

@larseggert let me know whether you are fine proceeding here, or would prefer any of the above to be addressed first.

mxinden avatar Jun 30 '25 14:06 mxinden

I'll merge now; please do issues for the missing bits?

Great we can land this!

larseggert avatar Jun 30 '25 16:06 larseggert

This keeps getting kicked out of the merge queue while tests are still running and haven't failed yet. I think GitHub may have issues. Doing a force merge.

larseggert avatar Jul 01 '25 08:07 larseggert

> please do issues for the missing bits?

I assume you are fine with the following pull requests and issue tracking the progress. Let me know if you want additional GitHub issues.

  • https://github.com/mozilla/neqo/pull/2734
  • https://github.com/mozilla/neqo/issues/2728
  • https://github.com/mozilla/neqo/pull/2747

mxinden avatar Jul 06 '25 17:07 mxinden

Early numbers on GSO in Firefox Nightly:

  • ~5% of sends on Linux and Windows use GSO with 2 or more segments
  • ~5% of sends on Linux and Windows send 2.4 k bytes or more
  • We currently limit the number of segments to 10, which is reflected in the metrics (apart from some outlier machine on Linux doing > 100)

Good signals. We should explore increasing the maximum number of segments (currently 10), maybe limiting it only by what our pacer allows us to send (rough sketch below).
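
As a rough sketch of that idea (names like `pacer_budget` are hypothetical, not neqo's actual pacer API), the segment cap could be derived from the pacer's current byte budget:

```rust
// Sketch: size the GSO batch by how many full segments the pacer would
// release right now, instead of a hard-coded segment count. Always allow at
// least one datagram so the connection keeps making progress.
fn gso_segments(pacer_budget: usize, segment_size: usize) -> usize {
    if segment_size == 0 {
        return 1;
    }
    (pacer_budget / segment_size).max(1)
}
```

Today's behavior would correspond to clamping the result with `.min(10)`.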

Datagram (batch) size

Windows

image

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_size_sent/explore?os=Windows&visiblePercentiles=%5B99%2C95%2C75%2C50%2C25%2C5%5D

Linux

image

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_size_sent/explore?os=Linux&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D

Number of segments in a batch

Windows

image

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_segments_sent/explore?os=Windows&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D

Linux

image

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_segments_sent/explore?os=Linux&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D

mxinden avatar Jul 22 '25 10:07 mxinden

Yes, let's increase.

larseggert avatar Jul 22 '25 11:07 larseggert