
perf: use GSO (attempt 3)

Open mxinden opened this issue 8 months ago • 4 comments

Attempt 1: https://github.com/mozilla/neqo/commit/f25b0b77ff579b56a4ea882a3ca70404b3c38b03
Attempt 2: https://github.com/mozilla/neqo/pull/2532/

Compared to attempt 2:

  • implements the datagram batching in neqo-transport instead of neqo-bin
  • does not copy each datagram into the larger GSO buffer after the fact, but instead writes each datagram into the GSO buffer right away (see the sketch below).
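
A minimal sketch of the write-in-place idea, assuming a hypothetical `GsoBatch` type rather than neqo's actual internals: each datagram is encoded straight into the tail of one contiguous buffer, and the batch records the segment size so the whole buffer can later be handed to the OS as a single GSO send.

```rust
/// Illustrative only; field and method names are hypothetical, not neqo's API.
struct GsoBatch {
    buf: Vec<u8>,        // all datagrams, back to back
    segment_size: usize, // size of every segment except possibly the last
    segments: usize,
}

impl GsoBatch {
    fn new(segment_size: usize, max_segments: usize) -> Self {
        Self {
            buf: Vec::with_capacity(segment_size * max_segments),
            segment_size,
            segments: 0,
        }
    }

    /// Let the caller encode the next datagram directly into the tail of the
    /// shared buffer, avoiding a per-datagram allocation and copy. Returns
    /// `true` if another segment may follow: GSO requires all segments except
    /// the last to be exactly `segment_size` bytes long.
    fn write_next(&mut self, encode: impl FnOnce(&mut Vec<u8>)) -> bool {
        let start = self.buf.len();
        encode(&mut self.buf);
        let written = self.buf.len() - start;
        debug_assert!(written <= self.segment_size);
        self.segments += 1;
        written == self.segment_size
    }
}
```

When the batch is flushed, `buf` plus `segment_size` describe the entire send to the UDP layer; a datagram shorter than `segment_size` simply terminates the batch.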

mxinden avatar Apr 18 '25 13:04 mxinden

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to 66be2e61c9e899e91a1c9b27b053de15c125731e.

neqo-latest as client

neqo-latest as server

All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

  • neqo-latest vs. aioquic: H DC LR :rocket:~~C20 M S~~ R :rocket:~~3~~ B :warning:U A :warning:L1 C1 :warning:C2 6 V2 :warning:BA :rocket:~~BP~~
  • neqo-latest vs. go-x-net: :rocket:~~H DC~~ LR :warning:U L2 6 :rocket:~~M B A C2~~
  • neqo-latest vs. lsquic: :warning:H DC :warning:C20 M S R Z 3 B :warning:A :rocket:~~U~~ L2 :warning:6 :rocket:~~C2~~ V2 :warning:BP BA :warning:CM
  • neqo-latest vs. mvfst: H :warning:DC LR M R :warning:Z 3 B :rocket:~~U~~ L2 :warning:BP BA :rocket:~~C2 6~~
  • neqo-latest vs. neqo-latest: :warning:H DC LR C20 :warning:M S 3 B U E :warning:L1 :rocket:~~L2~~ C2 :warning:BP CM :rocket:~~6~~
  • neqo-latest vs. nginx: :warning:H :rocket:~~LR~~ C20 :warning:M S :warning:R Z :rocket:~~3~~ B :warning:U :rocket:~~A L1~~ L2 C1 C2 6
  • neqo-latest vs. ngtcp2: H DC :warning:C20 M :warning:S R :warning:3 U E A :warning:L1 L2 C1 C2 6 V2 BP :warning:BA
  • neqo-latest vs. picoquic: :warning:H DC LR C20 M :rocket:~~S~~ R 3 U E :warning:L1 L2 C2 V2 BP BA
  • neqo-latest vs. quic-go: H DC C20 :warning:M R :rocket:~~S~~ Z :warning:3 B U L1 :rocket:~~L2~~ C1 :rocket:~~C2~~ 6 BP BA
  • neqo-latest vs. quiche: :warning:H DC :warning:LR C20 M :rocket:~~S~~ R Z :warning:3 B :warning:U A :rocket:~~L2~~ C1 C2 :warning:6
  • neqo-latest vs. quinn: H :rocket:~~DC~~ LR C20 :warning:M S R Z :rocket:~~3~~ B U :warning:E A L1 C1 :warning:C2 6 BP BA
  • neqo-latest vs. s2n-quic: H LR C20 M S :rocket:~~R~~ 3 B U E A L1 :rocket:~~L2 C1 C2~~ 6
  • neqo-latest vs. tquic: :rocket:~~DC~~ LR :warning:C20 M Z 3 :warning:U :rocket:~~B~~ L1 L2 C1 C2 6

neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

github-actions[bot] avatar Apr 18 '25 13:04 github-actions[bot]

Benchmark results

Performance differences relative to 95f9bedb40bc852f5f62b611ad6b2fd22c636843.

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: :green_heart: Performance has improved.
       time:   [202.13 ms 202.48 ms 202.85 ms]
       thrpt:  [492.97 MiB/s 493.86 MiB/s 494.74 MiB/s]
change:
       time:   [−69.170% −69.106% −69.043%] (p = 0.00 < 0.05)
       thrpt:  [… +223.69% +224.36%]

Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild

1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: Change within noise threshold.
       time:   [304.31 ms 305.83 ms 307.35 ms]
       thrpt:  [32.536 Kelem/s 32.698 Kelem/s 32.862 Kelem/s]
change:
       time:   [+0.6959% +1.3800% +2.0501%] (p = 0.00 < 0.05)
       thrpt:  [… −1.3612% −0.6911%]
1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: :broken_heart: Performance has regressed.
       time:   [27.525 ms 27.597 ms 27.673 ms]
       thrpt:  [36.136  elem/s 36.236  elem/s 36.331  elem/s]
change:
       time:   [+1.1068% +1.8128% +2.4725%] (p = 0.00 < 0.05)
       thrpt:  [… −1.7805% −1.0947%]

Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild

1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: :green_heart: Performance has improved.
       time:   [648.95 ms 654.10 ms 659.22 ms]
       thrpt:  [151.69 MiB/s 152.88 MiB/s 154.10 MiB/s]
change:
       time:   [−28.772% −27.945% −27.096%] (p = 0.00 < 0.05)
       thrpt:  [… +38.782% +40.394%]

Found 10 outliers among 100 measurements (10.00%) 4 (4.00%) low severe 4 (4.00%) low mild 2 (2.00%) high severe

decode 4096 bytes, mask ff: No change in performance detected.
       time:   [11.792 µs 11.818 µs 11.851 µs]
       change: [−0.7711% −0.1718% +0.3634%] (p = 0.57 > 0.05)

Found 15 outliers among 100 measurements (15.00%) 3 (3.00%) low severe 2 (2.00%) low mild 3 (3.00%) high mild 7 (7.00%) high severe

decode 1048576 bytes, mask ff: No change in performance detected.
       time:   [3.0229 ms 3.0323 ms 3.0435 ms]
       change: [−0.2404% +0.1966% +0.6361%] (p = 0.39 > 0.05)

Found 9 outliers among 100 measurements (9.00%) 9 (9.00%) high severe

decode 4096 bytes, mask 7f: No change in performance detected.
       time:   [19.968 µs 20.023 µs 20.082 µs]
       change: [−0.7959% −0.1788% +0.3899%] (p = 0.57 > 0.05)

Found 21 outliers among 100 measurements (21.00%) 1 (1.00%) low severe 4 (4.00%) low mild 16 (16.00%) high severe

decode 1048576 bytes, mask 7f: No change in performance detected.
       time:   [5.0371 ms 5.0487 ms 5.0618 ms]
       change: [−0.5165% −0.1114% +0.2906%] (p = 0.59 > 0.05)

Found 14 outliers among 100 measurements (14.00%) 14 (14.00%) high severe

decode 4096 bytes, mask 3f: No change in performance detected.
       time:   [8.2722 µs 8.3105 µs 8.3530 µs]
       change: [−0.2019% +0.2442% +0.7830%] (p = 0.32 > 0.05)

Found 19 outliers among 100 measurements (19.00%) 6 (6.00%) low mild 2 (2.00%) high mild 11 (11.00%) high severe

decode 1048576 bytes, mask 3f: No change in performance detected.
       time:   [1.5850 ms 1.5902 ms 1.5962 ms]
       change: [−0.6684% −0.1223% +0.4020%] (p = 0.66 > 0.05)

Found 9 outliers among 100 measurements (9.00%) 3 (3.00%) high mild 6 (6.00%) high severe

1000 streams of 1 bytes/multistream: No change in performance detected.
       time:   [33.203 ns 39.555 ns 51.878 ns]
       change: [+10.728% +32.674% +75.025%] (p = 0.06 > 0.05)

Found 3 outliers among 500 measurements (0.60%) 1 (0.20%) high mild 2 (0.40%) high severe

1000 streams of 1000 bytes/multistream: :broken_heart: Performance has regressed.
       time:   [34.055 ns 34.490 ns 34.929 ns]
       change: [+12.649% +14.534% +16.408%] (p = 0.00 < 0.05)

Found 1 outliers among 500 measurements (0.20%) 1 (0.20%) high severe

coalesce_acked_from_zero 1+1 entries: No change in performance detected.
       time:   [88.115 ns 88.448 ns 88.789 ns]
       change: [−0.4490% +0.5174% +1.7914%] (p = 0.49 > 0.05)

Found 11 outliers among 100 measurements (11.00%) 7 (7.00%) high mild 4 (4.00%) high severe

coalesce_acked_from_zero 3+1 entries: No change in performance detected.
       time:   [105.48 ns 105.73 ns 105.99 ns]
       change: [−0.8906% −0.3531% +0.1143%] (p = 0.18 > 0.05)

Found 8 outliers among 100 measurements (8.00%) 1 (1.00%) low mild 2 (2.00%) high mild 5 (5.00%) high severe

coalesce_acked_from_zero 10+1 entries: No change in performance detected.
       time:   [105.05 ns 105.38 ns 105.80 ns]
       change: [−0.2958% +0.2637% +0.8589%] (p = 0.38 > 0.05)

Found 21 outliers among 100 measurements (21.00%) 4 (4.00%) low severe 6 (6.00%) low mild 3 (3.00%) high mild 8 (8.00%) high severe

coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
       time:   [88.820 ns 88.971 ns 89.126 ns]
       change: [−0.7026% +0.2281% +1.1404%] (p = 0.65 > 0.05)

Found 8 outliers among 100 measurements (8.00%) 3 (3.00%) high mild 5 (5.00%) high severe

RxStreamOrderer::inbound_frame(): No change in performance detected.
       time:   [107.81 ms 107.97 ms 108.23 ms]
       change: [−0.4699% −0.1075% +0.2218%] (p = 0.59 > 0.05)

Found 10 outliers among 100 measurements (10.00%) 7 (7.00%) low mild 2 (2.00%) high mild 1 (1.00%) high severe

sent::Packets::take_ranges: No change in performance detected.
       time:   [8.0612 µs 8.2610 µs 8.4441 µs]
       change: [−0.7407% +5.8984% +17.161%] (p = 0.24 > 0.05)

Found 20 outliers among 100 measurements (20.00%) 4 (4.00%) low severe 11 (11.00%) low mild 4 (4.00%) high mild 1 (1.00%) high severe

transfer/pacing-false/varying-seeds: :broken_heart: Performance has regressed.
       time:   [37.072 ms 37.169 ms 37.279 ms]
       change: [+4.5101% +4.8981% +5.2577%] (p = 0.00 < 0.05)

Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe

transfer/pacing-true/varying-seeds: :broken_heart: Performance has regressed.
       time:   [37.692 ms 37.808 ms 37.931 ms]
       change: [+5.0829% +5.4940% +5.9561%] (p = 0.00 < 0.05)

Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe

transfer/pacing-false/same-seed: :broken_heart: Performance has regressed.
       time:   [36.999 ms 37.067 ms 37.140 ms]
       change: [+4.6033% +4.8770% +5.1647%] (p = 0.00 < 0.05)

Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe

transfer/pacing-true/same-seed: :broken_heart: Performance has regressed.
       time:   [38.372 ms 38.472 ms 38.576 ms]
       change: [+4.1365% +4.4851% +4.8031%] (p = 0.00 < 0.05)

Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) low mild 1 (1.00%) high mild 1 (1.00%) high severe

Client/server transfer results

Performance differences relative to 95f9bedb40bc852f5f62b611ad6b2fd22c636843.

Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.

| Client vs. server (params) | Mean ± σ | Min | Max | MiB/s ± σ | Δ main (ms) | Δ main (%) |
| --- | --- | --- | --- | --- | --- | --- |
| google vs. google | 451.8 ± 4.7 | 444.9 | 461.2 | 70.8 ± 6.8 | | |
| google vs. neqo (cubic, paced) | 268.6 ± 4.5 | 261.4 | 283.8 | 119.2 ± 7.1 | :green_heart: -49.9 | -15.7% |
| msquic vs. msquic | 133.0 ± 34.2 | 100.8 | 374.4 | 240.6 ± 0.9 | | |
| msquic vs. neqo (cubic, paced) | 145.8 ± 16.7 | 121.6 | 225.1 | 219.5 ± 1.9 | :green_heart: -125.9 | -46.3% |
| neqo vs. google (cubic, paced) | 751.4 ± 4.5 | 743.5 | 769.3 | 42.6 ± 7.1 | -0.5 | -0.1% |
| neqo vs. msquic (cubic, paced) | 155.6 ± 5.0 | 147.3 | 176.0 | 205.6 ± 6.4 | -0.6 | -0.4% |
| neqo vs. neqo (cubic) | 90.0 ± 4.7 | 78.9 | 105.0 | 355.7 ± 6.8 | :green_heart: -121.0 | -57.4% |
| neqo vs. neqo (cubic, paced) | 90.2 ± 4.0 | 82.7 | 99.1 | 354.7 ± 8.0 | :green_heart: -121.0 | -57.3% |
| neqo vs. neqo (reno) | 90.8 ± 5.2 | 80.3 | 108.5 | 352.5 ± 6.2 | :green_heart: -118.3 | -56.6% |
| neqo vs. neqo (reno, paced) | 93.2 ± 5.3 | 82.0 | 113.0 | 343.2 ± 6.0 | :green_heart: -116.8 | -55.6% |
| neqo vs. quiche (cubic, paced) | 191.7 ± 4.2 | 185.4 | 202.1 | 167.0 ± 7.6 | :broken_heart: 2.3 | 1.2% |
| neqo vs. s2n (cubic, paced) | 217.8 ± 4.6 | 210.3 | 225.9 | 146.9 ± 7.0 | 1.1 | 0.5% |
| quiche vs. neqo (cubic, paced) | 157.6 ± 5.8 | 146.1 | 183.5 | 203.1 ± 5.5 | :green_heart: -590.4 | -78.9% |
| quiche vs. quiche | 147.0 ± 4.9 | 137.7 | 164.8 | 217.6 ± 6.5 | | |
| s2n vs. neqo (cubic, paced) | 172.1 ± 5.0 | 161.3 | 183.3 | 186.0 ± 6.4 | :green_heart: -126.3 | -42.3% |
| s2n vs. s2n | 248.2 ± 27.7 | 230.3 | 345.1 | 128.9 ± 1.2 | | |

Download data for profiler.firefox.com or download performance comparison data.

github-actions[bot] avatar Apr 18 '25 13:04 github-actions[bot]

Optimized Upload only thus far.

1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.

   time:   [1.2891 s 1.2983 s 1.3077 s]
   thrpt:  [76.469 MiB/s 77.023 MiB/s 77.571 MiB/s]

change: time: [-32.828% -31.716% -30.597%] (p = 0.00 < 0.05) thrpt: [+44.086% +46.447% +48.872%]

Found 4 outliers among 100 measurements (4.00%) 4 (4.00%) high mild

:tada: matches https://github.com/mozilla/neqo/pull/2532#issuecomment-2758283036.

mxinden avatar Apr 18 '25 14:04 mxinden

Introduced the same optimizations to neqo-server. In addition, I removed the memory copy: each datagram of a GSO train is now written into a single contiguous Vec right away. The result looks promising.

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💚 Performance has improved.

   time:   [245.19 ms 245.65 ms 246.12 ms]
   thrpt:  [406.30 MiB/s 407.09 MiB/s 407.85 MiB/s]

change: time: [-66.225% -66.008% -65.788%] (p = 0.00 < 0.05) thrpt: [+192.30% +194.19% +196.08%]

mxinden avatar Apr 21 '25 16:04 mxinden

Why do we see a massive benefit in the client/server tests, but not in the transfer benches?

larseggert avatar Jun 13 '25 05:06 larseggert

@larseggert the neqo-transport/benches/transfer.rs benchmarks use the test-fixture/src/sim Simulator, and the Simulator only processes a single datagram at a time.

https://github.com/mozilla/neqo/blob/37c3aeebb79aef9f9649c54c3bbfae84fee523b3/test-fixture/src/sim/mod.rs#L206

Let me see whether I can change that as part of this pull request. After all, our benchmarks and tests should mirror how we run Neqo in Firefox as closely as possible.
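
For illustration only, a per-datagram consumer like the Simulator could be fed from a GSO batch by slicing the contiguous buffer back into segments. This assumes the batch is just a flat byte buffer plus a segment size; it is not the actual fixture code.

```rust
// Sketch only: split a contiguous GSO batch back into individual datagrams so
// a single-datagram consumer (such as the simulator) can process them one at
// a time. Every chunk has `segment_size` bytes except possibly the last.
fn deliver_per_datagram(batch: &[u8], segment_size: usize, mut handle: impl FnMut(&[u8])) {
    assert!(segment_size > 0, "segment size must be non-zero");
    for datagram in batch.chunks(segment_size) {
        handle(datagram);
    }
}
```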

mxinden avatar Jun 13 '25 07:06 mxinden

Codecov Report

Attention: Patch coverage is 95.93810% with 21 lines in your changes missing coverage. Please review.

Project coverage is 95.56%. Comparing base (d16866a) to head (79a3266).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2593      +/-   ##
==========================================
- Coverage   95.59%   95.56%   -0.03%     
==========================================
  Files         115      115              
  Lines       37712    37956     +244     
  Branches    37712    37956     +244     
==========================================
+ Hits        36049    36271     +222     
- Misses       1657     1680      +23     
+ Partials        6        5       -1     
| Components | Coverage | Δ |
| --- | --- | --- |
| neqo-common | 97.42% <92.72%> | (-0.38%) :arrow_down: |
| neqo-crypto | 90.49% <ø> | (ø) |
| neqo-http3 | 94.50% <100.00%> | (+0.01%) :arrow_up: |
| neqo-qpack | 96.28% <ø> | (ø) |
| neqo-transport | 96.53% <97.97%> | (-0.03%) :arrow_down: |
| neqo-udp | 90.53% <82.85%> | (-1.42%) :arrow_down: |

codecov[bot] avatar Jun 19 '25 07:06 codecov[bot]

This pull request is ready for review. Note that benchmark results are inaccurate due to https://github.com/mozilla/neqo/pull/2743.

mxinden avatar Jun 19 '25 11:06 mxinden

> This pull request is ready for review. Note that benchmark results are inaccurate due to #2743.

I'm merging #2743 now, so we can get fresh/correct bench data.

larseggert avatar Jun 19 '25 12:06 larseggert

Thanks for the quick review!

For the record, the macOS failure is due to different EMSGSIZE handling in quinn-udp (https://github.com/quinn-rs/quinn/pull/2199). Will push a simple patch.

mxinden avatar Jun 19 '25 14:06 mxinden

🐰 Bencher Report

Branch: gso-v3
Testbed: t-linux64-ms-280

🚨 6 Alerts

| Benchmark | Measure (Units) | Benchmark Result (Result Δ%) | Upper Boundary (Limit %) |
| --- | --- | --- | --- |
| coalesce_acked_from_zero 10+1 entries | Latency (ns) | 107.63 ns (+1.59%), Baseline: 105.94 ns | 106.97 ns (100.62%) |
| coalesce_acked_from_zero 1000+1 entries | Latency (ns) | 95.73 ns (+7.14%), Baseline: 89.35 ns | 92.15 ns (103.88%) |
| transfer/pacing-false/same-seed | Latency (ms) | 36.89 ms (+5.37%), Baseline: 35.01 ms | 36.81 ms (100.21%) |
| transfer/pacing-false/varying-seeds | Latency (ms) | 37.28 ms (+6.05%), Baseline: 35.15 ms | 36.99 ms (100.78%) |
| transfer/pacing-true/same-seed | Latency (ms) | 38.66 ms (+5.61%), Baseline: 36.61 ms | 38.32 ms (100.88%) |
| transfer/pacing-true/varying-seeds | Latency (ms) | 38.16 ms (+6.04%), Baseline: 35.98 ms | 37.73 ms (101.12%) |

Click to view all benchmark results
| Benchmark | Latency (Result Δ%) | Upper Boundary (Limit %) |
| --- | --- | --- |
| 1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client | 640,320,000.00 ns (-3.54%), Baseline: 663,820,375.00 ns | 728,934,998.14 ns (87.84%) |
| 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client | 200,420,000.00 ns (-68.10%), Baseline: 628,188,375.00 ns | 852,579,930.08 ns (23.51%) |
| 1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client | 27,471,000.00 ns (+1.07%), Baseline: 27,180,837.50 ns | 27,655,811.64 ns (99.33%) |
| 1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client | 309,930,000.00 ns (+1.66%), Baseline: 304,879,875.00 ns | 315,750,492.21 ns (98.16%) |
| 1000 streams of 1 bytes/multistream | 38.32 ns (+3.24%), Baseline: 37.11 ns | 53.66 ns (71.41%) |
| 1000 streams of 1000 bytes/multistream | 38.63 ns (+5.26%), Baseline: 36.70 ns | 53.25 ns (72.55%) |
| RxStreamOrderer::inbound_frame() | 109,970,000.00 ns (-0.47%), Baseline: 110,483,762.50 ns | 114,400,196.45 ns (96.13%) |
| coalesce_acked_from_zero 1+1 entries | 88.51 ns (-0.19%), Baseline: 88.68 ns | 89.29 ns (99.13%) |
| coalesce_acked_from_zero 10+1 entries 🚨 | 107.63 ns (+1.59%), Baseline: 105.94 ns | 106.97 ns (100.62%) |
| coalesce_acked_from_zero 1000+1 entries 🚨 | 95.73 ns (+7.14%), Baseline: 89.35 ns | 92.15 ns (103.88%) |
| coalesce_acked_from_zero 3+1 entries | 106.31 ns (-0.19%), Baseline: 106.51 ns | 107.36 ns (99.02%) |
| decode 1048576 bytes, mask 3f | 1,596,600.00 ns (-1.11%), Baseline: 1,614,482.50 ns | 1,757,188.61 ns (90.86%) |
| decode 1048576 bytes, mask 7f | 5,060,900.00 ns (-0.05%), Baseline: 5,063,310.00 ns | 5,089,343.37 ns (99.44%) |
| decode 1048576 bytes, mask ff | 3,031,600.00 ns (-0.12%), Baseline: 3,035,103.75 ns | 3,066,146.08 ns (98.87%) |
| decode 4096 bytes, mask 3f | 8,273.50 ns (+4.04%), Baseline: 7,952.18 ns | 10,113.99 ns (81.80%) |
| decode 4096 bytes, mask 7f | 20,017.00 ns (+0.46%), Baseline: 19,924.92 ns | 20,394.59 ns (98.15%) |
| decode 4096 bytes, mask ff | 11,841.00 ns (+0.20%), Baseline: 11,817.16 ns | 11,970.83 ns (98.92%) |
| sent::Packets::take_ranges | 8,447.90 ns (+0.24%), Baseline: 8,427.93 ns | 8,481.15 ns (99.61%) |
| transfer/pacing-false/same-seed 🚨 | 36,886,000.00 ns (+5.37%), Baseline: 35,006,825.00 ns | 36,809,827.13 ns (100.21%) |
| transfer/pacing-false/varying-seeds 🚨 | 37,278,000.00 ns (+6.05%), Baseline: 35,150,025.00 ns | 36,990,430.22 ns (100.78%) |
| transfer/pacing-true/same-seed 🚨 | 38,661,000.00 ns (+5.61%), Baseline: 36,608,950.00 ns | 38,323,633.10 ns (100.88%) |
| transfer/pacing-true/varying-seeds 🚨 | 38,156,000.00 ns (+6.04%), Baseline: 35,982,850.00 ns | 37,732,380.19 ns (101.12%) |

🐰 View full continuous benchmarking report in Bencher

github-actions[bot] avatar Jun 19 '25 16:06 github-actions[bot]

> For the record, the macOS failure is due to different EMSGSIZE handling in quinn-udp (quinn-rs/quinn#2199). Will push a simple patch.

Will be fixed in https://github.com/mozilla/neqo/pull/2746.

mxinden avatar Jun 19 '25 18:06 mxinden

🐰 Bencher Report

Branch: gso-v3
Testbed: t-linux64-ms-280
Click to view all benchmark results
| Benchmark | Latency (Result Δ%) | Upper Boundary (Limit %) |
| --- | --- | --- |
| s2n vs. neqo (cubic, paced) | 215.73 ms (-30.65%), Baseline: 311.08 ms | 349.46 ms (61.73%) |
🐰 View full continuous benchmarking report in Bencher

github-actions[bot] avatar Jun 19 '25 20:06 github-actions[bot]

@mxinden I fixed up the doc comments, but there are still Windows test failures.

larseggert avatar Jun 25 '25 08:06 larseggert

@mxinden tests::send_ignore_emsgsize still failing on Windows.

larseggert avatar Jun 27 '25 09:06 larseggert

🐰 Bencher Report

Branch: gso-v3
Testbed: t-linux64-ms-279
Click to view all benchmark results
| Benchmark | Latency (ns) |
| --- | --- |
| 1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client | 646,670,000.00 |
| 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client | 201,340,000.00 |
| 1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client | 27,380,000.00 |
| 1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client | 307,020,000.00 |
| 1000 streams of 1 bytes/multistream | 34.99 |
| 1000 streams of 1000 bytes/multistream | 35.03 |
| RxStreamOrderer::inbound_frame() | 110,960,000.00 |
| coalesce_acked_from_zero 1+1 entries | 88.31 |
| coalesce_acked_from_zero 10+1 entries | 105.52 |
| coalesce_acked_from_zero 1000+1 entries | 90.91 |
| coalesce_acked_from_zero 3+1 entries | 105.85 |
| decode 1048576 bytes, mask 3f | 1,590,700.00 |
| decode 1048576 bytes, mask 7f | 5,047,400.00 |
| decode 1048576 bytes, mask ff | 3,031,800.00 |
| decode 4096 bytes, mask 3f | 8,308.50 |
| decode 4096 bytes, mask 7f | 20,011.00 |
| decode 4096 bytes, mask ff | 11,832.00 |
| sent::Packets::take_ranges | 5,182.40 |
| transfer/pacing-false/same-seed | 36,846,000.00 |
| transfer/pacing-false/varying-seeds | 37,089,000.00 |
| transfer/pacing-true/same-seed | 38,620,000.00 |
| transfer/pacing-true/varying-seeds | 38,194,000.00 |
🐰 View full continuous benchmarking report in Bencher

github-actions[bot] avatar Jun 27 '25 09:06 github-actions[bot]

🐰 Bencher Report

Branch: gso-v3
Testbed: t-linux64-ms-278
Click to view all benchmark results
| Benchmark | Latency (ns) |
| --- | --- |
| 1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client | 654,100,000.00 |
| 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client | 202,480,000.00 |
| 1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client | 27,597,000.00 |
| 1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client | 305,830,000.00 |
| 1000 streams of 1 bytes/multistream | 39.55 |
| 1000 streams of 1000 bytes/multistream | 34.49 |
| RxStreamOrderer::inbound_frame() | 107,970,000.00 |
| coalesce_acked_from_zero 1+1 entries | 88.45 |
| coalesce_acked_from_zero 10+1 entries | 105.38 |
| coalesce_acked_from_zero 1000+1 entries | 88.97 |
| coalesce_acked_from_zero 3+1 entries | 105.73 |
| decode 1048576 bytes, mask 3f | 1,590,200.00 |
| decode 1048576 bytes, mask 7f | 5,048,700.00 |
| decode 1048576 bytes, mask ff | 3,032,300.00 |
| decode 4096 bytes, mask 3f | 8,310.50 |
| decode 4096 bytes, mask 7f | 20,023.00 |
| decode 4096 bytes, mask ff | 11,818.00 |
| sent::Packets::take_ranges | 8,261.00 |
| transfer/pacing-false/same-seed | 37,067,000.00 |
| transfer/pacing-false/varying-seeds | 37,169,000.00 |
| transfer/pacing-true/same-seed | 38,472,000.00 |
| transfer/pacing-true/varying-seeds | 37,808,000.00 |
🐰 View full continuous benchmarking report in Bencher

github-actions[bot] avatar Jun 30 '25 12:06 github-actions[bot]

@mxinden is this ready to merge?

larseggert avatar Jun 30 '25 13:06 larseggert

Yes, ready to merge from my end. We have a couple of benchmark regressions; here is an explainer for each:

1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.

   time:   [650.03 ms 655.28 ms 660.94 ms]
   thrpt:  [151.30 MiB/s 152.61 MiB/s 153.84 MiB/s]

change: time: [−27.566% −26.708% −25.736%] (p = 0.00 < 0.05) thrpt: [+34.655% +36.441% +38.056%]

This will improve even further with https://github.com/mozilla/neqo/pull/2734.

1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: 💔 Performance has regressed.

   time:   [27.404 ms 27.500 ms 27.616 ms]
   thrpt:  [36.211  elem/s 36.363  elem/s 36.491  elem/s]

change: time: [+1.5705% +2.1502% +2.7519%] (p = 0.00 < 0.05) thrpt: [−2.6782% −2.1049% −1.5463%]

This is expected. We pay a slight cost in latency when sending in batches.

1000 streams of 1000 bytes/multistream: 💔 Performance has regressed.

   time:   [36.454 ns 36.834 ns 37.215 ns]
   change: [+25.596% +27.533% +29.527%] (p = 0.00 < 0.05)

This should be due to neqo-http3/benches/streams.rs not using the batched IO paths. Instead of altering the IO handling in the benchmark, I suggest we do https://github.com/mozilla/neqo/issues/2728. Given that the benchmark measures stream performance and not UDP IO performance, I suggest doing this in a follow-up.

transfer/pacing-false/varying-seeds: 💔 Performance has regressed.

   time:   [36.886 ms 36.956 ms 37.027 ms]
   change: [+4.0332% +4.3753% +4.6740%] (p = 0.00 < 0.05)

Again, a slight regression, as the Simulator does not use the batched IO paths. The non-batched IO path (i.e. process) no longer pre-allocates, as we don't know the datagram size ahead of time. Once https://github.com/mozilla/neqo/pull/2747 is merged, this overhead should be reduced, as we would write datagrams into a long-lived buffer.

@larseggert let me know whether you are fine proceeding here, or would prefer any of the above to be addressed first.

mxinden avatar Jun 30 '25 14:06 mxinden

I'll merge now; please do issues for the missing bits?

Great we can land this!

larseggert avatar Jun 30 '25 16:06 larseggert

This keeps getting kicked out of the merge queue while tests are still running and haven't failed yet. I think GitHub may have issues. Doing a force merge.

larseggert avatar Jul 01 '25 08:07 larseggert

> please do issues for the missing bits?

I assume you are fine with the following pull requests and issue tracking the progress. Let me know if you want additional GitHub issues.

  • https://github.com/mozilla/neqo/pull/2734
  • https://github.com/mozilla/neqo/issues/2728
  • https://github.com/mozilla/neqo/pull/2747

mxinden avatar Jul 06 '25 17:07 mxinden

Early numbers on GSO in Firefox Nightly:

  • ~5% of sends on Linux and Windows use GSO with 2 or more segments
  • ~5% of sends on Linux and Windows send 2.4 k bytes or more
  • We currently limit the number of segments to 10, which is reflected in the metrics (apart from some outlier machine on Linux doing > 100)

Good signals. We should explore increasing the maximum number of segments (currently 10), maybe limiting it only by what our pacer allows us to send (rough sketch below).
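
As a rough sketch of that idea (names like `pacer_budget` are hypothetical, not neqo's actual pacer API), the segment cap could be derived from the pacer's current byte budget:

```rust
// Sketch: size the GSO batch by how many full segments the pacer would
// release right now, instead of a hard-coded segment count. Always allow at
// least one datagram so the connection keeps making progress.
fn gso_segments(pacer_budget: usize, segment_size: usize) -> usize {
    if segment_size == 0 {
        return 1;
    }
    (pacer_budget / segment_size).max(1)
}
```

Today's behavior would correspond to clamping the result with `.min(10)`.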

Datagram (batch) size

Windows

image

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_size_sent/explore?os=Windows&visiblePercentiles=%5B99%2C95%2C75%2C50%2C25%2C5%5D

Linux

image

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_size_sent/explore?os=Linux&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D

Number of segments in a batch

Windows

image

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_segments_sent/explore?os=Windows&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D

Linux

image

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_segments_sent/explore?os=Linux&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D

mxinden avatar Jul 22 '25 10:07 mxinden

Yes, let's increase.

larseggert avatar Jul 22 '25 11:07 larseggert