perf: use GSO (attempt 3)
- Attempt 1: https://github.com/mozilla/neqo/commit/f25b0b77ff579b56a4ea882a3ca70404b3c38b03
- Attempt 2: https://github.com/mozilla/neqo/pull/2532/
Compared to attempt 2:
- implements the datagram batching in `neqo-transport` instead of `neqo-bin`
- does not copy each datagram into the larger GSO buffer, but instead writes each datagram into the GSO buffer right away (see the sketch below)
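To make the second point concrete, here is a minimal sketch of the batching idea. All names (`GsoBatch`, `append_with`) are hypothetical illustrations, not the actual `neqo-transport` API; the real code writes QUIC packets into the batch buffer via its packet builder and hands the buffer, together with the segment size, to a GSO-capable socket layer.

```rust
/// Hypothetical sketch of a GSO batch: datagrams are written back to back into one
/// contiguous buffer, so no per-datagram `Vec` is allocated and no copy into the GSO
/// buffer is needed afterwards. The kernel later splits the buffer at `segment_size`.
struct GsoBatch {
    buf: Vec<u8>,
    segment_size: usize,
    max_segments: usize,
}

impl GsoBatch {
    fn new(segment_size: usize, max_segments: usize) -> Self {
        Self {
            buf: Vec::with_capacity(segment_size * max_segments),
            segment_size,
            max_segments,
        }
    }

    /// True if another full-sized datagram still fits into the batch.
    fn has_room(&self) -> bool {
        self.buf.len() + self.segment_size <= self.segment_size * self.max_segments
    }

    /// The caller writes the next datagram directly onto the end of `buf`;
    /// only the last datagram of a train may be shorter than `segment_size`.
    fn append_with(&mut self, write_datagram: impl FnOnce(&mut Vec<u8>)) {
        let before = self.buf.len();
        write_datagram(&mut self.buf);
        debug_assert!(self.buf.len() - before <= self.segment_size);
    }
}

fn main() {
    let mut batch = GsoBatch::new(1500, 10);
    while batch.has_room() {
        // Stand-in for building a real QUIC packet directly into the buffer.
        batch.append_with(|buf| buf.extend_from_slice(&[0u8; 1500]));
    }
    // A real sender would now pass `batch.buf` and `batch.segment_size` to the
    // socket layer, which sets the segment size so the kernel splits the train.
    println!("sending {} bytes in segments of {}", batch.buf.len(), batch.segment_size);
}
```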
Failed Interop Tests
QUIC Interop Runner, client vs. server, differences relative to 66be2e61c9e899e91a1c9b27b053de15c125731e.
neqo-latest as client
- neqo-latest vs. aioquic: :rocket:~~C20 M S~~ Z :rocket:~~3~~ :warning:U L1 L2 :rocket:~~BP~~ :warning:C2 BA
- neqo-latest vs. go-x-net: :rocket:~~H DC M B A C2~~ :warning:U L2 6 BP BA
- neqo-latest vs. haproxy: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6 V2 BP BA
- neqo-latest vs. kwik: :warning:H DC LR C20 M S R :warning:Z 3 B U :warning:A L1 :warning:L2 C1 :warning:C2 6 V2 BP BA
- neqo-latest vs. linuxquic: H DC :warning:LR C20 M S :warning:R Z 3 B U E A L1 L2 C1 :warning:C2 6 V2 :warning:BP BA :warning:CM
- neqo-latest vs. lsquic: :warning:H LR :rocket:~~U~~ :warning:C20 E :warning:A L1 C1 :rocket:~~C2~~ :warning:6 BP CM
- neqo-latest vs. msquic: H :warning:DC LR C20 M S :warning:R Z B :warning:U A L1 :warning:L2 C1 C2 :warning:6 V2 BP BA
- neqo-latest vs. mvfst: :rocket:~~U~~ :warning:DC Z 3 A L1 C1 :rocket:~~C2 6~~ :warning:BP BA
- neqo-latest vs. neqo: run cancelled after 20 min
- neqo-latest vs. neqo-latest: :warning:H DC LR M S R Z :warning:3 A :rocket:~~L2~~ :warning:L1 C1 :rocket:~~6~~ V2 :warning:BP BA :warning:CM
- neqo-latest vs. nginx: :warning:H DC :rocket:~~LR 3 A L1~~ :warning:M R U BP BA
- neqo-latest vs. ngtcp2: LR :warning:C20 S Z :warning:3 B :warning:L1 BA CM
- neqo-latest vs. picoquic: :rocket:~~S~~ :warning:H DC Z B A :warning:L1 C1 6
- neqo-latest vs. quic-go: LR :rocket:~~S~~ :warning:M R 3 B A :rocket:~~L2 C2~~
- neqo-latest vs. quiche: :rocket:~~S~~ :warning:H LR C20 3 U A L1 :rocket:~~L2~~ :warning:6 BP BA
- neqo-latest vs. quinn: :rocket:~~DC 3~~ :warning:M S R E L2 :warning:C2
- neqo-latest vs. s2n-quic: DC :rocket:~~R L2 C1 C2~~ BP BA CM
- neqo-latest vs. tquic: H :rocket:~~DC~~ :warning:C20 S R :rocket:~~B~~ :warning:U A BP BA
- neqo-latest vs. xquic: H :warning:DC LR C20 :warning:M R Z 3 B U A L1 :warning:L2 C1 :warning:C2 6 BP :warning:BA
neqo-latest as server
- aioquic vs. neqo-latest: run cancelled after 20 min
- go-x-net vs. neqo-latest: :rocket:~~H~~ :warning:L2 CM
- kwik vs. neqo-latest: :rocket:~~H DC S B L1 L2 C2 V2~~ :warning:C20 3 6 BP BA :warning:CM
- linuxquic vs. neqo-latest: run cancelled after 20 min
- lsquic vs. neqo-latest: :rocket:~~H C2~~ :warning:DC V2
- msquic vs. neqo-latest: :rocket:~~B~~ :warning:S Z U :warning:L1 C1 V2 :rocket:~~CM~~
- mvfst vs. neqo-latest: :rocket:~~H DC~~ :warning:LR Z A L1 :warning:L2 C1 :warning:CM
- neqo vs. neqo-latest: run cancelled after 20 min
- ngtcp2 vs. neqo-latest: :rocket:~~3 6 V2~~ :warning:LR C20 B L2
- openssl vs. neqo-latest: LR C20 M :rocket:~~S R 3~~ :warning:B A :warning:BP CM
- picoquic vs. neqo-latest: run cancelled after 20 min
- quic-go vs. neqo-latest: :rocket:~~LR~~ :warning:3 B C1 6 CM
- quiche vs. neqo-latest: :rocket:~~M B~~ :warning:R Z L2 C1 :rocket:~~6~~ :warning:BP BA CM
- quinn vs. neqo-latest: :rocket:~~H DC M U L2~~ :warning:B L1 C1 V2 :warning:BA CM
- s2n-quic vs. neqo-latest: :warning:H DC LR M S R 3 B A L2 6 BA CM
- tquic vs. neqo-latest: run cancelled after 20 min
- xquic vs. neqo-latest: run cancelled after 20 min
All results
Succeeded Interop Tests
QUIC Interop Runner, client vs. server
neqo-latest as client
- neqo-latest vs. aioquic: H DC LR :rocket:~~C20 M S~~ R :rocket:~~3~~ B :warning:U A :warning:L1 C1 :warning:C2 6 V2 :warning:BA :rocket:~~BP~~
- neqo-latest vs. go-x-net: :rocket:~~H DC~~ LR :warning:U L2 6 :rocket:~~M B A C2~~
- neqo-latest vs. lsquic: :warning:H DC :warning:C20 M S R Z 3 B :warning:A :rocket:~~U~~ L2 :warning:6 :rocket:~~C2~~ V2 :warning:BP BA :warning:CM
- neqo-latest vs. mvfst: H :warning:DC LR M R :warning:Z 3 B :rocket:~~U~~ L2 :warning:BP BA :rocket:~~C2 6~~
- neqo-latest vs. neqo-latest: :warning:H DC LR C20 :warning:M S 3 B U E :warning:L1 :rocket:~~L2~~ C2 :warning:BP CM :rocket:~~6~~
- neqo-latest vs. nginx: :warning:H :rocket:~~LR~~ C20 :warning:M S :warning:R Z :rocket:~~3~~ B :warning:U :rocket:~~A L1~~ L2 C1 C2 6
- neqo-latest vs. ngtcp2: H DC :warning:C20 M :warning:S R :warning:3 U E A :warning:L1 L2 C1 C2 6 V2 BP :warning:BA
- neqo-latest vs. picoquic: :warning:H DC LR C20 M :rocket:~~S~~ R 3 U E :warning:L1 L2 C2 V2 BP BA
- neqo-latest vs. quic-go: H DC C20 :warning:M R :rocket:~~S~~ Z :warning:3 B U L1 :rocket:~~L2~~ C1 :rocket:~~C2~~ 6 BP BA
- neqo-latest vs. quiche: :warning:H DC :warning:LR C20 M :rocket:~~S~~ R Z :warning:3 B :warning:U A :rocket:~~L2~~ C1 C2 :warning:6
- neqo-latest vs. quinn: H :rocket:~~DC~~ LR C20 :warning:M S R Z :rocket:~~3~~ B U :warning:E A L1 C1 :warning:C2 6 BP BA
- neqo-latest vs. s2n-quic: H LR C20 M S :rocket:~~R~~ 3 B U E A L1 :rocket:~~L2 C1 C2~~ 6
- neqo-latest vs. tquic: :rocket:~~DC~~ LR :warning:C20 M Z 3 :warning:U :rocket:~~B~~ L1 L2 C1 C2 6
neqo-latest as server
- chrome vs. neqo-latest: 3
- go-x-net vs. neqo-latest: :rocket:~~H~~ DC LR M B :warning:A L2 :rocket:~~U~~ C2 6 BP
- kwik vs. neqo-latest: :rocket:~~H DC~~ LR :warning:C20 M :rocket:~~S~~ R Z :warning:3 U :rocket:~~B~~ A :rocket:~~L1 L2~~ C1 :warning:6 :rocket:~~C2 V2~~
- lsquic vs. neqo-latest: :warning:DC :rocket:~~H~~ LR M S R 3 B E A L1 L2 C1 :rocket:~~C2~~ 6 :warning:V2 BP :warning:BA CM
- msquic vs. neqo-latest: H DC LR C20 M :warning:S Z L1 :rocket:~~R B A~~ L2 :warning:C1 C2 6 :warning:BA
- mvfst vs. neqo-latest: :warning:LR :rocket:~~H DC M~~ 3 B :warning:L2 C2 6 :rocket:~~BP~~ BA
- ngtcp2 vs. neqo-latest: H DC :warning:LR C20 M S R Z :warning:B U :rocket:~~3~~ E :rocket:~~A~~ L1 :warning:L2 C1 C2 :warning:BA :rocket:~~6 V2 BP~~ CM
- openssl vs. neqo-latest: H DC :warning:B :rocket:~~S R 3~~ L2 C2 6 :warning:BP BA
- quic-go vs. neqo-latest: H DC :rocket:~~LR~~ C20 M S R Z :warning:3 B U A L1 L2 :warning:C1 C2 :warning:6 BP BA
- quiche vs. neqo-latest: H DC LR :rocket:~~M~~ S :warning:R Z 3 :rocket:~~B~~ A L1 :warning:L2 C2 :warning:BP BA :rocket:~~6~~
- quinn vs. neqo-latest: :rocket:~~H DC~~ LR C20 :rocket:~~M~~ S R Z 3 :warning:B :rocket:~~U~~ E A :warning:L1 C1 :rocket:~~L2~~ C2 6 BP :warning:BA
- s2n-quic vs. neqo-latest: :rocket:~~E L1 C1 C2 BP~~
Unsupported Interop Tests
QUIC Interop Runner, client vs. server
neqo-latest as client
- neqo-latest vs. aioquic: E CM
- neqo-latest vs. go-x-net: C20 S R Z 3 E L1 C1 V2 CM
- neqo-latest vs. haproxy: E CM
- neqo-latest vs. kwik: E CM
- neqo-latest vs. msquic: 3 E CM
- neqo-latest vs. mvfst: C20 S E V2 CM
- neqo-latest vs. nginx: E V2 CM
- neqo-latest vs. picoquic: CM
- neqo-latest vs. quic-go: E V2 CM
- neqo-latest vs. quiche: E V2 CM
- neqo-latest vs. quinn: V2 CM
- neqo-latest vs. s2n-quic: Z V2
- neqo-latest vs. tquic: E V2 CM
- neqo-latest vs. xquic: S E V2 CM
neqo-latest as server
- chrome vs. neqo-latest: H DC LR C20 M S R Z B U E A L1 L2 C1 C2 6 V2 BP BA CM
- go-x-net vs. neqo-latest: C20 S R Z 3 U E A L1 C1 V2 BA CM
- kwik vs. neqo-latest: U E CM
- lsquic vs. neqo-latest: C20 Z U BA
- msquic vs. neqo-latest: R 3 E A BP BA CM
- mvfst vs. neqo-latest: C20 M S R U E V2 BP CM
- ngtcp2 vs. neqo-latest: A BP U BA
- openssl vs. neqo-latest: Z U E L1 C1 V2
- quic-go vs. neqo-latest: E V2
- quiche vs. neqo-latest: C20 U E V2
- s2n-quic vs. neqo-latest: C20 Z U V2
Benchmark results
Performance differences relative to 95f9bedb40bc852f5f62b611ad6b2fd22c636843.
1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: :green_heart: Performance has improved.
time: [202.13 ms 202.48 ms 202.85 ms]
thrpt: [492.97 MiB/s 493.86 MiB/s 494.74 MiB/s]
change:
time: [−69.170% −69.106% −69.043%] (p = 0.00 < 0.05)
thrpt: [… +223.69% +224.36%]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: Change within noise threshold.
time: [304.31 ms 305.83 ms 307.35 ms]
thrpt: [32.536 Kelem/s 32.698 Kelem/s 32.862 Kelem/s]
change:
time: [+0.6959% +1.3800% +2.0501%] (p = 0.00 < 0.05)
thrpt: [… −1.3612% −0.6911%]
1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: :broken_heart: Performance has regressed.
time: [27.525 ms 27.597 ms 27.673 ms]
thrpt: [36.136 elem/s 36.236 elem/s 36.331 elem/s]
change:
time: [+1.1068% +1.8128% +2.4725%] (p = 0.00 < 0.05)
thrpt: [… −1.7805% −1.0947%]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: :green_heart: Performance has improved.
time: [648.95 ms 654.10 ms 659.22 ms]
thrpt: [151.69 MiB/s 152.88 MiB/s 154.10 MiB/s]
change:
time: [−28.772% −27.945% −27.096%] (p = 0.00 < 0.05)
thrpt: [… +38.782% +40.394%]
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) low severe
4 (4.00%) low mild
2 (2.00%) high severe
decode 4096 bytes, mask ff: No change in performance detected.
time: [11.792 µs 11.818 µs 11.851 µs]
change: [−0.7711% −0.1718% +0.3634%] (p = 0.57 > 0.05)
Found 15 outliers among 100 measurements (15.00%)
3 (3.00%) low severe
2 (2.00%) low mild
3 (3.00%) high mild
7 (7.00%) high severe
decode 1048576 bytes, mask ff: No change in performance detected.
time: [3.0229 ms 3.0323 ms 3.0435 ms]
change: [−0.2404% +0.1966% +0.6361%] (p = 0.39 > 0.05)
Found 9 outliers among 100 measurements (9.00%)
9 (9.00%) high severe
decode 4096 bytes, mask 7f: No change in performance detected.
time: [19.968 µs 20.023 µs 20.082 µs]
change: [−0.7959% −0.1788% +0.3899%] (p = 0.57 > 0.05)
Found 21 outliers among 100 measurements (21.00%)
1 (1.00%) low severe
4 (4.00%) low mild
16 (16.00%) high severe
decode 1048576 bytes, mask 7f: No change in performance detected.
time: [5.0371 ms 5.0487 ms 5.0618 ms]
change: [−0.5165% −0.1114% +0.2906%] (p = 0.59 > 0.05)
Found 14 outliers among 100 measurements (14.00%)
14 (14.00%) high severe
decode 4096 bytes, mask 3f: No change in performance detected.
time: [8.2722 µs 8.3105 µs 8.3530 µs]
change: [−0.2019% +0.2442% +0.7830%] (p = 0.32 > 0.05)
Found 19 outliers among 100 measurements (19.00%)
6 (6.00%) low mild
2 (2.00%) high mild
11 (11.00%) high severe
decode 1048576 bytes, mask 3f: No change in performance detected.
time: [1.5850 ms 1.5902 ms 1.5962 ms]
change: [−0.6684% −0.1223% +0.4020%] (p = 0.66 > 0.05)
Found 9 outliers among 100 measurements (9.00%)
3 (3.00%) high mild
6 (6.00%) high severe
1000 streams of 1 bytes/multistream: No change in performance detected.
time: [33.203 ns 39.555 ns 51.878 ns]
change: [+10.728% +32.674% +75.025%] (p = 0.06 > 0.05)
Found 3 outliers among 500 measurements (0.60%)
1 (0.20%) high mild
2 (0.40%) high severe
1000 streams of 1000 bytes/multistream: :broken_heart: Performance has regressed.
time: [34.055 ns 34.490 ns 34.929 ns]
change: [+12.649% +14.534% +16.408%] (p = 0.00 < 0.05)
Found 1 outliers among 500 measurements (0.20%)
1 (0.20%) high severe
coalesce_acked_from_zero 1+1 entries: No change in performance detected.
time: [88.115 ns 88.448 ns 88.789 ns]
change: [−0.4490% +0.5174% +1.7914%] (p = 0.49 > 0.05)
Found 11 outliers among 100 measurements (11.00%)
7 (7.00%) high mild
4 (4.00%) high severe
coalesce_acked_from_zero 3+1 entries: No change in performance detected.
time: [105.48 ns 105.73 ns 105.99 ns]
change: [−0.8906% −0.3531% +0.1143%] (p = 0.18 > 0.05)
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low mild
2 (2.00%) high mild
5 (5.00%) high severe
coalesce_acked_from_zero 10+1 entries: No change in performance detected.
time: [105.05 ns 105.38 ns 105.80 ns]
change: [−0.2958% +0.2637% +0.8589%] (p = 0.38 > 0.05)
Found 21 outliers among 100 measurements (21.00%)
4 (4.00%) low severe
6 (6.00%) low mild
3 (3.00%) high mild
8 (8.00%) high severe
coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
time: [88.820 ns 88.971 ns 89.126 ns]
change: [−0.7026% +0.2281% +1.1404%] (p = 0.65 > 0.05)
Found 8 outliers among 100 measurements (8.00%)
3 (3.00%) high mild
5 (5.00%) high severe
RxStreamOrderer::inbound_frame(): No change in performance detected.
time: [107.81 ms 107.97 ms 108.23 ms]
change: [−0.4699% −0.1075% +0.2218%] (p = 0.59 > 0.05)
Found 10 outliers among 100 measurements (10.00%)
7 (7.00%) low mild
2 (2.00%) high mild
1 (1.00%) high severe
sent::Packets::take_ranges: No change in performance detected.
time: [8.0612 µs 8.2610 µs 8.4441 µs]
change: [−0.7407% +5.8984% +17.161%] (p = 0.24 > 0.05)
Found 20 outliers among 100 measurements (20.00%)
4 (4.00%) low severe
11 (11.00%) low mild
4 (4.00%) high mild
1 (1.00%) high severe
transfer/pacing-false/varying-seeds: :broken_heart: Performance has regressed.
time: [37.072 ms 37.169 ms 37.279 ms]
change: [+4.5101% +4.8981% +5.2577%] (p = 0.00 < 0.05)
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
transfer/pacing-true/varying-seeds: :broken_heart: Performance has regressed.
time: [37.692 ms 37.808 ms 37.931 ms]
change: [+5.0829% +5.4940% +5.9561%] (p = 0.00 < 0.05)
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
transfer/pacing-false/same-seed: :broken_heart: Performance has regressed.
time: [36.999 ms 37.067 ms 37.140 ms]
change: [+4.6033% +4.8770% +5.1647%] (p = 0.00 < 0.05)
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
transfer/pacing-true/same-seed: :broken_heart: Performance has regressed.
time: [38.372 ms 38.472 ms 38.576 ms]
change: [+4.1365% +4.4851% +4.8031%] (p = 0.00 < 0.05)
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
Client/server transfer results
Performance differences relative to 95f9bedb40bc852f5f62b611ad6b2fd22c636843.
Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.
| Client vs. server (params) | Mean ± σ | Min | Max | MiB/s ± σ | Δ main | Δ main % |
|---|---|---|---|---|---|---|
| google vs. google | 451.8 ± 4.7 | 444.9 | 461.2 | 70.8 ± 6.8 | ||
| google vs. neqo (cubic, paced) | 268.6 ± 4.5 | 261.4 | 283.8 | 119.2 ± 7.1 | :green_heart: -49.9 | -15.7% |
| msquic vs. msquic | 133.0 ± 34.2 | 100.8 | 374.4 | 240.6 ± 0.9 | ||
| msquic vs. neqo (cubic, paced) | 145.8 ± 16.7 | 121.6 | 225.1 | 219.5 ± 1.9 | :green_heart: -125.9 | -46.3% |
| neqo vs. google (cubic, paced) | 751.4 ± 4.5 | 743.5 | 769.3 | 42.6 ± 7.1 | -0.5 | -0.1% |
| neqo vs. msquic (cubic, paced) | 155.6 ± 5.0 | 147.3 | 176.0 | 205.6 ± 6.4 | -0.6 | -0.4% |
| neqo vs. neqo (cubic) | 90.0 ± 4.7 | 78.9 | 105.0 | 355.7 ± 6.8 | :green_heart: -121.0 | -57.4% |
| neqo vs. neqo (cubic, paced) | 90.2 ± 4.0 | 82.7 | 99.1 | 354.7 ± 8.0 | :green_heart: -121.0 | -57.3% |
| neqo vs. neqo (reno) | 90.8 ± 5.2 | 80.3 | 108.5 | 352.5 ± 6.2 | :green_heart: -118.3 | -56.6% |
| neqo vs. neqo (reno, paced) | 93.2 ± 5.3 | 82.0 | 113.0 | 343.2 ± 6.0 | :green_heart: -116.8 | -55.6% |
| neqo vs. quiche (cubic, paced) | 191.7 ± 4.2 | 185.4 | 202.1 | 167.0 ± 7.6 | :broken_heart: 2.3 | 1.2% |
| neqo vs. s2n (cubic, paced) | 217.8 ± 4.6 | 210.3 | 225.9 | 146.9 ± 7.0 | 1.1 | 0.5% |
| quiche vs. neqo (cubic, paced) | 157.6 ± 5.8 | 146.1 | 183.5 | 203.1 ± 5.5 | :green_heart: -590.4 | -78.9% |
| quiche vs. quiche | 147.0 ± 4.9 | 137.7 | 164.8 | 217.6 ± 6.5 | ||
| s2n vs. neqo (cubic, paced) | 172.1 ± 5.0 | 161.3 | 183.3 | 186.0 ± 6.4 | :green_heart: -126.3 | -42.3% |
| s2n vs. s2n | 248.2 ± 27.7 | 230.3 | 345.1 | 128.9 ± 1.2 |
Download data for profiler.firefox.com or download performance comparison data.
Only the Upload path has been optimized thus far.
1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.
time: [1.2891 s 1.2983 s 1.3077 s]
thrpt: [76.469 MiB/s 77.023 MiB/s 77.571 MiB/s]
change:
time: [-32.828% -31.716% -30.597%] (p = 0.00 < 0.05)
thrpt: [+44.086% +46.447% +48.872%]
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
:tada: matches https://github.com/mozilla/neqo/pull/2532#issuecomment-2758283036.
Introduced the same optimizations to `neqo-server`. In addition, I removed the memory copy; each datagram of a GSO train is now written into a single contiguous `Vec` right away. The results look promising.
1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💚 Performance has improved.
time: [245.19 ms 245.65 ms 246.12 ms]
thrpt: [406.30 MiB/s 407.09 MiB/s 407.85 MiB/s]
change:
time: [-66.225% -66.008% -65.788%] (p = 0.00 < 0.05)
thrpt: [+192.30% +194.19% +196.08%]
Why do we see a massive benefit in the client/server tests, but not in the transfer benches?
@larseggert the `neqo-transport/benches/transfer.rs` benchmarks use the `test-fixture/src/sim` Simulator. The Simulator only processes a single datagram at a time.
https://github.com/mozilla/neqo/blob/37c3aeebb79aef9f9649c54c3bbfae84fee523b3/test-fixture/src/sim/mod.rs#L206
Let me see whether I can change that as part of this pull request. After all, our benchmarks and tests should mirror how we run Neqo in Firefox as closely as possible.
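To illustrate the difference, here is a hedged sketch with made-up `Conn`/`Datagram` stand-ins (not the actual fixture or `neqo-transport` types): per-datagram delivery, roughly what the Simulator does today, versus handing a whole batch over at once, which is closer to how `neqo-bin` and Firefox feed Neqo when GRO/GSO is in play.

```rust
use std::time::Instant;

// Hypothetical stand-ins for the fixture's types; only the delivery pattern matters here.
struct Datagram(Vec<u8>);
struct Conn {
    input_calls: usize,
    bytes: usize,
}

impl Conn {
    /// One input call per datagram: per-call overhead is paid for every packet.
    fn process_input(&mut self, d: &Datagram, _now: Instant) {
        self.input_calls += 1;
        self.bytes += d.0.len();
    }

    /// One input call per batch: timer checks, socket bookkeeping and output
    /// scheduling are paid once for the whole GRO/GSO train.
    fn process_batch(&mut self, batch: &[Datagram], _now: Instant) {
        self.input_calls += 1;
        self.bytes += batch.iter().map(|d| d.0.len()).sum::<usize>();
    }
}

fn main() {
    let batch: Vec<Datagram> = (0..10).map(|_| Datagram(vec![0u8; 1500])).collect();

    let mut single = Conn { input_calls: 0, bytes: 0 };
    for d in &batch {
        single.process_input(d, Instant::now());
    }

    let mut batched = Conn { input_calls: 0, bytes: 0 };
    batched.process_batch(&batch, Instant::now());

    println!(
        "per-datagram: {} calls, batched: {} calls",
        single.input_calls, batched.input_calls
    );
}
```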
Codecov Report
Attention: Patch coverage is 95.93810% with 21 lines in your changes missing coverage. Please review.
Project coverage is 95.56%. Comparing base (d16866a) to head (79a3266).
Additional details and impacted files
@@ Coverage Diff @@
## main #2593 +/- ##
==========================================
- Coverage 95.59% 95.56% -0.03%
==========================================
Files 115 115
Lines 37712 37956 +244
Branches 37712 37956 +244
==========================================
+ Hits 36049 36271 +222
- Misses 1657 1680 +23
+ Partials 6 5 -1
| Components | Coverage Δ | |
|---|---|---|
| neqo-common | 97.42% <92.72%> (-0.38%) | :arrow_down: |
| neqo-crypto | 90.49% <ø> (ø) | |
| neqo-http3 | 94.50% <100.00%> (+0.01%) | :arrow_up: |
| neqo-qpack | 96.28% <ø> (ø) | |
| neqo-transport | 96.53% <97.97%> (-0.03%) | :arrow_down: |
| neqo-udp | 90.53% <82.85%> (-1.42%) | :arrow_down: |
This pull request is ready for review. Note that benchmark results are inaccurate due to #2743.
I'm merging #2743 now, so we can get fresh/correct bench data.
Thanks for the quick review!
For the record, the macOS failure is due to different `EMSGSIZE` handling in `quinn-udp` (https://github.com/quinn-rs/quinn/pull/2199). Will push a simple patch.
Bencher Report
| Branch | gso-v3 |
| Testbed | t-linux64-ms-280 |
🚨 6 Alerts
| Benchmark | Measure Units | View | Benchmark Result (Result Δ%) | Upper Boundary (Limit %) |
|---|---|---|---|---|
| coalesce_acked_from_zero 10+1 entries | Latency nanoseconds (ns) | 📈 plot 🚷 threshold 🚨 alert (🔔) | 107.63 ns (+1.59%), Baseline: 105.94 ns | 106.97 ns (100.62%) |
| coalesce_acked_from_zero 1000+1 entries | Latency nanoseconds (ns) | 📈 plot 🚷 threshold 🚨 alert (🔔) | 95.73 ns (+7.14%), Baseline: 89.35 ns | 92.15 ns (103.88%) |
| transfer/pacing-false/same-seed | Latency milliseconds (ms) | 📈 plot 🚷 threshold 🚨 alert (🔔) | 36.89 ms (+5.37%), Baseline: 35.01 ms | 36.81 ms (100.21%) |
| transfer/pacing-false/varying-seeds | Latency milliseconds (ms) | 📈 plot 🚷 threshold 🚨 alert (🔔) | 37.28 ms (+6.05%), Baseline: 35.15 ms | 36.99 ms (100.78%) |
| transfer/pacing-true/same-seed | Latency milliseconds (ms) | 📈 plot 🚷 threshold 🚨 alert (🔔) | 38.66 ms (+5.61%), Baseline: 36.61 ms | 38.32 ms (100.88%) |
| transfer/pacing-true/varying-seeds | Latency milliseconds (ms) | 📈 plot 🚷 threshold 🚨 alert (🔔) | 38.16 ms (+6.04%), Baseline: 35.98 ms | 37.73 ms (101.12%) |
Click to view all benchmark results
| Benchmark | Latency | Benchmark Result nanoseconds (ns) (Result Δ%) | Upper Boundary nanoseconds (ns) (Limit %) |
|---|---|---|---|
| 1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client | 📈 view plot 🚷 view threshold | 640,320,000.00 ns (-3.54%), Baseline: 663,820,375.00 ns | 728,934,998.14 ns (87.84%) |
| 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client | 📈 view plot 🚷 view threshold | 200,420,000.00 ns (-68.10%), Baseline: 628,188,375.00 ns | 852,579,930.08 ns (23.51%) |
| 1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client | 📈 view plot 🚷 view threshold | 27,471,000.00 ns (+1.07%), Baseline: 27,180,837.50 ns | 27,655,811.64 ns (99.33%) |
| 1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client | 📈 view plot 🚷 view threshold | 309,930,000.00 ns (+1.66%), Baseline: 304,879,875.00 ns | 315,750,492.21 ns (98.16%) |
| 1000 streams of 1 bytes/multistream | 📈 view plot 🚷 view threshold | 38.32 ns (+3.24%), Baseline: 37.11 ns | 53.66 ns (71.41%) |
| 1000 streams of 1000 bytes/multistream | 📈 view plot 🚷 view threshold | 38.63 ns (+5.26%), Baseline: 36.70 ns | 53.25 ns (72.55%) |
| RxStreamOrderer::inbound_frame() | 📈 view plot 🚷 view threshold | 109,970,000.00 ns (-0.47%), Baseline: 110,483,762.50 ns | 114,400,196.45 ns (96.13%) |
| coalesce_acked_from_zero 1+1 entries | 📈 view plot 🚷 view threshold | 88.51 ns (-0.19%), Baseline: 88.68 ns | 89.29 ns (99.13%) |
| coalesce_acked_from_zero 10+1 entries | 📈 view plot 🚷 view threshold 🚨 view alert (🔔) | 107.63 ns (+1.59%), Baseline: 105.94 ns | 106.97 ns (100.62%) |
| coalesce_acked_from_zero 1000+1 entries | 📈 view plot 🚷 view threshold 🚨 view alert (🔔) | 95.73 ns (+7.14%), Baseline: 89.35 ns | 92.15 ns (103.88%) |
| coalesce_acked_from_zero 3+1 entries | 📈 view plot 🚷 view threshold | 106.31 ns (-0.19%), Baseline: 106.51 ns | 107.36 ns (99.02%) |
| decode 1048576 bytes, mask 3f | 📈 view plot 🚷 view threshold | 1,596,600.00 ns (-1.11%), Baseline: 1,614,482.50 ns | 1,757,188.61 ns (90.86%) |
| decode 1048576 bytes, mask 7f | 📈 view plot 🚷 view threshold | 5,060,900.00 ns (-0.05%), Baseline: 5,063,310.00 ns | 5,089,343.37 ns (99.44%) |
| decode 1048576 bytes, mask ff | 📈 view plot 🚷 view threshold | 3,031,600.00 ns (-0.12%), Baseline: 3,035,103.75 ns | 3,066,146.08 ns (98.87%) |
| decode 4096 bytes, mask 3f | 📈 view plot 🚷 view threshold | 8,273.50 ns (+4.04%), Baseline: 7,952.18 ns | 10,113.99 ns (81.80%) |
| decode 4096 bytes, mask 7f | 📈 view plot 🚷 view threshold | 20,017.00 ns (+0.46%), Baseline: 19,924.92 ns | 20,394.59 ns (98.15%) |
| decode 4096 bytes, mask ff | 📈 view plot 🚷 view threshold | 11,841.00 ns (+0.20%), Baseline: 11,817.16 ns | 11,970.83 ns (98.92%) |
| sent::Packets::take_ranges | 📈 view plot 🚷 view threshold | 8,447.90 ns (+0.24%), Baseline: 8,427.93 ns | 8,481.15 ns (99.61%) |
| transfer/pacing-false/same-seed | 📈 view plot 🚷 view threshold 🚨 view alert (🔔) | 36,886,000.00 ns (+5.37%), Baseline: 35,006,825.00 ns | 36,809,827.13 ns (100.21%) |
| transfer/pacing-false/varying-seeds | 📈 view plot 🚷 view threshold 🚨 view alert (🔔) | 37,278,000.00 ns (+6.05%), Baseline: 35,150,025.00 ns | 36,990,430.22 ns (100.78%) |
| transfer/pacing-true/same-seed | 📈 view plot 🚷 view threshold 🚨 view alert (🔔) | 38,661,000.00 ns (+5.61%), Baseline: 36,608,950.00 ns | 38,323,633.10 ns (100.88%) |
| transfer/pacing-true/varying-seeds | 📈 view plot 🚷 view threshold 🚨 view alert (🔔) | 38,156,000.00 ns (+6.04%), Baseline: 35,982,850.00 ns | 37,732,380.19 ns (101.12%) |
> For the record, the macOS failure is due to different `EMSGSIZE` handling in `quinn-udp` (quinn-rs/quinn#2199). Will push a simple patch.
Will be fixed in https://github.com/mozilla/neqo/pull/2746.
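For context, the point of the `EMSGSIZE` handling is that an oversized send should not be treated as fatal. Below is a minimal sketch of that idea, assuming a Unix-like platform that reports the condition via `EMSGSIZE` (it uses the `libc` crate's constant); this is an illustration, not the `quinn-udp` or neqo code.

```rust
use std::io;

/// Returns true if the error is the OS telling us the datagram was too large.
fn is_emsgsize(err: &io::Error) -> bool {
    err.raw_os_error() == Some(libc::EMSGSIZE)
}

/// Treat EMSGSIZE as a soft failure: report zero bytes sent so the caller can
/// shrink its MTU/segment-size estimate and rebuild the packet, instead of
/// tearing down the connection. `send` is a hypothetical send callback.
fn send_ignoring_emsgsize(
    send: impl Fn(&[u8]) -> io::Result<usize>,
    datagram: &[u8],
) -> io::Result<usize> {
    match send(datagram) {
        Err(e) if is_emsgsize(&e) => Ok(0),
        other => other,
    }
}
```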
Bencher Report
| Branch | gso-v3 |
| Testbed | t-linux64-ms-280 |
Click to view all benchmark results
| Benchmark | Latency | Benchmark Result milliseconds (ms) (Result Δ%) | Upper Boundary milliseconds (ms) (Limit %) |
|---|---|---|---|
| s2n vs. neqo (cubic, paced) | 📈 view plot 🚷 view threshold | 215.73 ms(-30.65%)Baseline: 311.08 ms | 349.46 ms (61.73%) |
@mxinden I fixed up the doc comments, but there are still Windows test failures.
@mxinden `tests::send_ignore_emsgsize` is still failing on Windows.
Bencher Report
| Branch | gso-v3 |
| Testbed | t-linux64-ms-279 |
Click to view all benchmark results
| Benchmark | Latency | milliseconds (ms) |
|---|---|---|
| s2n vs. neqo (cubic, paced) | 📈 view plot 🚷 view threshold | 210.06 ms |
Bencher Report
| Branch | gso-v3 |
| Testbed | t-linux64-ms-278 |
Click to view all benchmark results
| Benchmark | Latency | milliseconds (ms) |
|---|---|---|
| s2n vs. neqo (cubic, paced) | 📈 view plot 🚷 view threshold | 172.07 ms |
@mxinden is this ready to merge?
Yes, ready to merge from my end. We have a couple of benchmark regressions; here is an explainer for each:
1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.
time: [650.03 ms 655.28 ms 660.94 ms]
thrpt: [151.30 MiB/s 152.61 MiB/s 153.84 MiB/s]
change:
time: [−27.566% −26.708% −25.736%] (p = 0.00 < 0.05)
thrpt: [+34.655% +36.441% +38.056%]
This will improve even further with https://github.com/mozilla/neqo/pull/2734.
1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: 💔 Performance has regressed.
time: [27.404 ms 27.500 ms 27.616 ms]
thrpt: [36.211 elem/s 36.363 elem/s 36.491 elem/s]
change:
time: [+1.5705% +2.1502% +2.7519%] (p = 0.00 < 0.05)
thrpt: [−2.6782% −2.1049% −1.5463%]
This is expected. We pay a slight cost in latency when sending in batches.
1000 streams of 1000 bytes/multistream: 💔 Performance has regressed.
time: [36.454 ns 36.834 ns 37.215 ns]
change: [+25.596% +27.533% +29.527%] (p = 0.00 < 0.05)
This is likely because `neqo-http3/benches/streams.rs` does not use the batched IO paths. Instead of altering the IO handling in the benchmark, I suggest we do https://github.com/mozilla/neqo/issues/2728. Given that the benchmark measures stream performance and not UDP IO performance, I suggest doing this in a follow-up.
transfer/pacing-false/varying-seeds: 💔 Performance has regressed.
time: [36.886 ms 36.956 ms 37.027 ms]
change: [+4.0332% +4.3753% +4.6740%] (p = 0.00 < 0.05)
Again, a slight regression, because the Simulator does not use the batched IO paths. The non-batched IO path (i.e. `process`) no longer pre-allocates, as we don't know the datagram size ahead of time. Once https://github.com/mozilla/neqo/pull/2747 is merged, this overhead should be reduced, as we would write datagrams into a long-lived buffer.
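A rough sketch of the long-lived-buffer idea referenced above (hypothetical types, not the code from that pull request): the sender keeps one output buffer alive across calls and writes each datagram into it, so steady-state sends stop allocating per datagram even though the datagram size is not known up front.

```rust
/// Hypothetical writer reusing one long-lived output buffer across datagrams.
struct DatagramWriter {
    out: Vec<u8>, // capacity is retained between calls
}

impl DatagramWriter {
    fn new() -> Self {
        Self { out: Vec::with_capacity(65_535) }
    }

    /// Clears the buffer (keeping its capacity) and writes the next datagram into it.
    /// After warm-up this allocates nothing, regardless of the datagram's size.
    fn write_datagram(&mut self, payload: &[u8]) -> &[u8] {
        self.out.clear();
        self.out.extend_from_slice(payload);
        &self.out
    }
}

fn main() {
    let mut w = DatagramWriter::new();
    for size in [1200, 1350, 1500] {
        let d = w.write_datagram(&vec![0u8; size]);
        println!("wrote {} bytes without a fresh allocation", d.len());
    }
}
```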
@larseggert let me know whether you are fine proceeding here, or would prefer any of the above to be addressed first.
I'll merge now; please do issues for the missing bits?
Great we can land this!
This keeps getting kicked out of the merge queue while tests are still running and haven't failed yet. I think GitHub may be having issues. Doing a force merge.
> please do issues for the missing bits?
I assume you are fine with the following pull requests and issue tracking the progress. Let me know if you want additional GitHub issues.
- https://github.com/mozilla/neqo/pull/2734
- https://github.com/mozilla/neqo/issues/2728
- https://github.com/mozilla/neqo/pull/2747
Early numbers on GSO in Firefox Nightly:
- ~5% of sends on Linux and Windows use GSO with 2 or more segments
- ~5% of sends on Linux and Windows send 2.4 kB or more
- We currently limit the number of segments to 10, which is reflected in the metrics (apart from some crazy machine on Linux doing > 100)
Good signals. We should explore increasing the maximum number of segments (currently 10). Maybe just limit it by what our pacer allows us to send; a sketch of that idea follows below.
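A small sketch of deriving the batch size from the pacer instead of a fixed cap. The inputs are hypothetical (`pacer_budget_bytes` would come from neqo's pacer, `segment_size` from the current MTU), and the hard cap stands in for an upper bound such as Linux's `UDP_MAX_SEGMENTS` of 64.

```rust
/// Derive the GSO batch size from the pacer budget instead of a fixed cap of 10.
fn max_gso_segments(pacer_budget_bytes: usize, segment_size: usize, hard_cap: usize) -> usize {
    if segment_size == 0 {
        return 1;
    }
    // At least one segment so a send always makes progress, at most what the
    // pacer allows right now, and never above the kernel/driver hard cap.
    (pacer_budget_bytes / segment_size).clamp(1, hard_cap)
}

fn main() {
    // With 1500-byte segments and a 24,000-byte pacer budget, the batch is 16 segments.
    assert_eq!(max_gso_segments(24_000, 1500, 64), 16);
    // A tiny budget still sends one segment; a huge budget is capped.
    assert_eq!(max_gso_segments(100, 1500, 64), 1);
    assert_eq!(max_gso_segments(10_000_000, 1500, 64), 64);
    println!("ok");
}
```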
Datagram (batch) size
Windows
https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_size_sent/explore?os=Windows&visiblePercentiles=%5B99%2C95%2C75%2C50%2C25%2C5%5D
Linux
https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_size_sent/explore?os=Linux&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D
Number of segments in a batch
Windows
https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_segments_sent/explore?os=Windows&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D
Linux
https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_segments_sent/explore?os=Linux&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D
Yes, let's increase.