feat(udp): use sendmmsg and recvmmsg
Write and read up to BATCH_SIZE datagrams with a single sendmmsg or recvmmsg syscall.
Previously, neqo_bin::udp::Socket::send would call sendmmsg but pass only a single buffer, effectively degrading the call to sendmsg. The same applied to Socket::recv.
With this commit, Socket::send provides BATCH_SIZE buffers on each sendmmsg syscall, writing more than one datagram at a time when multiple are available. Socket::recv does the same for recvmmsg. A rough sketch of the batched read path follows below.
Part of https://github.com/mozilla/neqo/issues/1693.
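For illustration, here is a minimal sketch of what such a batched read can look like with libc::recvmmsg (Linux-only). The recv_batch helper, the BATCH_SIZE value, and the buffer handling are assumptions for the sketch, not the actual neqo_bin::udp code:

```rust
// Hedged sketch: batched reads via libc::recvmmsg on a std UdpSocket.
// Names, sizes and error handling are illustrative, not neqo_bin's code.
use std::{io, net::UdpSocket, os::fd::AsRawFd};

const BATCH_SIZE: usize = 32; // assumed batch size
const BUF_SIZE: usize = 65535; // large enough for any UDP datagram

/// Read up to BATCH_SIZE datagrams with a single recvmmsg(2) syscall and
/// return the length of each datagram received.
fn recv_batch(
    socket: &UdpSocket,
    bufs: &mut [[u8; BUF_SIZE]; BATCH_SIZE],
) -> io::Result<Vec<usize>> {
    // One iovec per receive buffer.
    let mut iovecs: Vec<libc::iovec> = bufs
        .iter_mut()
        .map(|buf| libc::iovec {
            iov_base: buf.as_mut_ptr().cast(),
            iov_len: buf.len(),
        })
        .collect();

    // One mmsghdr per buffer; the kernel fills in msg_len per datagram.
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut hdr: libc::mmsghdr = unsafe { std::mem::zeroed() };
            hdr.msg_hdr.msg_iov = iov as *mut libc::iovec;
            hdr.msg_hdr.msg_iovlen = 1;
            hdr
        })
        .collect();

    // SAFETY: `msgs`, `iovecs` and the buffers they reference outlive the call.
    let n = unsafe {
        libc::recvmmsg(
            socket.as_raw_fd(),
            msgs.as_mut_ptr(),
            msgs.len() as libc::c_uint,
            0,
            std::ptr::null_mut(), // no timeout
        )
    };
    if n < 0 {
        return Err(io::Error::last_os_error());
    }

    Ok(msgs[..n as usize]
        .iter()
        .map(|m| m.msg_len as usize)
        .collect())
}
```

The write path is symmetric via libc::sendmmsg; a sketch of that side appears further down in this thread.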
Benchmark results
Performance differences relative to 3e537097cbd6e75b861d0d859f9c5eb9c233a723.
- drain a timer quickly time: [434.66 ns 441.22 ns 447.54 ns] change: [-0.6964% +0.8250% +2.9072%] (p = 0.38 > 0.05) No change in performance detected.
- coalesce_acked_from_zero 1+1 entries time: [196.62 ns 197.11 ns 197.61 ns] change: [-0.8171% -0.4829% -0.1380%] (p = 0.01 < 0.05) Change within noise threshold.
- coalesce_acked_from_zero 3+1 entries time: [240.43 ns 241.00 ns 241.64 ns] change: [-0.5382% -0.0751% +0.3426%] (p = 0.76 > 0.05) No change in performance detected.
- coalesce_acked_from_zero 10+1 entries time: [239.00 ns 239.69 ns 240.53 ns] change: [-1.3511% -0.5261% +0.0766%] (p = 0.17 > 0.05) No change in performance detected.
- coalesce_acked_from_zero 1000+1 entries time: [216.68 ns 216.96 ns 217.27 ns] change: [-0.7531% -0.1853% +0.3917%] (p = 0.55 > 0.05) No change in performance detected.
- RxStreamOrderer::inbound_frame() time: [119.87 ms 119.94 ms 120.00 ms] change: [+0.4749% +0.6968% +0.8697%] (p = 0.00 < 0.05) Change within noise threshold.
- transfer/Run multiple transfers with varying seeds time: [118.08 ms 118.39 ms 118.71 ms] thrpt: [33.696 MiB/s 33.786 MiB/s 33.875 MiB/s] change: time: [-3.4361% -3.1033% -2.7586%] (p = 0.00 < 0.05) thrpt: [+2.8369% +3.2027% +3.5584%] Change within noise threshold.
- transfer/Run multiple transfers with the same seed time: [118.90 ms 119.07 ms 119.25 ms] thrpt: [33.542 MiB/s 33.594 MiB/s 33.642 MiB/s] change: time: [-2.7935% -2.5730% -2.3522%] (p = 0.00 < 0.05) thrpt: [+2.4089% +2.6410% +2.8738%] Change within noise threshold.
- 1-conn/1-100mb-resp (aka. Download)/client time: [1.0852 s 1.0898 s 1.0946 s] thrpt: [91.361 MiB/s 91.760 MiB/s 92.150 MiB/s] change: time: [-2.3395% -0.9694% +0.1106%] (p = 0.16 > 0.05) thrpt: [-0.1105% +0.9789% +2.3956%] No change in performance detected.
- 1-conn/10_000-parallel-1b-resp (aka. RPS)/client time: [542.73 ms 546.39 ms 550.03 ms] thrpt: [18.181 Kelem/s 18.302 Kelem/s 18.425 Kelem/s] change: time: [+26.586% +27.556% +28.610%] (p = 0.00 < 0.05) thrpt: [-22.245% -21.603% -21.002%] :broken_heart: Performance has regressed.
- 1-conn/1-1b-resp (aka. HPS)/client time: [52.299 ms 52.653 ms 53.052 ms] thrpt: [18.849 elem/s 18.992 elem/s 19.121 elem/s] change: time: [+2.6478% +3.6365% +4.7018%] (p = 0.00 < 0.05) thrpt: [-4.4906% -3.5089% -2.5795%] :broken_heart: Performance has regressed.
Client/server transfer results
Transfer of 134217728 bytes over loopback.
| Client | Server | CC | Pacing | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|---|---|---|
| msquic | msquic | | | 671.8 ± 293.0 | 379.2 | 1321.6 | 1.00 |
| neqo | msquic | reno | on | 908.9 ± 211.4 | 713.1 | 1268.5 | 1.00 |
| neqo | msquic | reno | | 873.0 ± 170.4 | 713.5 | 1221.6 | 1.00 |
| neqo | msquic | cubic | on | 958.8 ± 398.9 | 697.1 | 1869.5 | 1.00 |
| neqo | msquic | cubic | | 832.3 ± 127.4 | 710.1 | 1009.0 | 1.00 |
| msquic | neqo | reno | on | 3230.5 ± 230.0 | 2967.8 | 3778.0 | 1.00 |
| msquic | neqo | reno | | 3232.8 ± 200.1 | 2888.6 | 3582.1 | 1.00 |
| msquic | neqo | cubic | on | 3478.9 ± 224.7 | 3270.5 | 3986.4 | 1.00 |
| msquic | neqo | cubic | | 3262.9 ± 74.2 | 3161.2 | 3363.7 | 1.00 |
| neqo | neqo | reno | on | 2627.0 ± 131.2 | 2445.5 | 2849.8 | 1.00 |
| neqo | neqo | reno | | 2537.0 ± 227.2 | 2377.2 | 3146.6 | 1.00 |
| neqo | neqo | cubic | on | 3032.9 ± 272.3 | 2734.0 | 3479.1 | 1.00 |
| neqo | neqo | cubic | | 2626.5 ± 205.8 | 2403.5 | 3060.2 | 1.00 |
Codecov Report
Attention: Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.
Project coverage is 93.22%. Comparing base (3e53709) to head (38fca6b).
| Files | Patch % | Lines |
|---|---|---|
| neqo-http3/src/send_message.rs | 50.00% | 5 Missing :warning: |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #1741      +/-   ##
==========================================
- Coverage   93.26%   93.22%   -0.04%
==========================================
  Files         110      110
  Lines       35669    35680      +11
==========================================
- Hits        33266    33262       -4
- Misses       2403     2418      +15
```
Status update:
Thus far I am having a hard time finding clear signals that this is a performance improvement. Not saying it isn't; I am just missing proof.
Durations of client/server transfers on the benchmark machine vary significantly (30 s to 45 s) across CI runs. The same goes for the I/O syscall samples in perf traces.
Maybe it is worth hooking into something like criterion to get stable metrics and filter out outliers. I will investigate further. Suggestions welcome.
Remember that pacing interferes here. You might want to log how large the batches are that you are actually doing on TX/RX. I think @KershawChang had a feature to turn off pacing, which should increase those batches.
As discussed elsewhere, we likely have other bugs/bottlenecks in the code that limit the impact this change will have, too.
Thank you for the input.
> You might want to log how large the batches are that you are actually doing on TX/RX
Will do.
> Remember that pacing interferes here.
It is actually disabled by default, though by mistake. Thanks for the hint! See https://github.com/mozilla/neqo/pull/1753.
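As for logging batch sizes, here is a minimal sketch of the kind of instrumentation meant above; the BatchStats helper is hypothetical (not part of neqo) and uses the log crate's debug! macro:

```rust
// Hedged sketch: track how many datagrams each batched syscall moves, so
// TX/RX batch sizes show up in the logs. BatchStats is a hypothetical helper.
use log::debug;

#[derive(Default)]
struct BatchStats {
    calls: u64,
    datagrams: u64,
    max_batch: usize,
}

impl BatchStats {
    /// Record one sendmmsg/recvmmsg call that moved `n` datagrams.
    fn on_batch(&mut self, n: usize) {
        self.calls += 1;
        self.datagrams += n as u64;
        self.max_batch = self.max_batch.max(n);
        debug!(
            "batch of {n} datagrams (avg {:.2}, max {})",
            self.datagrams as f64 / self.calls as f64,
            self.max_batch
        );
    }
}
```

Calling on_batch(n) right after each sendmmsg/recvmmsg return would make it easy to see whether pacing keeps the TX batches small.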
Small update:
Solely using recvmmsg (this pull request) without sendmmsg or any buffer optimizations does not show any performance improvement. Quite the opposite: our benchmarks report regressions.
I hacked together a sendmmsg implementation (not yet pushed; a rough sketch of the idea follows after the numbers below). The results (sendmmsg + recvmmsg) look promising:
```
➜  neqo-bin git:(main) ✗ critcmp main sendmmsg
group                                         main                                  sendmmsg
-----                                         ----                                  --------
1-conn/1-100mb-resp (aka. Download)/client    1.09    541.4±5.05ms 184.7 MB/sec     1.00    494.7±213.29ms 202.1 MB/sec
```
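For context, a rough sketch of what such a batched write could look like with libc::sendmmsg (Linux-only). The send_batch helper below is an illustrative assumption, not the code on the unpushed branch:

```rust
// Hedged sketch: batched writes via libc::sendmmsg on a connected std
// UdpSocket. Illustrative only; the actual branch may differ.
use std::{io, net::UdpSocket, os::fd::AsRawFd};

/// Write up to `datagrams.len()` datagrams with a single sendmmsg(2)
/// syscall and return how many were actually sent.
fn send_batch(socket: &UdpSocket, datagrams: &[&[u8]]) -> io::Result<usize> {
    // One iovec per outgoing datagram.
    let mut iovecs: Vec<libc::iovec> = datagrams
        .iter()
        .map(|d| libc::iovec {
            iov_base: d.as_ptr() as *mut libc::c_void,
            iov_len: d.len(),
        })
        .collect();

    // One mmsghdr per datagram; msg_name stays null on a connected socket.
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut hdr: libc::mmsghdr = unsafe { std::mem::zeroed() };
            hdr.msg_hdr.msg_iov = iov as *mut libc::iovec;
            hdr.msg_hdr.msg_iovlen = 1;
            hdr
        })
        .collect();

    // SAFETY: `msgs`, `iovecs` and the datagram slices outlive the call.
    let sent = unsafe {
        libc::sendmmsg(
            socket.as_raw_fd(),
            msgs.as_mut_ptr(),
            msgs.len() as libc::c_uint,
            0,
        )
    };
    if sent < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(sent as usize)
}
```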
recvmmsg without sendmmsg will have no benefit, since without the latter batches will be very short.
- 1-conn/1-100mb-resp (aka. Download)/client time: [781.71 ms 787.17 ms 792.92 ms] thrpt: [126.12 MiB/s 127.04 MiB/s 127.93 MiB/s] change: time: [-30.837% -28.888% -26.932%] (p = 0.00 < 0.05) thrpt: [+36.858% +40.623% +44.586%] 💚 Performance has improved.
:tada: 40% improvement in throughput. First (?) time breaking 1 Gbit/s on CI.
- 1-conn/10_000-1b-seq-resp (aka. RPS)/client time: [446.15 ms 448.43 ms 450.77 ms] thrpt: [22.184 Kelem/s 22.300 Kelem/s 22.414 Kelem/s] change: time: [+14.829% +15.832% +16.877%] (p = 0.00 < 0.05) thrpt: [-14.440% -13.668% -12.914%] 💔 Performance has regressed.
- 100-seq-conn/1-1b-resp (aka. HPS)/client time: [3.6042 s 3.6064 s 3.6087 s] thrpt: [27.711 elem/s 27.728 elem/s 27.745 elem/s] change: time: [+6.1249% +6.2317% +6.3299%] (p = 0.00 < 0.05) thrpt: [-5.9531% -5.8662% -5.7714%] 💔 Performance has regressed.
Currently investigating why performance regresses here. It does not seem to be tied to BATCH_SIZE. Setting BATCH_SIZE to 1, effectively disabling sendmmsg and recvmmsg batching, does not improve the RPS benchmark:
```
➜  neqo-bin git:(recvmmsg) ✗ critcmp main recvmmsg-batch-32 recvmmsg-batch-1 -f "RPS"
group                                          main                                  recvmmsg-batch-1                      recvmmsg-batch-32
-----                                          ----                                  ----------------                      -----------------
1-conn/10_000-1b-seq-resp (aka. RPS)/client    1.00    122.8±16.81ms 79.5 KElem/sec  1.24    152.6±17.01ms 64.0 KElem/sec   1.24    152.6±23.00ms 64.0 KElem/sec
```
@mxinden is this OBE now?
Sorry for the delay here. I plan to extract parts of this pull request for https://github.com/mozilla/neqo/issues/1693. That said, there is no need for a draft pull request; I can just cherry-pick off of my branch.