feat(udp): use sendmmsg and recvmmsg
Write and read up to BATCH_SIZE datagrams with a single sendmmsg or recvmmsg syscall.
Previously, neqo_bin::udp::Socket::send would call sendmmsg but pass only a single buffer, effectively degrading the call to sendmsg. The same applied to Socket::recv.
With this commit, Socket::send provides BATCH_SIZE buffers on each sendmmsg syscall, writing more than one datagram at a time when multiple are available. Socket::recv does the same for recvmmsg. A rough sketch of the batched read path follows below.
Part of https://github.com/mozilla/neqo/issues/1693.
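For illustration, here is a minimal sketch of what such a batched read can look like with libc::recvmmsg (Linux-only). The recv_batch helper, the BATCH_SIZE value, and the buffer handling are assumptions for the sketch, not the actual neqo_bin::udp code:

```rust
// Hedged sketch: batched reads via libc::recvmmsg on a std UdpSocket.
// Names, sizes and error handling are illustrative, not neqo_bin's code.
use std::{io, net::UdpSocket, os::fd::AsRawFd};

const BATCH_SIZE: usize = 32; // assumed batch size
const BUF_SIZE: usize = 65535; // large enough for any UDP datagram

/// Read up to BATCH_SIZE datagrams with a single recvmmsg(2) syscall and
/// return the length of each datagram received.
fn recv_batch(
    socket: &UdpSocket,
    bufs: &mut [[u8; BUF_SIZE]; BATCH_SIZE],
) -> io::Result<Vec<usize>> {
    // One iovec per receive buffer.
    let mut iovecs: Vec<libc::iovec> = bufs
        .iter_mut()
        .map(|buf| libc::iovec {
            iov_base: buf.as_mut_ptr().cast(),
            iov_len: buf.len(),
        })
        .collect();

    // One mmsghdr per buffer; the kernel fills in msg_len per datagram.
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut hdr: libc::mmsghdr = unsafe { std::mem::zeroed() };
            hdr.msg_hdr.msg_iov = iov as *mut libc::iovec;
            hdr.msg_hdr.msg_iovlen = 1;
            hdr
        })
        .collect();

    // SAFETY: `msgs`, `iovecs` and the buffers they reference outlive the call.
    let n = unsafe {
        libc::recvmmsg(
            socket.as_raw_fd(),
            msgs.as_mut_ptr(),
            msgs.len() as libc::c_uint,
            0,
            std::ptr::null_mut(), // no timeout
        )
    };
    if n < 0 {
        return Err(io::Error::last_os_error());
    }

    Ok(msgs[..n as usize]
        .iter()
        .map(|m| m.msg_len as usize)
        .collect())
}
```

The write path is symmetric via libc::sendmmsg; a sketch of that side appears further down in this thread.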
Benchmark results
Performance differences relative to 3e537097cbd6e75b861d0d859f9c5eb9c233a723.
- drain a timer quickly time: [434.66 ns 441.22 ns 447.54 ns] change: [-0.6964% +0.8250% +2.9072%] (p = 0.38 > 0.05) No change in performance detected.
- coalesce_acked_from_zero 1+1 entries time: [196.62 ns 197.11 ns 197.61 ns] change: [-0.8171% -0.4829% -0.1380%] (p = 0.01 < 0.05) Change within noise threshold.
- coalesce_acked_from_zero 3+1 entries time: [240.43 ns 241.00 ns 241.64 ns] change: [-0.5382% -0.0751% +0.3426%] (p = 0.76 > 0.05) No change in performance detected.
- coalesce_acked_from_zero 10+1 entries time: [239.00 ns 239.69 ns 240.53 ns] change: [-1.3511% -0.5261% +0.0766%] (p = 0.17 > 0.05) No change in performance detected.
- coalesce_acked_from_zero 1000+1 entries time: [216.68 ns 216.96 ns 217.27 ns] change: [-0.7531% -0.1853% +0.3917%] (p = 0.55 > 0.05) No change in performance detected.
- RxStreamOrderer::inbound_frame() time: [119.87 ms 119.94 ms 120.00 ms] change: [+0.4749% +0.6968% +0.8697%] (p = 0.00 < 0.05) Change within noise threshold.
- transfer/Run multiple transfers with varying seeds time: [118.08 ms 118.39 ms 118.71 ms] thrpt: [33.696 MiB/s 33.786 MiB/s 33.875 MiB/s] change: time: [-3.4361% -3.1033% -2.7586%] (p = 0.00 < 0.05) thrpt: [+2.8369% +3.2027% +3.5584%] Change within noise threshold.
- transfer/Run multiple transfers with the same seed time: [118.90 ms 119.07 ms 119.25 ms] thrpt: [33.542 MiB/s 33.594 MiB/s 33.642 MiB/s] change: time: [-2.7935% -2.5730% -2.3522%] (p = 0.00 < 0.05) thrpt: [+2.4089% +2.6410% +2.8738%] Change within noise threshold.
- 1-conn/1-100mb-resp (aka. Download)/client time: [1.0852 s 1.0898 s 1.0946 s] thrpt: [91.361 MiB/s 91.760 MiB/s 92.150 MiB/s] change: time: [-2.3395% -0.9694% +0.1106%] (p = 0.16 > 0.05) thrpt: [-0.1105% +0.9789% +2.3956%] No change in performance detected.
- 1-conn/10_000-parallel-1b-resp (aka. RPS)/client time: [542.73 ms 546.39 ms 550.03 ms] thrpt: [18.181 Kelem/s 18.302 Kelem/s 18.425 Kelem/s] change: time: [+26.586% +27.556% +28.610%] (p = 0.00 < 0.05) thrpt: [-22.245% -21.603% -21.002%] :broken_heart: Performance has regressed.
- 1-conn/1-1b-resp (aka. HPS)/client time: [52.299 ms 52.653 ms 53.052 ms] thrpt: [18.849 elem/s 18.992 elem/s 19.121 elem/s] change: time: [+2.6478% +3.6365% +4.7018%] (p = 0.00 < 0.05) thrpt: [-4.4906% -3.5089% -2.5795%] :broken_heart: Performance has regressed.
Client/server transfer results
Transfer of 134217728 bytes over loopback.
| Client | Server | CC | Pacing | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|---|---|---|
| msquic | msquic | | | 671.8 ± 293.0 | 379.2 | 1321.6 | 1.00 |
| neqo | msquic | reno | on | 908.9 ± 211.4 | 713.1 | 1268.5 | 1.00 |
| neqo | msquic | reno | | 873.0 ± 170.4 | 713.5 | 1221.6 | 1.00 |
| neqo | msquic | cubic | on | 958.8 ± 398.9 | 697.1 | 1869.5 | 1.00 |
| neqo | msquic | cubic | | 832.3 ± 127.4 | 710.1 | 1009.0 | 1.00 |
| msquic | neqo | reno | on | 3230.5 ± 230.0 | 2967.8 | 3778.0 | 1.00 |
| msquic | neqo | reno | | 3232.8 ± 200.1 | 2888.6 | 3582.1 | 1.00 |
| msquic | neqo | cubic | on | 3478.9 ± 224.7 | 3270.5 | 3986.4 | 1.00 |
| msquic | neqo | cubic | | 3262.9 ± 74.2 | 3161.2 | 3363.7 | 1.00 |
| neqo | neqo | reno | on | 2627.0 ± 131.2 | 2445.5 | 2849.8 | 1.00 |
| neqo | neqo | reno | | 2537.0 ± 227.2 | 2377.2 | 3146.6 | 1.00 |
| neqo | neqo | cubic | on | 3032.9 ± 272.3 | 2734.0 | 3479.1 | 1.00 |
| neqo | neqo | cubic | | 2626.5 ± 205.8 | 2403.5 | 3060.2 | 1.00 |
Codecov Report
Attention: Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.
Project coverage is 93.22%. Comparing base (3e53709) to head (38fca6b).
| Files | Patch % | Lines |
|---|---|---|
| neqo-http3/src/send_message.rs | 50.00% | 5 Missing :warning: |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #1741      +/-   ##
==========================================
- Coverage   93.26%   93.22%   -0.04%
==========================================
  Files         110      110
  Lines       35669    35680      +11
==========================================
- Hits        33266    33262       -4
- Misses       2403     2418      +15
```
Status update:
Thus far I am having a hard time finding clear signals that this is a performance improvement. Not saying it isn't; I am just missing proof.
Durations of client/server transfers on the benchmark machine vary significantly (30 s to 45 s) across CI runs. The same goes for the I/O syscall samples in perf traces.
Maybe it is worth hooking into something like criterion to get stable metrics and filter out outliers. I will investigate further. Suggestions welcome.
Remember that pacing interferes here. You might want to log how large the batches are that you are actually doing on TX/RX. I think @KershawChang had a feature to turn off pacing, which should increase those batches.
As discussed elsewhere, we likely have other bugs/bottlenecks in the code that limit the impact this change will have, too.
Thank you for the input.
> You might want to log how large the batches are that you are actually doing on TX/RX
Will do.
> Remember that pacing interferes here.
It is actually disabled by default, though by mistake. Thanks for the hint! See https://github.com/mozilla/neqo/pull/1753.
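As for logging batch sizes, here is a minimal sketch of the kind of instrumentation meant above; the BatchStats helper is hypothetical (not part of neqo) and uses the log crate's debug! macro:

```rust
// Hedged sketch: track how many datagrams each batched syscall moves, so
// TX/RX batch sizes show up in the logs. BatchStats is a hypothetical helper.
use log::debug;

#[derive(Default)]
struct BatchStats {
    calls: u64,
    datagrams: u64,
    max_batch: usize,
}

impl BatchStats {
    /// Record one sendmmsg/recvmmsg call that moved `n` datagrams.
    fn on_batch(&mut self, n: usize) {
        self.calls += 1;
        self.datagrams += n as u64;
        self.max_batch = self.max_batch.max(n);
        debug!(
            "batch of {n} datagrams (avg {:.2}, max {})",
            self.datagrams as f64 / self.calls as f64,
            self.max_batch
        );
    }
}
```

Calling on_batch(n) right after each sendmmsg/recvmmsg return would make it easy to see whether pacing keeps the TX batches small.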
Small update:
Solely using recvmmsg (this pull request) without sendmmsg or any buffer optimizations does not show any performance improvement. Quite the opposite: our benchmarks report regressions.
I hacked together a sendmmsg implementation (not yet pushed; a rough sketch of the idea follows after the numbers below). The results (sendmmsg + recvmmsg) look promising:
```
➜  neqo-bin git:(main) ✗ critcmp main sendmmsg
group                                         main                                  sendmmsg
-----                                         ----                                  --------
1-conn/1-100mb-resp (aka. Download)/client    1.09    541.4±5.05ms 184.7 MB/sec     1.00    494.7±213.29ms 202.1 MB/sec
```
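For context, a rough sketch of what such a batched write could look like with libc::sendmmsg (Linux-only). The send_batch helper below is an illustrative assumption, not the code on the unpushed branch:

```rust
// Hedged sketch: batched writes via libc::sendmmsg on a connected std
// UdpSocket. Illustrative only; the actual branch may differ.
use std::{io, net::UdpSocket, os::fd::AsRawFd};

/// Write up to `datagrams.len()` datagrams with a single sendmmsg(2)
/// syscall and return how many were actually sent.
fn send_batch(socket: &UdpSocket, datagrams: &[&[u8]]) -> io::Result<usize> {
    // One iovec per outgoing datagram.
    let mut iovecs: Vec<libc::iovec> = datagrams
        .iter()
        .map(|d| libc::iovec {
            iov_base: d.as_ptr() as *mut libc::c_void,
            iov_len: d.len(),
        })
        .collect();

    // One mmsghdr per datagram; msg_name stays null on a connected socket.
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut hdr: libc::mmsghdr = unsafe { std::mem::zeroed() };
            hdr.msg_hdr.msg_iov = iov as *mut libc::iovec;
            hdr.msg_hdr.msg_iovlen = 1;
            hdr
        })
        .collect();

    // SAFETY: `msgs`, `iovecs` and the datagram slices outlive the call.
    let sent = unsafe {
        libc::sendmmsg(
            socket.as_raw_fd(),
            msgs.as_mut_ptr(),
            msgs.len() as libc::c_uint,
            0,
        )
    };
    if sent < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(sent as usize)
}
```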
recvmmsg without sendmmsg will have no benefit, since without the latter batches will be very short.
- 1-conn/1-100mb-resp (aka. Download)/client time: [781.71 ms 787.17 ms 792.92 ms] thrpt: [126.12 MiB/s 127.04 MiB/s 127.93 MiB/s] change: time: [-30.837% -28.888% -26.932%] (p = 0.00 < 0.05) thrpt: [+36.858% +40.623% +44.586%] 💚 Performance has improved.
:tada: 40% improvement in throughput. First (?) time breaking 1 Gbit/s on CI.
- 1-conn/10_000-1b-seq-resp (aka. RPS)/client time: [446.15 ms 448.43 ms 450.77 ms] thrpt: [22.184 Kelem/s 22.300 Kelem/s 22.414 Kelem/s] change: time: [+14.829% +15.832% +16.877%] (p = 0.00 < 0.05) thrpt: [-14.440% -13.668% -12.914%] 💔 Performance has regressed.
- 100-seq-conn/1-1b-resp (aka. HPS)/client time: [3.6042 s 3.6064 s 3.6087 s] thrpt: [27.711 elem/s 27.728 elem/s 27.745 elem/s] change: time: [+6.1249% +6.2317% +6.3299%] (p = 0.00 < 0.05) thrpt: [-5.9531% -5.8662% -5.7714%] 💔 Performance has regressed.
Currently investigating why performance regresses here. It does not seem to be tied to BATCH_SIZE. Setting BATCH_SIZE to 1, effectively disabling sendmmsg and recvmmsg batching, does not improve the RPS benchmark:
```
➜  neqo-bin git:(recvmmsg) ✗ critcmp main recvmmsg-batch-32 recvmmsg-batch-1 -f "RPS"
group                                          main                                  recvmmsg-batch-1                      recvmmsg-batch-32
-----                                          ----                                  ----------------                      -----------------
1-conn/10_000-1b-seq-resp (aka. RPS)/client    1.00    122.8±16.81ms 79.5 KElem/sec  1.24    152.6±17.01ms 64.0 KElem/sec   1.24    152.6±23.00ms 64.0 KElem/sec
```
@mxinden is this OBE now?
Sorry for the delay here. I plan to extract parts of this pull request for https://github.com/mozilla/neqo/issues/1693. That said, there is no need for a draft pull request; I can just cherry-pick off of my branch.