
feat(udp): use sendmmsg and recvmmsg

Open · mxinden opened this pull request 1 year ago · 9 comments

Write and read up to BATCH_SIZE datagrams with a single sendmmsg or recvmmsg syscall.

Previously, neqo_bin::udp::Socket::send would use sendmmsg but provide only a single buffer, effectively using sendmsg instead of sendmmsg. The same applied to Socket::recv.

With this commit, Socket::send provides BATCH_SIZE buffers on each sendmmsg syscall, thus writing more than one datagram at a time when more are available. The same applies to Socket::recv.


Part of https://github.com/mozilla/neqo/issues/1693.
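
For illustration only, here is a minimal sketch of the RX side under these assumptions: Linux, the libc crate, a non-blocking read, and illustrative names (BATCH_SIZE, MAX_DATAGRAM, recv_batch). It is not the actual neqo_bin::udp::Socket code.

```rust
// A minimal sketch of batched receive with recvmmsg(2) through the libc
// crate (Linux only). BATCH_SIZE, MAX_DATAGRAM, and recv_batch are
// illustrative names, not the implementation in this pull request.
use std::net::UdpSocket;
use std::os::fd::AsRawFd;

const BATCH_SIZE: usize = 32;
const MAX_DATAGRAM: usize = 65535;

fn recv_batch(socket: &UdpSocket) -> std::io::Result<Vec<Vec<u8>>> {
    let mut bufs = vec![[0u8; MAX_DATAGRAM]; BATCH_SIZE];

    // One iovec per buffer, each wrapped in its own mmsghdr.
    let mut iovecs: Vec<libc::iovec> = bufs
        .iter_mut()
        .map(|buf| libc::iovec {
            iov_base: buf.as_mut_ptr() as *mut libc::c_void,
            iov_len: buf.len(),
        })
        .collect();
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut hdr: libc::msghdr = unsafe { std::mem::zeroed() };
            hdr.msg_iov = iov as *mut libc::iovec;
            hdr.msg_iovlen = 1;
            libc::mmsghdr { msg_hdr: hdr, msg_len: 0 }
        })
        .collect();

    // A single syscall fills up to BATCH_SIZE datagrams. MSG_DONTWAIT makes
    // it return immediately with whatever is queued; an error of kind
    // WouldBlock means nothing was pending.
    let n = unsafe {
        libc::recvmmsg(
            socket.as_raw_fd(),
            msgs.as_mut_ptr(),
            msgs.len() as libc::c_uint,
            libc::MSG_DONTWAIT,
            std::ptr::null_mut(),
        )
    };
    if n < 0 {
        return Err(std::io::Error::last_os_error());
    }

    // msg_len holds the number of bytes received for each message.
    Ok(msgs[..n as usize]
        .iter()
        .zip(&bufs)
        .map(|(msg, buf)| buf[..msg.msg_len as usize].to_vec())
        .collect())
}
```

Each call amortizes the fixed per-syscall overhead over up to BATCH_SIZE datagrams, which is where the expected speedup comes from.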

mxinden commented on Mar 13 '24

Benchmark results

Performance differences relative to 3e537097cbd6e75b861d0d859f9c5eb9c233a723.

  • drain a timer quickly time: [434.66 ns 441.22 ns 447.54 ns] change: [-0.6964% +0.8250% +2.9072%] (p = 0.38 > 0.05) No change in performance detected.

  • coalesce_acked_from_zero 1+1 entries time: [196.62 ns 197.11 ns 197.61 ns] change: [-0.8171% -0.4829% -0.1380%] (p = 0.01 < 0.05) Change within noise threshold.

  • coalesce_acked_from_zero 3+1 entries time: [240.43 ns 241.00 ns 241.64 ns] change: [-0.5382% -0.0751% +0.3426%] (p = 0.76 > 0.05) No change in performance detected.

  • coalesce_acked_from_zero 10+1 entries time: [239.00 ns 239.69 ns 240.53 ns] change: [-1.3511% -0.5261% +0.0766%] (p = 0.17 > 0.05) No change in performance detected.

  • coalesce_acked_from_zero 1000+1 entries time: [216.68 ns 216.96 ns 217.27 ns] change: [-0.7531% -0.1853% +0.3917%] (p = 0.55 > 0.05) No change in performance detected.

  • RxStreamOrderer::inbound_frame() time: [119.87 ms 119.94 ms 120.00 ms] change: [+0.4749% +0.6968% +0.8697%] (p = 0.00 < 0.05) Change within noise threshold.

  • transfer/Run multiple transfers with varying seeds time: [118.08 ms 118.39 ms 118.71 ms] thrpt: [33.696 MiB/s 33.786 MiB/s 33.875 MiB/s] change: time: [-3.4361% -3.1033% -2.7586%] (p = 0.00 < 0.05) thrpt: [+2.8369% +3.2027% +3.5584%] Change within noise threshold.

  • transfer/Run multiple transfers with the same seed time: [118.90 ms 119.07 ms 119.25 ms] thrpt: [33.542 MiB/s 33.594 MiB/s 33.642 MiB/s] change: time: [-2.7935% -2.5730% -2.3522%] (p = 0.00 < 0.05) thrpt: [+2.4089% +2.6410% +2.8738%] Change within noise threshold.

  • 1-conn/1-100mb-resp (aka. Download)/client time: [1.0852 s 1.0898 s 1.0946 s] thrpt: [91.361 MiB/s 91.760 MiB/s 92.150 MiB/s] change: time: [-2.3395% -0.9694% +0.1106%] (p = 0.16 > 0.05) thrpt: [-0.1105% +0.9789% +2.3956%] No change in performance detected.

  • 1-conn/10_000-parallel-1b-resp (aka. RPS)/client time: [542.73 ms 546.39 ms 550.03 ms] thrpt: [18.181 Kelem/s 18.302 Kelem/s 18.425 Kelem/s] change: time: [+26.586% +27.556% +28.610%] (p = 0.00 < 0.05) thrpt: [-22.245% -21.603% -21.002%] :broken_heart: Performance has regressed.

  • 1-conn/1-1b-resp (aka. HPS)/client time: [52.299 ms 52.653 ms 53.052 ms] thrpt: [18.849 elem/s 18.992 elem/s 19.121 elem/s] change: time: [+2.6478% +3.6365% +4.7018%] (p = 0.00 < 0.05) thrpt: [-4.4906% -3.5089% -2.5795%] :broken_heart: Performance has regressed.

Client/server transfer results

Transfer of 134217728 bytes (128 MiB) over loopback.

| Client | Server | CC    | Pacing | Mean [ms]      | Min [ms] | Max [ms] | Relative |
|--------|--------|-------|--------|----------------|----------|----------|----------|
| msquic | msquic |       |        | 671.8 ± 293.0  | 379.2    | 1321.6   | 1.00     |
| neqo   | msquic | reno  | on     | 908.9 ± 211.4  | 713.1    | 1268.5   | 1.00     |
| neqo   | msquic | reno  |        | 873.0 ± 170.4  | 713.5    | 1221.6   | 1.00     |
| neqo   | msquic | cubic | on     | 958.8 ± 398.9  | 697.1    | 1869.5   | 1.00     |
| neqo   | msquic | cubic |        | 832.3 ± 127.4  | 710.1    | 1009.0   | 1.00     |
| msquic | neqo   | reno  | on     | 3230.5 ± 230.0 | 2967.8   | 3778.0   | 1.00     |
| msquic | neqo   | reno  |        | 3232.8 ± 200.1 | 2888.6   | 3582.1   | 1.00     |
| msquic | neqo   | cubic | on     | 3478.9 ± 224.7 | 3270.5   | 3986.4   | 1.00     |
| msquic | neqo   | cubic |        | 3262.9 ± 74.2  | 3161.2   | 3363.7   | 1.00     |
| neqo   | neqo   | reno  | on     | 2627.0 ± 131.2 | 2445.5   | 2849.8   | 1.00     |
| neqo   | neqo   | reno  |        | 2537.0 ± 227.2 | 2377.2   | 3146.6   | 1.00     |
| neqo   | neqo   | cubic | on     | 3032.9 ± 272.3 | 2734.0   | 3479.1   | 1.00     |
| neqo   | neqo   | cubic |        | 2626.5 ± 205.8 | 2403.5   | 3060.2   | 1.00     |


github-actions[bot] commented on Mar 15 '24

Codecov Report

Attention: Patch coverage is 73.68421%, with 5 lines in your changes missing coverage. Please review.

Project coverage is 93.22%. Comparing base (3e53709) to head (38fca6b).

| Files                          | Patch % | Lines              |
|--------------------------------|---------|--------------------|
| neqo-http3/src/send_message.rs | 50.00%  | 5 Missing :warning: |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #1741      +/-   ##
==========================================
- Coverage   93.26%   93.22%   -0.04%
==========================================
  Files         110      110
  Lines       35669    35680      +11
==========================================
- Hits        33266    33262       -4
- Misses       2403     2418      +15
```


codecov[bot] commented on Mar 15 '24

Status update:

Thus far I am having a hard time finding clear signals that this is a performance improvement. Not saying it isn't; I am just missing proof.

Durations of client/server transfers on the benchmark machine vary significantly (30 s to 45 s) across CI runs. The same goes for the I/O syscall samples in perf traces.

It may be worth hooking into something like criterion to get stable metrics and filter out outliers. I will investigate further; suggestions welcome.
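
For context, a criterion bench would look roughly like the skeleton below; the benchmark name and run_one_transfer are hypothetical stubs, not neqo's actual bench setup.

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Hypothetical stub: one client/server transfer over loopback.
fn run_one_transfer() {
    // Set up a client and a server and move the payload here.
}

fn transfer(c: &mut Criterion) {
    // criterion runs the closure many times, detects and reports outliers,
    // and gives a confidence interval instead of a single noisy wall-clock
    // number.
    c.bench_function("1-conn transfer over loopback", |b| b.iter(run_one_transfer));
}

criterion_group!(benches, transfer);
criterion_main!(benches);
```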

mxinden commented on Mar 15 '24

Remember that pacing interferes here. You might want to log how large the batches are that you are actually doing on TX/RX. I think @KershawChang had a feature to turn off pacing, which should increase those batches.

As discussed elsewhere, we likely have other bugs/bottlenecks in the code that limit the impact this change will have, too.

larseggert commented on Mar 15 '24

Thank you for the input.

> You might want to log how large the batches are that you are actually doing on TX/RX

Will do.
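
A minimal sketch of what that instrumentation could look like, assuming the log crate and the illustrative recv_batch helper sketched in the description:

```rust
// Hypothetical instrumentation: count how large RX batches actually are,
// so the effect of pacing on batching becomes visible.
fn recv_and_record(
    socket: &std::net::UdpSocket,
    histogram: &mut [u64; BATCH_SIZE + 1],
) -> std::io::Result<Vec<Vec<u8>>> {
    let datagrams = recv_batch(socket)?;
    // Bucket by batch size; index 0 counts empty (WouldBlock-free) reads.
    histogram[datagrams.len()] += 1;
    log::debug!("recvmmsg returned a batch of {} datagrams", datagrams.len());
    Ok(datagrams)
}
```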

> Remember that pacing interferes here.

It is actually disabled by default, though by mistake. Thanks for the hint! See https://github.com/mozilla/neqo/pull/1753.

mxinden commented on Mar 16 '24

Small update:

Solely using recvmmsg (this pull request) without sendmmsg or any buffer optimizations does not show a performance improvement. Quite the opposite: our benchmarks report regressions.

I hacked together a sendmmsg implementation (not yet pushed). The results (sendmmsg + recvmmsg) look promising:

```
➜  neqo-bin git:(main) ✗ critcmp main sendmmsg
group                                          main                                     sendmmsg
-----                                          ----                                     --------
1-conn/1-100mb-resp (aka. Download)/client     1.09    541.4±5.05ms   184.7 MB/sec      1.00  494.7±213.29ms   202.1 MB/sec
```
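
For reference, the TX side can be sketched the same way as the RX sketch in the description; send_batch is a hypothetical helper for a connected socket, not the implementation I hacked together:

```rust
// A minimal sketch of batched send with sendmmsg(2) through the libc crate
// (Linux only). send_batch is an illustrative name; the socket is assumed
// connected, so msg_name can stay null.
use std::net::UdpSocket;
use std::os::fd::AsRawFd;

fn send_batch(socket: &UdpSocket, datagrams: &[Vec<u8>]) -> std::io::Result<usize> {
    // One iovec and one mmsghdr per outgoing datagram.
    let mut iovecs: Vec<libc::iovec> = datagrams
        .iter()
        .map(|d| libc::iovec {
            iov_base: d.as_ptr() as *mut libc::c_void,
            iov_len: d.len(),
        })
        .collect();
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut hdr: libc::msghdr = unsafe { std::mem::zeroed() };
            hdr.msg_iov = iov as *mut libc::iovec;
            hdr.msg_iovlen = 1;
            libc::mmsghdr { msg_hdr: hdr, msg_len: 0 }
        })
        .collect();

    // One syscall writes the whole batch. A short return value means only a
    // prefix was sent and the caller must retry the remainder.
    let sent = unsafe {
        libc::sendmmsg(
            socket.as_raw_fd(),
            msgs.as_mut_ptr(),
            msgs.len() as libc::c_uint,
            0,
        )
    };
    if sent < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(sent as usize)
}
```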

mxinden commented on Apr 03 '24

recvmmsg without sendmmsg will have no benefit, since without the latter batches will be very short.

larseggert commented on Apr 03 '24

  • 1-conn/1-100mb-resp (aka. Download)/client time: [781.71 ms 787.17 ms 792.92 ms] thrpt: [126.12 MiB/s 127.04 MiB/s 127.93 MiB/s] change: time: [-30.837% -28.888% -26.932%] (p = 0.00 < 0.05) thrpt: [+36.858% +40.623% +44.586%] 💚 Performance has improved.

:tada: 40% improvement in throughput. First (?) time breaking 1 Gbit/s on CI.

mxinden commented on Apr 04 '24

  • 1-conn/10_000-1b-seq-resp (aka. RPS)/client time: [446.15 ms 448.43 ms 450.77 ms] thrpt: [22.184 Kelem/s 22.300 Kelem/s 22.414 Kelem/s] change: time: [+14.829% +15.832% +16.877%] (p = 0.00 < 0.05) thrpt: [-14.440% -13.668% -12.914%] 💔 Performance has regressed.

  • 100-seq-conn/1-1b-resp (aka. HPS)/client time: [3.6042 s 3.6064 s 3.6087 s] thrpt: [27.711 elem/s 27.728 elem/s 27.745 elem/s] change: time: [+6.1249% +6.2317% +6.3299%] (p = 0.00 < 0.05) thrpt: [-5.9531% -5.8662% -5.7714%] 💔 Performance has regressed.

Currently investigating why performance regresses here. It does not seem to be tied to BATCH_SIZE. Setting BATCH_SIZE to 1, effectively disabling batching via sendmmsg and recvmmsg, does not improve the RPS benchmark:

```
➜  neqo-bin git:(recvmmsg) ✗ critcmp main recvmmsg-batch-32 recvmmsg-batch-1 -f "RPS"
group                                          main                                   recvmmsg-batch-1                       recvmmsg-batch-32
-----                                          ----                                   ----------------                       -----------------
1-conn/10_000-1b-seq-resp (aka. RPS)/client    1.00   122.8±16.81ms 79.5 KElem/sec    1.24   152.6±17.01ms 64.0 KElem/sec    1.24   152.6±23.00ms 64.0 KElem/sec
```

mxinden commented on Apr 05 '24

@mxinden is this OBE (overtaken by events) now?

larseggert commented on Jun 18 '24

Sorry for the delay here. I plan to extract parts of this pull request for https://github.com/mozilla/neqo/issues/1693. That said, there is no need for a draft pull request; I can just cherry-pick off of my branch.

mxinden commented on Jul 01 '24