fix: No time-based loss detection before `largest_acked` & fixes
I will refactor this PR into smaller standalone ones.
#1603 broke this by incorrectly enabling time-based loss detection before we have received a valid `largest_acked`.
Once a later packet within the same packet number space has been acknowledged, an endpoint SHOULD declare an earlier packet lost if it was sent a threshold amount of time in the past.
This makes it clear that time-based LD should be disabled before an ACK is received.
`assert(largest_acked_packet[pn_space] != infinite)`
The `DetectAndRemoveLostPackets` pseudocode in RFC 9002 also contains the assert above.
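As a rough illustration of the fix, here is a minimal Rust sketch (illustrative names, not the actual neqo code, and simplified to the time-threshold condition of RFC 9002, Section 6.1): time-based loss detection returns early while no ACK has established a `largest_acked` for the packet number space.

```rust
use std::time::{Duration, Instant};

// Sketch only: illustrative names, not the actual neqo implementation.
fn detect_lost_packets(
    largest_acked: Option<u64>, // None until the first ACK arrives in this space
    loss_delay: Duration,
    now: Instant,
    sent_packets: &[(u64, Instant)], // (packet number, time sent)
) -> Vec<u64> {
    // Mirrors `assert(largest_acked_packet[pn_space] != infinite)`: without a
    // largest_acked there is nothing to compare against, so no packet may be
    // declared lost on a time basis yet.
    let Some(largest_acked) = largest_acked else {
        return Vec::new();
    };
    sent_packets
        .iter()
        .filter(|(pn, sent)| *pn <= largest_acked && now.duration_since(*sent) >= loss_delay)
        .map(|(pn, _)| *pn)
        .collect()
}
```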
When a PTO fires, we now explicitly also do a congestion reaction.
Fixes #1874
This PR also includes fixes intended to make retransmissions during the handshake more robust: when a PTO fires, all packets in all affected packet number spaces are timed out (making their contents eligible for RTX), and `ack_delay` is not applied to the application packet number space's PTO during the handshake.
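A hedged sketch of the `ack_delay` part of that change (hypothetical names, not the actual neqo recovery code): per RFC 9002, Section 6.2.1, only the application data space's PTO includes the peer's `max_ack_delay`, and only once the handshake is confirmed.

```rust
use std::time::Duration;

#[derive(PartialEq, Clone, Copy)]
enum PacketNumberSpace { Initial, Handshake, ApplicationData }

// Sketch of the PTO period computation (RFC 9002, Section 6.2.1).
fn pto_period(
    smoothed_rtt: Duration,
    rttvar: Duration,
    max_ack_delay: Duration,
    space: PacketNumberSpace,
    handshake_confirmed: bool,
) -> Duration {
    let granularity = Duration::from_millis(1); // kGranularity
    let mut pto = smoothed_rtt + (4 * rttvar).max(granularity);
    // Only the ApplicationData space accounts for the peer's ack delay, and
    // only once the handshake is confirmed; during the handshake the peer may
    // not be able to delay ACKs at all.
    if space == PacketNumberSpace::ApplicationData && handshake_confirmed {
        pto += max_ack_delay;
    }
    pto
}
```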
Failed Interop Tests
None :question:
All results
Succeeded Interop Tests
None :question:
Unsupported Interop Tests
None :question:
Benchmark results
Performance differences relative to d6279bf40c91c4cb45c358d5f614dc6190e4c9eb.
coalesce_acked_from_zero 1+1 entries: No change in performance detected.
time: [99.781 ns 100.06 ns 100.35 ns]
change: [-0.0691% +0.3854% +0.8727%] (p = 0.13 > 0.05)
Found 12 outliers among 100 measurements (12.00%)
9 (9.00%) high mild
3 (3.00%) high severe
coalesce_acked_from_zero 3+1 entries: No change in performance detected.
time: [117.84 ns 118.21 ns 118.61 ns]
change: [-0.1498% +0.3222% +0.8142%] (p = 0.19 > 0.05)
Found 21 outliers among 100 measurements (21.00%)
3 (3.00%) low severe
1 (1.00%) low mild
4 (4.00%) high mild
13 (13.00%) high severe
coalesce_acked_from_zero 10+1 entries: No change in performance detected.
time: [117.66 ns 118.18 ns 118.78 ns]
change: [-0.3301% +0.3180% +0.8978%] (p = 0.32 > 0.05)
Found 21 outliers among 100 measurements (21.00%)
6 (6.00%) low severe
2 (2.00%) low mild
3 (3.00%) high mild
10 (10.00%) high severe
coalesce_acked_from_zero 1000+1 entries: Change within noise threshold.
time: [98.191 ns 98.432 ns 98.741 ns]
change: [+0.2702% +1.2029% +2.1014%] (p = 0.01 < 0.05)
Found 9 outliers among 100 measurements (9.00%)
2 (2.00%) high mild
7 (7.00%) high severe
RxStreamOrderer::inbound_frame(): No change in performance detected.
time: [110.98 ms 111.11 ms 111.30 ms]
change: [-0.1477% -0.0232% +0.1488%] (p = 0.82 > 0.05)
Found 11 outliers among 100 measurements (11.00%)
7 (7.00%) low mild
3 (3.00%) high mild
1 (1.00%) high severe
transfer/pacing-false/varying-seeds: No change in performance detected.
time: [28.531 ms 29.337 ms 30.147 ms]
change: [-0.8630% +4.2908% +9.9250%] (p = 0.12 > 0.05)
transfer/pacing-true/varying-seeds: Change within noise threshold.
time: [37.814 ms 39.294 ms 40.754 ms]
change: [+0.5884% +7.1476% +14.299%] (p = 0.04 < 0.05)
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) low mild
transfer/pacing-false/same-seed: :green_heart: Performance has improved.
time: [20.227 ms 21.095 ms 21.965 ms]
change: [-23.657% -19.714% -15.468%] (p = 0.00 < 0.05)
transfer/pacing-true/same-seed: :green_heart: Performance has improved.
time: [20.852 ms 21.936 ms 23.034 ms]
change: [-52.505% -49.129% -45.393%] (p = 0.00 < 0.05)
1-conn/1-100mb-resp (aka. Download)/client: No change in performance detected.
time: [113.70 ms 114.11 ms 114.50 ms]
thrpt: [873.36 MiB/s 876.33 MiB/s 879.50 MiB/s]
change:
time: [-0.7008% -0.2614% +0.1863%] (p = 0.26 > 0.05)
thrpt: [-0.1860% +0.2621% +0.7057%]
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low severe
1 (1.00%) low mild
1-conn/10_000-parallel-1b-resp (aka. RPS)/client: No change in performance detected.
time: [312.75 ms 316.12 ms 319.49 ms]
thrpt: [31.300 Kelem/s 31.633 Kelem/s 31.974 Kelem/s]
change:
time: [-2.4277% -0.8752% +0.6564%] (p = 0.27 > 0.05)
thrpt: [-0.6521% +0.8830% +2.4881%]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
1-conn/1-1b-resp (aka. HPS)/client: No change in performance detected.
time: [33.867 ms 34.070 ms 34.294 ms]
thrpt: [29.160 elem/s 29.351 elem/s 29.527 elem/s]
change:
time: [-1.5619% -0.6708% +0.2182%] (p = 0.13 > 0.05)
thrpt: [-0.2177% +0.6753% +1.5866%]
Found 12 outliers among 100 measurements (12.00%)
3 (3.00%) low mild
2 (2.00%) high mild
7 (7.00%) high severe
Client/server transfer results
Transfer of 33554432 bytes over loopback.
| Client | Server | CC | Pacing | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|---|---|---|
| msquic | msquic | | | 173.1 ± 109.5 | 100.9 | 564.9 | 1.00 |
| neqo | msquic | reno | on | 243.7 ± 61.5 | 212.3 | 434.9 | 1.00 |
| neqo | msquic | reno | | 257.1 ± 67.1 | 218.3 | 428.7 | 1.00 |
| neqo | msquic | cubic | on | 242.9 ± 63.0 | 213.1 | 439.9 | 1.00 |
| neqo | msquic | cubic | | 294.2 ± 98.2 | 217.5 | 455.0 | 1.00 |
| msquic | neqo | reno | on | 174.8 ± 101.8 | 81.9 | 336.3 | 1.00 |
| msquic | neqo | reno | | 147.6 ± 94.0 | 84.3 | 360.6 | 1.00 |
| msquic | neqo | cubic | on | 139.7 ± 78.4 | 83.8 | 324.7 | 1.00 |
| msquic | neqo | cubic | | 137.2 ± 89.0 | 80.5 | 373.3 | 1.00 |
| neqo | neqo | reno | on | 196.4 ± 71.6 | 129.8 | 366.3 | 1.00 |
| neqo | neqo | reno | | 190.6 ± 95.6 | 125.1 | 434.5 | 1.00 |
| neqo | neqo | cubic | on | 196.8 ± 74.1 | 131.4 | 381.0 | 1.00 |
| neqo | neqo | cubic | | 249.7 ± 111.5 | 133.1 | 471.2 | 1.00 |
Firefox builds for this PR
The following builds are available for testing. Crossed-out builds did not succeed.
- Linux: ~~Debug~~ ~~Release~~
- macOS: ~~Debug~~ ~~Release~~
- Windows: ~~Debug~~ ~~Release~~
Codecov Report
Attention: Patch coverage is 99.08257% with 1 line in your changes missing coverage. Please review.
Please upload report for BASE (main@d6279bf). Learn more about missing BASE report.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| neqo-transport/src/path.rs | 87.50% | 1 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #1998 +/- ##
=======================================
Coverage ? 95.31%
=======================================
Files ? 112
Lines ? 36352
Branches ? 0
=======================================
Hits ? 34648
Misses ? 1704
Partials ? 0
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Kwik failure: https://github.com/ptrd/kwik/issues/46
Quinn failure: https://github.com/quinn-rs/quinn/issues/1929
Is there any point in undoing some more of the changes from #1603?
Still looking into the performance drop in "transfer/Run multiple transfers with varying seeds".
We should also look into why pacing is making things slower:
neqo neqo reno on 415.4 ± 39.1 375.7 480.0 1.00
neqo neqo reno 206.9 ± 21.3 181.8 242.1 1.00
neqo neqo cubic on 450.3 ± 57.6 385.6 575.1 1.00
neqo neqo cubic 285.0 ± 168.1 168.5 610.7 1.00
Maybe saving others some clicks, here are the numbers of a recent bench run on main:
neqo neqo reno on 196.1 ± 19.4 172.1 236.5 1.00
neqo neqo reno 174.4 ± 10.3 153.0 195.1 1.00
neqo neqo cubic on 194.6 ± 27.0 173.3 275.2 1.00
neqo neqo cubic 175.7 ± 5.9 168.8 188.3 1.00
https://github.com/mozilla/neqo/actions/runs/10051108975
Pacing will make things slower, but might make things nicer on the network. It should be a small delta, not a big one.
So I bisected this, and b6e05e304b1ef582bc232710d3e1177fb5b371fe is the first bad commit (where performance of that benchmark drops).
There are really only two changes in there: one is the removal of the max_ack_delay addition from the loss_delay, the other is the fix to not do time-based loss recovery before having a largest_acked. I fail to see how either of these could cause this, however. Hm.
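For context, RFC 9002 (Section 6.1.2) defines the loss time threshold purely in terms of RTT, with no `max_ack_delay` term; a minimal sketch (illustrative, not neqo's code):

```rust
use std::time::Duration;

// Time threshold for declaring a packet lost (RFC 9002, Section 6.1.2):
// max(kTimeThreshold * max(smoothed_rtt, latest_rtt), kGranularity),
// with kTimeThreshold = 9/8. Note that max_ack_delay is not part of it.
fn loss_delay(smoothed_rtt: Duration, latest_rtt: Duration) -> Duration {
    let granularity = Duration::from_millis(1); // kGranularity
    (smoothed_rtt.max(latest_rtt) * 9 / 8).max(granularity)
}
```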
I validated that the key fix here (not doing time-based loss recovery while we don't have a largest_acked) causes the drop in benchmark performance. I can't quite figure out why that would be, given that we're now not over-aggressively retransmitting packets.
And when I added #1603, we didn't have the benchmarks yet (nor lots of other things), so it's not so straightforward to compare the pre-1603 performance to the one now.
Thoughts?
A guess @larseggert:
- The previous version, time-based loss recovery while we don't have a `largest_acked`, leads to more packets being declared lost.
- More packets being declared lost reduces the congestion window.
- A small congestion window requires the `Pacer` to pace the packets over a smaller time window. (See the correlation of pacer and congestion window; a rough sketch follows below this list.)
- Thus smaller spaces between packets.
- Thus higher throughput on a low-latency (localhost) connection.
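To make that correlation concrete, here is a rough, generic pacing sketch (not neqo's actual `pace.rs`; the speed-up constant is an assumption): credit accrues at roughly `cwnd` bytes per RTT, so the congestion window directly controls how long the pacer asks the sender to wait before the next packet, which is the `wait` / `no wait` distinction in the trace output further down this thread.

```rust
use std::time::Duration;

const PACER_SPEEDUP: u32 = 2; // assumed speed-up/burst factor

// Rough pacing relation: if `credit` bytes may be sent immediately, a packet
// larger than the credit has to wait for the missing bytes to accrue at
// cwnd * PACER_SPEEDUP bytes per RTT.
fn pacer_wait(credit: u32, packet_size: u32, cwnd: u32, rtt: Duration) -> Duration {
    if credit >= packet_size {
        Duration::ZERO // enough credit: "no wait"
    } else {
        rtt * (packet_size - credit) / (cwnd * PACER_SPEEDUP)
    }
}
```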
Sorry for having been unclear. The performance reduction I was looking into was the drop in "transfer/Run multiple transfers with varying seeds" bench, not the drop related to pacing visible in the comparison table.
Both might have the same root cause @larseggert, no? "transfer/Run multiple transfers with varying seeds" uses the default ConnectionParameters. Thus, pacing is enabled.
https://github.com/mozilla/neqo/blob/608c0a72851a55f73dcfc9f206143748b36fd1bb/neqo-transport/src/connection/params.rs#L105
Also note that the "transfer/Run multiple transfers with varying seeds" benchmark includes packet loss.
https://github.com/mozilla/neqo/blob/608c0a72851a55f73dcfc9f206143748b36fd1bb/test-fixture/src/sim/taildrop.rs#L82-L87
Running this benchmark without pacing (e.g. via https://github.com/mozilla/neqo/pull/2006) increases performance, potentially back to the baseline.
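For reference, disabling pacing for such a run could look roughly like this, assuming `ConnectionParameters` exposes a `pacing` toggle as the linked `params.rs` suggests (a sketch, not a verified snippet from #2006):

```rust
use neqo_transport::ConnectionParameters;

fn bench_params() -> ConnectionParameters {
    // Assumption: `pacing(bool)` switches the pacer off (it defaults to on).
    ConnectionParameters::default().pacing(false)
}
```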
You're right, it might be the same cause.
Looking at the first 50 pace.rs logs of the "transfer/Run multiple transfers with varying seeds" on main and on 0rtt-tmp, more specifically at the following lines.
https://github.com/mozilla/neqo/blob/88d3ed105db1b1a805a7b8c3b64078008069a403/neqo-transport/src/pace.rs#L77
https://github.com/mozilla/neqo/blob/88d3ed105db1b1a805a7b8c3b64078008069a403/neqo-transport/src/pace.rs#L87
- `main` reduces its `cwnd` quickly to `6160`, while `0rtt-tmp` stays at `12320`.
- `main` does not pace (i.e. no `wait`) while `0rtt-tmp` paces within the first 50 logs (i.e. `wait`).
main
➜ neqo git:(main) ✗ RUST_LOG=neqo_transport::pace=trace cargo test --bench transfer --features bench varying 2>&1 | head -n 50
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.06s
Running benches/transfer.rs (target/debug/deps/transfer-7ba07c4aa3c126f9)
Testing transfer/Run multiple transfers with varying seeds
0s 0ms TRACE [Pacer 2464/1232] spend 1232 over 12320, 100ms
0s 0ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 561055565 }
0s 0ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 561055565 }
0s 0ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 561055565 }
0s 0ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 561055565 }
0s 2ms TRACE [Pacer 2464/1232] spend 671 over 12320, 100ms
0s 2ms TRACE [Pacer 1793/1232] spend 561 over 12320, 100ms
0s 2ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 561055565 }
0s 2ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 624566863 }
0s 2ms TRACE [Pacer 1232/1232] spend 1232 over 6160, 100ms
0s 2ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 624566863 }
0s 2ms TRACE [Pacer 2464/1232] next 6160/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 675862863 }
0s 2ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 624566863 }
0s 2ms TRACE [Pacer 2464/1232] next 6160/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 675862863 }
0s 2ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 624566863 }
0s 4ms TRACE [Pacer 2464/1232] spend 50 over 6160, 124.38313ms
0s 4ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 624566863 }
0s 4ms TRACE [Pacer 2464/1232] spend 81 over 6160, 124.38313ms
0s 4ms TRACE [Pacer 2383/1232] spend 234 over 6160, 124.38313ms
0s 4ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 624566863 }
0s 5ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 5ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 624566863 }
0s 5ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 5ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 624566863 }
0s 5ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 5ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 624566863 }
0s 5ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 5ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 624566863 }
0s 5ms TRACE [Pacer 1232/1232] spend 1232 over 12320, 100ms
0s 5ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 5ms TRACE [Pacer 2464/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 733268221 }
0s 6ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 6ms TRACE [Pacer 2464/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 733268221 }
0s 6ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 6ms TRACE [Pacer 2464/1232] next 12320/100ms no wait = Instant { tv_sec: 12085, tv_nsec: 733268221 }
0s 6ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 6ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 6ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 6ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 6ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 6ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 7ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 7ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 7ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0s 7ms TRACE [Pacer 2464/1232] spend 50 over 12320, 119.752079ms
0s 8ms TRACE [Pacer 2464/1232] spend 1282 over 12320, 119.752079ms
0s 8ms TRACE [Pacer 2149/1232] next 6160/124.38313ms no wait = Instant { tv_sec: 12085, tv_nsec: 685438695 }
0rtt-tmp
➜ neqo git:(0rtt-tmp) ✗ RUST_LOG=neqo_transport::pace=trace cargo test --bench transfer --features bench varying 2>&1 | head -n 50
Compiling neqo-transport v0.8.1 (/home/mxinden/code/github.com/mozilla/neqo/neqo-transport)
Compiling neqo-qpack v0.8.1 (/home/mxinden/code/github.com/mozilla/neqo/neqo-qpack)
Compiling neqo-http3 v0.8.1 (/home/mxinden/code/github.com/mozilla/neqo/neqo-http3)
Compiling test-fixture v0.8.1 (/home/mxinden/code/github.com/mozilla/neqo/test-fixture)
Finished `test` profile [unoptimized + debuginfo] target(s) in 2.85s
Running benches/transfer.rs (target/debug/deps/transfer-7ba07c4aa3c126f9)
Testing transfer/Run multiple transfers with varying seeds
0s 0ms TRACE [Pacer 2464/1232] spend 1232 over 12320, 100ms
0s 0ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 788940966 }
0s 0ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 788940966 }
0s 0ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 788940966 }
0s 0ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 788940966 }
0s 2ms TRACE [Pacer 2464/1232] spend 664 over 12320, 100ms
0s 2ms TRACE [Pacer 1800/1232] spend 568 over 12320, 100ms
0s 2ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 788940966 }
0s 2ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 854132492 }
0s 2ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 788940966 }
0s 2ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 854132492 }
0s 2ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 788940966 }
0s 2ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 854132492 }
0s 2ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 788940966 }
0s 2ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 854132492 }
0s 3ms TRACE [Pacer 1232/1232] spend 51 over 12320, 118.702751ms
0s 4ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 854132492 }
0s 4ms TRACE [Pacer 2464/1232] spend 82 over 12320, 118.702751ms
0s 4ms TRACE [Pacer 2382/1232] spend 256 over 12320, 118.702751ms
0s 4ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 854132492 }
0s 4ms TRACE [Pacer 2126/1232] next 12320/118.702751ms no wait = Instant { tv_sec: 12154, tv_nsec: 907643717 }
0s 4ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 854132492 }
0s 4ms TRACE [Pacer 2126/1232] next 12320/118.702751ms no wait = Instant { tv_sec: 12154, tv_nsec: 907643717 }
0s 4ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 854132492 }
0s 4ms TRACE [Pacer 2126/1232] next 12320/118.702751ms no wait = Instant { tv_sec: 12154, tv_nsec: 907643717 }
0s 4ms TRACE [Pacer 1232/1232] next 12320/100ms no wait = Instant { tv_sec: 12154, tv_nsec: 854132492 }
0s 4ms TRACE [Pacer 2126/1232] next 12320/118.702751ms no wait = Instant { tv_sec: 12154, tv_nsec: 907643717 }
0s 5ms TRACE [Pacer 2126/1232] next 12320/118.702751ms no wait = Instant { tv_sec: 12154, tv_nsec: 907643717 }
0s 5ms TRACE [Pacer 2126/1232] next 12320/118.702751ms no wait = Instant { tv_sec: 12154, tv_nsec: 907643717 }
0s 5ms TRACE [Pacer 2126/1232] next 12320/118.702751ms no wait = Instant { tv_sec: 12154, tv_nsec: 907643717 }
0s 5ms TRACE [Pacer 2126/1232] next 12320/118.702751ms no wait = Instant { tv_sec: 12154, tv_nsec: 907643717 }
0s 5ms TRACE [Pacer 1232/1232] spend 51 over 12320, 112.760801ms
0s 6ms TRACE [Pacer 2464/1232] spend 1281 over 12320, 112.760801ms
0s 6ms TRACE [Pacer 2126/1232] next 12320/118.702751ms no wait = Instant { tv_sec: 12154, tv_nsec: 907643717 }
0s 6ms TRACE [Pacer 1183/1232] next 12320/112.760801ms wait 224.24µs = Instant { tv_sec: 12155, tv_nsec: 19325906 }
0s 6ms TRACE [Pacer 1183/1232] next 12320/112.760801ms wait 224.24µs = Instant { tv_sec: 12155, tv_nsec: 19325906 }
0s 6ms TRACE [Pacer 2126/1232] next 12320/118.702751ms no wait = Instant { tv_sec: 12154, tv_nsec: 907643717 }
0s 6ms TRACE [Pacer 1183/1232] next 12320/112.760801ms wait 224.24µs = Instant { tv_sec: 12155, tv_nsec: 19325906 }
0s 7ms TRACE [Pacer 1183/1232] spend 1332 over 12320, 112.760801ms
0s 7ms TRACE [Pacer 2126/1232] next 12320/118.702751ms no wait = Instant { tv_sec: 12154, tv_nsec: 907643717 }
0s 7ms TRACE [Pacer 0/1232] next 12320/112.760801ms wait 5.63804ms = Instant { tv_sec: 12155, tv_nsec: 24963946 }
0s 7ms TRACE [Pacer 0/1232] next 12320/112.760801ms wait 5.63804ms = Instant { tv_sec: 12155, tv_nsec: 24963946 }
0s 7ms TRACE [Pacer 2126/1232] next 12320/118.702751ms no wait = Instant { tv_sec: 12154, tv_nsec: 907643717 }
`main` reduces its `cwnd` quickly to `6160`, while `0rtt-tmp` stays at `12320`.
That is probably because main erroneously marks 0-RTT as lost, and does a congestion reaction.
`main` does not pace (i.e. no `wait`) while `0rtt-tmp` paces within the first 50 logs (i.e. `wait`).
It might be that because main is in the recovery state, it can immediately (= not paced) retransmit a single packet.
In other words, the bug in main might make it over-aggressive, and 0rtt-tmp behaves (more) correctly.
`main` does not pace (i.e. no `wait`) while `0rtt-tmp` paces within the first 50 logs (i.e. `wait`). It might be that because `main` is in the recovery state, it can immediately (= not paced) retransmit a single packet.
Only when we send a packet < mtu and we are in RecoveryStart do we ignore the Pacer:
https://github.com/mozilla/neqo/blob/b63e0ff567342a220a9633d1c3ff2b15706bf5bb/neqo-transport/src/recovery/mod.rs#L914-L930
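A small sketch of the condition described above (illustrative names and states, not the code at the linked location):

```rust
// The pacer is bypassed only for a packet smaller than the MTU while the
// sender is in the RecoveryStart state; everything else is paced.
#[derive(PartialEq, Clone, Copy)]
enum CcState {
    SlowStart,
    CongestionAvoidance,
    RecoveryStart,
    Recovery,
}

fn ignore_pacer(state: CcState, packet_len: usize, mtu: usize) -> bool {
    packet_len < mtu && state == CcState::RecoveryStart
}
```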
Would that explain the near 100% performance regression on the client/server transfers with pacing enabled?
We should also look into why pacing is making things slower:
neqo neqo reno on 415.4 ± 39.1 375.7 480.0 1.00
neqo neqo reno 206.9 ± 21.3 181.8 242.1 1.00
neqo neqo cubic on 450.3 ± 57.6 385.6 575.1 1.00
neqo neqo cubic 285.0 ± 168.1 168.5 610.7 1.00
I am still puzzled about these benchmark regressions. I looked into it some more, but without success. I don't think my above guesses are valid.
@larseggert do you have further ideas worth investigating?
I am fine merging here even with the regression. I think bugfixes should take precedence over performance. That said, I would like to investigate further and ideally not uplift to Firefox before we have found the reason.
I will take a look at qlogs next. I can reproduce the regression with the Simulator, which should allow me to generate qlogs without impacting run time, given that time is simulated.
I went back to fbc5d62697c968dec968941d4c495f4fa6869c9d, which was the state of the code before #1603 was merged. I needed to add the ability to turn off pacing on the server, and comment out some noisy `eprintln!` output. I then did a hyperfine run, comparing our client and server with Reno, with and without pacing.
Without pacing:
Time (mean ± σ): 471.5 ms ± 16.8 ms [User: 305.3 ms, System: 114.1 ms]
Range (min … max): 430.3 ms … 485.9 ms 10 runs
With pacing:
Time (mean ± σ): 523.7 ms ± 14.0 ms [User: 299.6 ms, System: 120.2 ms]
Range (min … max): 495.8 ms … 546.4 ms 10 runs
In other words, we did see a similar performance difference before #1603 as we are seeing now.
However, we also see this performance difference with #1603. I'm now bisecting to see when that changed.
I did some comparison runs between this PR and HEAD. It seems that the performance difference between pacing and non-pacing is only present if the loopback MTU is 64KB.
This PR, MTU 1500:
Benchmark 1: target/release/neqo-client --cc reno --output-dir /tmp/tmp.w6tjsiaAdZ -o -a hq-interop https://127.0.0.1:4434/134217728
Time (mean ± σ): 552.8 ms ± 67.7 ms [User: 290.3 ms, System: 184.4 ms]
Range (min … max): 495.9 ms … 958.6 ms 100 runs
Benchmark 1: target/release/neqo-client --cc reno --no-pacing --output-dir /tmp/tmp.w6tjsiaAdZ -o -a hq-interop https://127.0.0.1:4434/134217728
Time (mean ± σ): 552.9 ms ± 47.8 ms [User: 285.3 ms, System: 182.3 ms]
Range (min … max): 490.8 ms … 772.2 ms 100 runs
This PR, MTU 64KB:
Benchmark 1: target/release/neqo-client --cc reno --output-dir /tmp/tmp.gKIfqAvJbs -o -a hq-interop https://127.0.0.1:4434/134217728
Time (mean ± σ): 397.0 ms ± 121.4 ms [User: 149.0 ms, System: 57.5 ms]
Range (min … max): 227.4 ms … 692.2 ms 100 runs
Benchmark 1: target/release/neqo-client --cc reno --no-pacing --output-dir /tmp/tmp.gKIfqAvJbs -o -a hq-interop https://127.0.0.1:4434/134217728
Time (mean ± σ): 262.3 ms ± 18.2 ms [User: 148.3 ms, System: 56.8 ms]
Range (min … max): 232.2 ms … 314.3 ms 100 runs
HEAD, MTU 1500:
Benchmark 1: target/release/neqo-client --cc reno --output-dir /tmp/tmp.GHDuGPjylE -o -a hq-interop https://127.0.0.1:4434/134217728
Time (mean ± σ): 578.8 ms ± 48.4 ms [User: 311.1 ms, System: 211.9 ms]
Range (min … max): 520.0 ms … 766.4 ms 100 runs
Benchmark 1: target/release/neqo-client --cc reno --no-pacing --output-dir /tmp/tmp.GHDuGPjylE -o -a hq-interop https://127.0.0.1:4434/134217728
Time (mean ± σ): 577.3 ms ± 39.5 ms [User: 327.2 ms, System: 213.6 ms]
Range (min … max): 540.2 ms … 789.1 ms 100 runs
HEAD, MTU 64KB:
Benchmark 1: target/release/neqo-client --cc reno --output-dir /tmp/tmp.6wBXumkvZX -o -a hq-interop https://127.0.0.1:4434/134217728
Time (mean ± σ): 256.8 ms ± 34.3 ms [User: 144.6 ms, System: 52.7 ms]
Range (min … max): 221.5 ms … 501.1 ms 100 runs
Benchmark 1: target/release/neqo-client --cc reno --no-pacing --output-dir /tmp/tmp.6wBXumkvZX -o -a hq-interop https://127.0.0.1:4434/134217728
Time (mean ± σ): 245.1 ms ± 23.4 ms [User: 142.5 ms, System: 52.7 ms]
Range (min … max): 220.5 ms … 432.7 ms 100 runs
I wouldn't read too much into those differences. A larger MTU is going to mean that overheads are more pronounced.
Also, pacing is always going to be slower in this sort of test. Pacing is designed to be gentle on queues in the network at the cost of extra CPU and delay at the endpoint that paces. As long as the cost is small, we're good.
On further investigation, this seems to be related to what `PACER_SPEEDUP` is set to. When I double it to 4, the difference for this PR with a 64KB MTU goes away:
Benchmark 1: target/release/neqo-client --cc reno --output-dir /tmp/tmp.AP82nSCWV4 -o -a hq-interop https://127.0.0.1:4434/134217728
Time (mean ± σ): 268.4 ms ± 36.8 ms [User: 155.3 ms, System: 53.1 ms]
Range (min … max): 230.9 ms … 514.0 ms 100 runs
Benchmark 1: target/release/neqo-client --cc reno --no-pacing --output-dir /tmp/tmp.AP82nSCWV4 -o -a hq-interop https://127.0.0.1:4434/134217728
Time (mean ± σ): 264.8 ms ± 36.1 ms [User: 158.5 ms, System: 58.4 ms]
Range (min … max): 228.5 ms … 503.1 ms 100 runs
Given all this, I think we can merge this.
This isn't quite ready yet: while fixing the test failure after merging HEAD, I realized we have another bug. When there is packet loss during the handshake, we do not always ACK when we should. That's because sending an ACK resets the ACK timer (even during the handshake), which causes ACKs not to be sent when they should be (on retransmit).
I'm working on a fix, and want to see that merged before continuing on this one, because that fix might impact what needs to change in this PR.
That other PR is #2027.