neqo icon indicating copy to clipboard operation
neqo copied to clipboard

fix: Mark all packets TX'ed before PTO as lost

Open larseggert opened this issue 1 year ago • 6 comments

We'd previously only mark 1 one or two packets as lost when a PTO fired. That meant that we potentially didn't RTX all data that we could have (i.e., that was in lost packets that we didn't mark lost).

This also changes the probing code to suppress redundant keep-alives, i.e., PINGs that we sent for other reasons, which could double as keep-alives but did not.

Broken out of #1998

larseggert avatar Sep 19 '24 12:09 larseggert

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to b931a289eee7d62c0815535f01cfa34c5a929f9d.

neqo-latest as client

neqo-latest as server

All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

github-actions[bot] avatar Sep 19 '24 13:09 github-actions[bot]

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 95.42%. Comparing base (8b4a9c9) to head (d8f1214).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2129   +/-   ##
=======================================
  Coverage   95.41%   95.42%           
=======================================
  Files         115      115           
  Lines       36996    36992    -4     
  Branches    36996    36992    -4     
=======================================
- Hits        35301    35300    -1     
+ Misses       1689     1686    -3     
  Partials        6        6           
Components Coverage Δ
neqo-common 97.17% <ø> (ø)
neqo-crypto 90.44% <ø> (ø)
neqo-http3 94.50% <ø> (ø)
neqo-qpack 96.29% <ø> (ø)
neqo-transport 96.26% <100.00%> (+0.01%) :arrow_up:
neqo-udp 95.29% <ø> (ø)

codecov[bot] avatar Sep 19 '24 13:09 codecov[bot]

Benchmark results

Performance differences relative to 9b9f28267de1d744e270cbb0725c0e57d170b745.

decode 4096 bytes, mask ff: No change in performance detected.
       time:   [11.853 µs 11.895 µs 11.942 µs]
       change: [-0.1831% +0.2708% +0.7641%] (p = 0.28 > 0.05)

Found 20 outliers among 100 measurements (20.00%) 2 (2.00%) low severe 6 (6.00%) low mild 1 (1.00%) high mild 11 (11.00%) high severe

decode 1048576 bytes, mask ff: No change in performance detected.
       time:   [3.0685 ms 3.0779 ms 3.0890 ms]
       change: [-0.5254% -0.0469% +0.4310%] (p = 0.88 > 0.05)

Found 10 outliers among 100 measurements (10.00%) 1 (1.00%) low mild 9 (9.00%) high severe

decode 4096 bytes, mask 7f: No change in performance detected.
       time:   [19.791 µs 19.843 µs 19.900 µs]
       change: [-0.4664% -0.0184% +0.3429%] (p = 0.93 > 0.05)

Found 21 outliers among 100 measurements (21.00%) 3 (3.00%) low severe 3 (3.00%) low mild 15 (15.00%) high severe

decode 1048576 bytes, mask 7f: No change in performance detected.
       time:   [5.1465 ms 5.1580 ms 5.1711 ms]
       change: [-0.3820% -0.0391% +0.3158%] (p = 0.83 > 0.05)

Found 13 outliers among 100 measurements (13.00%) 13 (13.00%) high severe

decode 4096 bytes, mask 3f: No change in performance detected.
       time:   [6.8668 µs 6.8719 µs 6.8815 µs]
       change: [-0.5871% +0.0812% +1.0176%] (p = 0.86 > 0.05)

Found 15 outliers among 100 measurements (15.00%) 5 (5.00%) high mild 10 (10.00%) high severe

decode 1048576 bytes, mask 3f: No change in performance detected.
       time:   [1.7594 ms 1.7663 ms 1.7747 ms]
       change: [-0.1505% +0.4023% +1.0339%] (p = 0.17 > 0.05)

Found 10 outliers among 100 measurements (10.00%) 10 (10.00%) high severe

1 streams of 1 bytes/multistream: No change in performance detected.
       time:   [68.961 µs 69.539 µs 70.573 µs]
       change: [-2.2059% -0.2415% +1.8217%] (p = 0.81 > 0.05)

Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe

1000 streams of 1 bytes/multistream: Change within noise threshold.
       time:   [24.838 ms 24.870 ms 24.903 ms]
       change: [+0.0198% +0.2153% +0.4047%] (p = 0.03 
10000 streams of 1 bytes/multistream: Change within noise threshold.
       time:   [1.6683 s 1.6701 s 1.6720 s]
       change: [+0.0264% +0.1789% +0.3398%] (p = 0.03 Found 23 outliers among 100 measurements (23.00%)
1 (1.00%) low severe
13 (13.00%) low mild
1 (1.00%) high mild
8 (8.00%) high severe
1 streams of 1000 bytes/multistream: No change in performance detected.
       time:   [70.580 µs 71.562 µs 72.991 µs]
       change: [-0.2161% +1.1952% +3.7551%] (p = 0.22 > 0.05)

Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) high mild 2 (2.00%) high severe

100 streams of 1000 bytes/multistream: No change in performance detected.
       time:   [3.3333 ms 3.3398 ms 3.3467 ms]
       change: [-0.3140% -0.0457% +0.2232%] (p = 0.74 > 0.05)

Found 22 outliers among 100 measurements (22.00%) 22 (22.00%) high severe

1000 streams of 1000 bytes/multistream: Change within noise threshold.
       time:   [145.67 ms 145.74 ms 145.82 ms]
       change: [-0.4432% -0.3643% -0.2861%] (p = 0.00 Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
coalesce_acked_from_zero 1+1 entries: No change in performance detected.
       time:   [92.530 ns 92.870 ns 93.209 ns]
       change: [-0.4077% +0.0189% +0.4379%] (p = 0.93 > 0.05)

Found 10 outliers among 100 measurements (10.00%) 7 (7.00%) high mild 3 (3.00%) high severe

coalesce_acked_from_zero 3+1 entries: No change in performance detected.
       time:   [110.22 ns 110.52 ns 110.87 ns]
       change: [-0.7203% -0.0344% +0.6664%] (p = 0.93 > 0.05)

Found 11 outliers among 100 measurements (11.00%) 2 (2.00%) low mild 2 (2.00%) high mild 7 (7.00%) high severe

coalesce_acked_from_zero 10+1 entries: No change in performance detected.
       time:   [109.80 ns 110.11 ns 110.50 ns]
       change: [-0.6122% -0.0751% +0.4483%] (p = 0.79 > 0.05)

Found 8 outliers among 100 measurements (8.00%) 4 (4.00%) low mild 4 (4.00%) high severe

coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
       time:   [91.334 ns 91.371 ns 91.412 ns]
       change: [-14.471% -5.2698% +0.4266%] (p = 0.36 > 0.05)

Found 15 outliers among 100 measurements (15.00%) 6 (6.00%) high mild 9 (9.00%) high severe

RxStreamOrderer::inbound_frame(): No change in performance detected.
       time:   [115.69 ms 115.74 ms 115.80 ms]
       change: [-0.1067% -0.0450% +0.0185%] (p = 0.17 > 0.05)

Found 20 outliers among 100 measurements (20.00%) 8 (8.00%) low severe 2 (2.00%) high mild 10 (10.00%) high severe

SentPackets::take_ranges: No change in performance detected.
       time:   [5.3293 µs 5.4842 µs 5.6501 µs]
       change: [-3.3789% -0.3503% +2.7603%] (p = 0.83 > 0.05)

Found 6 outliers among 100 measurements (6.00%) 5 (5.00%) high mild 1 (1.00%) high severe

transfer/pacing-false/varying-seeds: Change within noise threshold.
       time:   [34.490 ms 34.571 ms 34.654 ms]
       change: [+1.5050% +1.7832% +2.0738%] (p = 0.00 Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
transfer/pacing-true/varying-seeds: Change within noise threshold.
       time:   [34.736 ms 34.798 ms 34.860 ms]
       change: [+2.4377% +2.6805% +2.9250%] (p = 0.00 
transfer/pacing-false/same-seed: :broken_heart: Performance has regressed.
       time:   [34.846 ms 34.889 ms 34.933 ms]
       change: [+3.1366% +3.3257% +3.5059%] (p = 0.00 Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) low mild
3 (3.00%) high mild
transfer/pacing-true/same-seed: Change within noise threshold.
       time:   [35.239 ms 35.295 ms 35.351 ms]
       change: [+2.3837% +2.5987% +2.8080%] (p = 0.00 
1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: Change within noise threshold.
       time:   [2.1931 s 2.2014 s 2.2099 s]
       thrpt:  [45.251 MiB/s 45.425 MiB/s 45.597 MiB/s]
change:
       time:   [-1.7534% -1.2122% -0.6459%] (p = 0.00 +1.2271% +1.7847%]

Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild

1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: No change in performance detected.
       time:   [389.36 ms 391.71 ms 394.15 ms]
       thrpt:  [25.371 Kelem/s 25.529 Kelem/s 25.683 Kelem/s]
change:
       time:   [-0.8230% -0.0559% +0.7204%] (p = 0.89 > 0.05)
       thrpt:  [-0.7153% +0.0559% +0.8298%]

Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild

1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: :broken_heart: Performance has regressed.
       time:   [30.396 ms 31.127 ms 31.832 ms]
       thrpt:  [31.414  elem/s 32.126  elem/s 32.899  elem/s]
change:
       time:   [+3.9740% +7.4931% +11.211%] (p = 0.00 -6.9707% -3.8222%]
1-conn/1-100mb-resp/mtu-1504 (aka. Upload)/client: :broken_heart: Performance has regressed.
       time:   [3.5065 s 3.5291 s 3.5518 s]
       thrpt:  [28.155 MiB/s 28.335 MiB/s 28.518 MiB/s]
change:
       time:   [+7.9570% +9.0897% +10.205%] (p = 0.00 -8.3323% -7.3706%]

Client/server transfer results

Performance differences relative to 9b9f28267de1d744e270cbb0725c0e57d170b745.

Transfer of 33554432 bytes over loopback, 30 runs. All unit-less numbers are in milliseconds.

Client Server CC Pacing Mean ± σ Min Max Δ main Δ main
neqo neqo reno on 517.1 ± 103.9 439.6 975.1 5.3 0.3%
neqo neqo reno 560.1 ± 184.3 441.5 1153.2 -2.0 -0.1%
neqo neqo cubic on 502.6 ± 39.3 459.4 628.7 -22.7 -1.1%
neqo neqo cubic 506.8 ± 45.6 451.1 646.4 -2.3 -0.1%
google neqo reno on 908.9 ± 101.9 672.2 1121.7 8.4 0.2%
google neqo reno 911.8 ± 100.7 648.3 1098.7 0.0 0.0%
google neqo cubic on 902.2 ± 95.0 662.6 991.9 2.7 0.1%
google neqo cubic 900.2 ± 95.2 646.0 984.0 0.9 0.0%
google google 563.1 ± 53.7 532.4 774.9 8.2 0.4%
neqo msquic reno on 224.1 ± 32.6 199.8 382.6 6.8 0.8%
neqo msquic reno 220.5 ± 19.2 199.0 278.0 -6.0 -0.7%
neqo msquic cubic on 214.2 ± 10.2 198.8 242.9 -6.3 -0.7%
neqo msquic cubic 217.5 ± 14.9 196.8 254.9 -1.7 -0.2%
msquic msquic 129.8 ± 43.2 97.9 332.8 -2.1 -0.4%

:arrow_down: Download logs

github-actions[bot] avatar Sep 19 '24 13:09 github-actions[bot]

Firefox builds for this PR

The following builds are available for testing. Crossed-out builds did not succeed.

github-actions[bot] avatar Sep 19 '24 15:09 github-actions[bot]

@martinthomson I'd appreciate a review, since the code I am touching is pretty complex.

larseggert avatar Sep 20 '24 07:09 larseggert

Do we not have tests for this? Should we?

There are tests in #2128, but this PR alone doesn't make them succeed yet.

larseggert avatar Sep 26 '24 10:09 larseggert

We'd previously only mark 1 one or two packets as lost when a PTO fired. That meant that we potentially didn't RTX all data that we could have (i.e., that was in lost packets that we didn't mark lost).

Will work on a unit test for this in a bit.

This also changes the probing code to suppress redundant keep-alives, i.e., PINGs that we sent for other reasons, which could double as keep-alives but did not.

I suggest merging this fix through https://github.com/mozilla/neqo/pull/2509.

mxinden avatar Mar 19 '25 11:03 mxinden

We'd previously only mark 1 one or two packets as lost when a PTO fired. That meant that we potentially didn't RTX all data that we could have (i.e., that was in lost packets that we didn't mark lost).

Will work on a unit test for this in a bit.

Unit test added through https://github.com/mozilla/neqo/pull/2129/commits/954dc50941f863e0e72e3494339615e461bf0e55.

This also changes the probing code to suppress redundant keep-alives, i.e., PINGs that we sent for other reasons, which could double as keep-alives but did not.

I suggest merging this fix through #2509.

Let's wait for #2509 to merge. Once merged, I will update this pull request, thus reducing the diff.

mxinden avatar Mar 19 '25 14:03 mxinden

@mxinden anything left to be done for this one?

larseggert avatar Mar 21 '25 14:03 larseggert