Consider reducing MAX_PATH_PROBES with ECN
Currently we probe a path up to 6 times, i.e. 3 times with ECN, 3 times without ECN.
https://github.com/mozilla/neqo/blob/99144585db35fb7dd7e60a78019df5d184ecad51/neqo-transport/src/path.rs#L35-L39
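To make the probing schedule described above concrete, here is a minimal sketch. It is not the neqo implementation: the constants ECN_PROBES and NON_ECN_PROBES, the ProbeMarking enum and the marking_for_probe function are hypothetical names chosen for illustration, and the 3 + 3 split is taken from the description in this issue.

```rust
/// Hypothetical constants: probes sent with ECN marks, then without.
/// The 3 + 3 split mirrors the description in this issue, not neqo's code.
const ECN_PROBES: usize = 3;
const NON_ECN_PROBES: usize = 3;

/// Hypothetical marking attached to a path probe.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ProbeMarking {
    /// Probe carries an ECN mark.
    Ect0,
    /// Probe carries no ECN mark.
    NotEct,
}

/// Returns the marking for the `n`-th probe (0-based),
/// or `None` once the whole probe budget is used up.
fn marking_for_probe(n: usize) -> Option<ProbeMarking> {
    if n < ECN_PROBES {
        Some(ProbeMarking::Ect0)
    } else if n < ECN_PROBES + NON_ECN_PROBES {
        Some(ProbeMarking::NotEct)
    } else {
        None
    }
}
```

In this sketch, reducing ECN_PROBES from 3 to 1 would move the first unmarked probe from the fourth attempt to the second.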
On Firefox Nightly, around 2% of connection attempts hit an ECN black hole. In other words, for 2% of connection attempts the first 3 path probes with ECN fail and one of the subsequent path probes without ECN succeeds.
https://yardstick.mozilla.org/d/aeak3dvriig3kd/http3?orgId=1&from=now-7d&to=now&timezone=browser&viewPanel=panel-3
If I understand correctly, our initial PTO should be ~100ms. Thus ~2% of HTTP3 connections get delayed by 300ms. I would assume most of these ~2% thus lose the race to a concurrent HTTP2 connection attempt.
Given the insights from Firefox Nightly, should we reduce the first MAX_PATH_PROBES with ECN from 3 to e.g. 1? Or do we consider ~2% not significant enough?
Should we do an experiment?
What metric would you seek to optimize? HTTP/3 usage? ECN usage? Connection establishment time?
Should we do an experiment?
We can. That said, having done one myself now, I am not sure it is worth the bureaucratic overhead.
Given that Firefox Nightly is only a small population, and given the overhead of an experiment, how about we enable ECN on Firefox Early Beta first? Rolling out to more devices gives us higher confidence in the percentage of ECN black holes seen by Firefox users. If the percentage of black holes is still relatively high, we either (a) do an experiment or (b) make a change like the one suggested above.
What metric would you seek to optimize? HTTP/3 usage? ECN usage? Connection establishment time?
Metrics worth monitoring:
- Number of HTTP responses by version via https://dictionary.telemetry.mozilla.org/apps/firefox_desktop/metrics/http_response_version . A drop in http3 responses might signal an increase in http3 connection establishment latency, and thus http2 winning the race against http3.
- Number of blackholed connection attempts via https://dictionary.telemetry.mozilla.org/apps/firefox_desktop/metrics/networking_http_3_ecn_path_capability
- Connection establishment time via https://dictionary.telemetry.mozilla.org/apps/firefox_desktop/metrics/network_sup_http3_tcp_connection
If I understand correctly, our initial PTO should be ~100ms. Thus ~2% of HTTP3 connections get delayed by 300ms. I would assume most of these ~2% thus lose the race to a concurrent HTTP2 connection attempt.
Once we no longer do time threshold based loss detection before the first ACK (i.e. https://github.com/mozilla/neqo/pull/2492), this estimate is wrong. Our initial RTT is 100ms. Thus the first PTO is 100ms + 4*rttvar, where rttvar is RTT/2, i.e. 300ms. The second PTO is 600ms. The third PTO is 1200ms. In sum, it takes 2100ms before we detect an ECN blackhole and thus stop marking with Ect0.
See changes to handshake_delay_with_ecn_blackhole test in https://github.com/mozilla/neqo/pull/2492/.
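The PTO arithmetic above can be checked with a small sketch. It assumes an initial RTT of 100ms, rttvar initialized to RTT/2, and a PTO that doubles on every expiry; it ignores max_ack_delay and timer granularity. The function names are made up for illustration and are not neqo APIs.

```rust
use std::time::Duration;

/// Initial RTT assumed in this discussion.
const INITIAL_RTT: Duration = Duration::from_millis(100);

/// PTO before any RTT sample exists: smoothed_rtt + 4 * rttvar,
/// with rttvar initialized to half the initial RTT.
/// max_ack_delay and timer granularity are left out of this sketch.
fn initial_pto(initial_rtt: Duration) -> Duration {
    let rttvar = initial_rtt / 2;
    initial_rtt + 4 * rttvar
}

/// The `n`-th PTO (0-based), doubling on every expiry.
fn pto_with_backoff(initial_rtt: Duration, n: u32) -> Duration {
    initial_pto(initial_rtt) * 2u32.pow(n)
}

/// Total time spent waiting for `probes` consecutive PTOs to expire,
/// i.e. the delay before the ECN-marked probes are given up on.
fn time_to_detect_blackhole(initial_rtt: Duration, probes: u32) -> Duration {
    (0..probes).map(|n| pto_with_backoff(initial_rtt, n)).sum()
}
```

Under these assumptions, time_to_detect_blackhole(INITIAL_RTT, 3) gives the 2100ms quoted above, while a single ECN-marked probe would cost 300ms.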
We are currently rolling out ECN support to Firefox Early Beta. Unless that reveals significantly different numbers than those shared above, I think we need to take action before going to Firefox Release.
Options thus far:
- Reduce the number of MAX_PATH_PROBES with ECN.
- Reduce the initial PTO time with ECN.
- Only start marking packets after the handshake is done.
As a quick fix, I would suggest doing either or both of your first two bullets. If we then still see issues, maybe do the third bullet.
I can't find reports by others of such high ECN black-hole rates (i.e. > 2%). Thus I somewhat doubt the metrics I introduced.
We previously reduced the set of connection failures we consider an ECN blackhole via https://phabricator.services.mozilla.com/D239884.
I propose restricting this set even further, considering a path to be an ECN blackhole only if the connection handshake succeeds after ECN black-hole detection, i.e. without ECN marking. Linking https://phabricator.services.mozilla.com/D244507 here for the record.
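A minimal sketch of the stricter classification proposed above. The Attempt struct, the Verdict enum and the classify function are hypothetical and do not reflect the actual Firefox or neqo telemetry code; they only illustrate the rule that a path counts as an ECN blackhole solely when the handshake succeeds after ECN marking was turned off.

```rust
/// Hypothetical summary of a single connection attempt.
struct Attempt {
    /// All ECN-marked probes were lost, so marking was disabled.
    ecn_disabled_after_loss: bool,
    /// The handshake eventually completed.
    handshake_succeeded: bool,
}

/// Hypothetical classification result.
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    /// ECN-marked packets were not all lost.
    NotBlackHole,
    /// Marked probes were lost, but so was everything else:
    /// the path may simply be down, so nothing is recorded.
    Inconclusive,
    /// Marked probes were lost and unmarked ones got through.
    BlackHole,
}

/// Counts a path as an ECN blackhole only if the handshake
/// succeeds once ECN marking has been disabled.
fn classify(attempt: &Attempt) -> Verdict {
    match (attempt.ecn_disabled_after_loss, attempt.handshake_succeeded) {
        (false, _) => Verdict::NotBlackHole,
        (true, false) => Verdict::Inconclusive,
        (true, true) => Verdict::BlackHole,
    }
}
```

The point of the middle case is that a failed connection attempt on a broken or unreachable path no longer inflates the blackhole count.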
Firefox ECN roll out - status update
Previously we didn't split the ECN roll out into ECN marking and ECN reporting, but instead wanted to roll out both together.
ECN reporting
The issue at hand only affects ECN marking, not ECN reporting. Thus, with Bug 1961340, we will start shipping ECN reporting already.
ECN marking
This GitHub issue is the only blocker for the roll out of ECN marking. That said, ...
I propose restricting this set even further, considering a path to be an ECN blackhole only if the connection handshake succeeds after ECN black-hole detection, i.e. without ECN marking. Linking https://phabricator.services.mozilla.com/D244507 here for the record.
Since https://phabricator.services.mozilla.com/D244507 landed in Firefox Nightly, the number of ECN blackholes we see has dropped significantly. Below is a graph showing the number of ECN blackholes (blue) relative to the number of ECN capable paths (green). We now see less than 0.1% of paths blackholing ECN on Firefox Nightly.
This indicates that our previous measurements were simply off. I will continue to monitor these metrics in the next couple of days. That said, I don't expect a significant change.
Seeing less than 0.1% of paths blackholing ECN, I don't think this GitHub issue is still relevant. While we continue to have a significant connection establishment delay on ECN blackhole paths, I argue that with the small number of ECN blackholes in total, this problem is not worth fixing.
With this new insight, I suggest we also have ECN marking ride the Firefox release trains. Objections?