qlog
add ECN events
Endpoints using ECN will perform validation of ECN, as described in https://datatracker.ietf.org/doc/html/rfc9000#section-13.4.2. There are a few different conditions that can lead to a failure of ECN validation, which in turn leads to disabling ECN for that path. We should have an event that tells us that ECN validation failed and why.
This sounds good. I'd like to pull in people with more ECN experience to be part of the discussion. Asking on the list seems like the best way to do that.
Agreed we could use another perspective on this.
Maybe @huitema has input on this? Probably also @goelvidhi, who's been working with L4S etc. at Apple and also has used qlog?
Yes, having an event stating that ECN negotiation failed would be nice.
ECN negotiation is per path, and whatever logging we do has to reflect that. A connection can migrate from a path that supports ECN to one that does not; nodes are supposed to perform the validation after each migration event, or in the case of multipath after each path setup.
I don't remember whether we are logging the ECN bits in the "datagram_sent" and "datagram_received" events. That might be a good idea.
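To make the idea of logging ECN bits concrete, here is a minimal sketch of decoding the ECN codepoint from the IP TOS / Traffic Class byte and attaching it to a datagram event. The codepoint values come from RFC 3168; the `ecn` field name, the `length` field, and the helper names `ecn_from_tos` and `datagram_event` are assumptions made for illustration, not something qlog defines today.

```python
# ECN uses the two least-significant bits of the IPv4 TOS byte or the
# IPv6 Traffic Class byte (RFC 3168).
ECN_CODEPOINTS = {
    0b00: "Not-ECT",
    0b01: "ECT(1)",
    0b10: "ECT(0)",
    0b11: "CE",
}


def ecn_from_tos(tos: int) -> str:
    """Return the ECN codepoint name for a TOS / Traffic Class byte."""
    return ECN_CODEPOINTS[tos & 0b11]


def datagram_event(name: str, length: int, tos: int) -> dict:
    """Build a qlog-like event dict.

    The "ecn" and "length" fields are hypothetical; they only show where
    the codepoint could be recorded.
    """
    return {"name": name, "data": {"length": length, "ecn": ecn_from_tos(tos)}}


if __name__ == "__main__":
    print(datagram_event("datagram_sent", 1200, 0x02))
    print(datagram_event("datagram_received", 1200, 0x03))
```

Logging the codepoint as a string rather than a raw integer keeps the trace readable and leaves room to extend the set of values later.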
Another question is how ECN probing works. I haven't managed to come up with a good algorithm yet (that's why quic-go isn't sending ECN marks). Do you start the handshake with ECN marks, run into a timeout, conclude that ECN is blackholed, and restart the handshake? Or do you occasionally send packets with ECN marks after handshake completion, and switch on ECN marking once you've received an ACK for such a probe packet? That would be very similar to the DPLPMTUD probing logic.
I managed to sync with @goelvidhi during the last IETF meeting. The ECN validation mechanism described in Appendix A.4 of RFC 9000 is apparently sound and can be safely implemented with few modifications.
Given that ECN probing is a state machine, it would make sense to log it as such:
ecn_state_updated = {
    new: ECNState
    ? old: ECNState
}

ECNState = "testing" / "unknown" / "failed" / "capable"
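As a rough illustration of how an implementation might drive these states and emit the event, here is a minimal per-path sketch. It loosely follows the idea in RFC 9000 Appendix A.4 of marking the first ten packets and then waiting for feedback. The class name `ECNTracker`, its method names, the `path_id` field, and the exact transition rules are assumptions for illustration, not how any particular stack does it.

```python
from dataclasses import dataclass, field

# RFC 9000 Appendix A.4 suggests marking the first ten packets on a path
# (or marking for three PTOs); only the packet count is modelled here.
TESTING_PACKETS = 10


@dataclass
class ECNTracker:
    """Hypothetical per-path ECN validation state machine."""

    path_id: str  # assumed identifier; ECN validation is per path
    state: str = "testing"
    sent_marked: int = 0
    lost_marked: int = 0
    events: list = field(default_factory=list)

    def _set_state(self, new: str) -> None:
        if new == self.state:
            return
        self.events.append({
            "name": "ecn_state_updated",
            "data": {"path_id": self.path_id, "old": self.state, "new": new},
        })
        self.state = new

    def should_mark(self) -> bool:
        # Mark while probing and once the path is known to be capable.
        return self.state in ("testing", "capable")

    def on_packet_sent(self) -> None:
        if self.state != "testing":
            return
        self.sent_marked += 1
        if self.sent_marked >= TESTING_PACKETS:
            # Stop marking and wait to learn the fate of the probes.
            self._set_state("unknown")

    def on_marked_packet_acked(self, counts_valid: bool) -> None:
        if self.state == "failed":
            return
        if not counts_valid:
            # Missing or inconsistent ECN counts in the ACK frame.
            self._set_state("failed")
        elif self.state in ("testing", "unknown"):
            self._set_state("capable")

    def on_marked_packet_lost(self) -> None:
        self.lost_marked += 1
        # Conclude blackholing only once every probe packet was lost.
        if self.state == "unknown" and self.lost_marked >= self.sent_marked:
            self._set_state("failed")


if __name__ == "__main__":
    tracker = ECNTracker(path_id="path-0")
    for _ in range(TESTING_PACKETS):
        tracker.on_packet_sent()
    tracker.on_marked_packet_acked(counts_valid=True)
    for event in tracker.events:
        print(event)
```

Creating a fresh tracker after each migration (or for each path in multipath) would keep the logged states tied to the path they describe.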
Maybe we should also add a state before "testing", as I think it's a valid implementation strategy to not bother with ECN during the handshake, and only apply ECN markings after completion of the handshake.
Regarding L4S / Prague, there will be more details needed on this event. Given that these are still draft / experimental RFCs, maybe we can punt this though?
Agree with punting details for L4S / Prague to later documents. But, as I mentioned on the PR, I would make the ECN field contents explicitly extensible already.
Sorry for the late reply, but I don't think L4S/Prague should be punted. Even though the RFCs are experimental, L4S is getting deployed (with Apple shipping it and Comcast currently doing user trials).
I think we need to see implementer interest in adding qlog support for L4S / Prague. The qlog editor team doesn't have enough expertise and we need to prioritize the work we already have left to do.
Agree with Lucas here. I agree L4S is useful of course (and congrats on the live experiments!), but if people want it in qlog it should be an extension, not part of the core documents at this point.
@goelvidhi Do we need anything special for L4S? As far as my limited understanding goes, the only thing required on the wire is using the other ECT code point, which we can already encode when this PR is merged.
Where we'd need additional events is when we want to log events specific to the Prague state machine (that don't map to NewReno). But this is a problem that already exists today, for example, the qlog documents don't tell you how to log BBR events.