qlog icon indicating copy to clipboard operation
qlog copied to clipboard

Additional triggers and info for dropped packets

Open rmarx opened this issue 6 years ago • 8 comments

The current text lists about 8 reasons for dropping packets. Microsoft's implementation lists 60+ individual reasons (via @nibanks)

Some of those:

LogDrop("Retry sent to server"); LogDrop("Already received server response"); LogDrop("No room for ODCID"); LogDrop("Invalid ODCID"); LogDrop("InitialToken alloc failed"); LogDrop("OrigCID alloc failed"); LogDrop("Max deferred datagram count reached"); LogDrop("Key no longer accepted"); LogDrop("SH packet during version negotiation"); LogDrop("Too short for HP"); LogDrop("Packet number too big"); LogDrop("Payload length less than encryption tag"); LogDrop("Generate new packet keys"); LogDrop("Decryption failure"); LogDrop("Invalid SH Reserved bits values"); LogDrop("Invalid LH Reserved bits values"); LogDrop("Duplicate packet number"); LogDrop("Key no longer accepted (batch)"); LogDrop("Failed to compute HP mask"); LogDrop("Different remote address"); LogDrop("Too small for Packet->Invariant"); LogDrop("LH no room for DestCID"); LogDrop("Zero length DestCID"); LogDrop("LH no room for SourceCID"); LogDrop("SH no room for DestCID"); LogDrop("DestCID don't match"); LogDrop("SourceCID don't match"); LogDrop("Greater than allowed max CID length"); LogDropWithValue("Invalid client/server packet type", Packet->LH->Type); LogDrop("Invalid LH FixedBit bits values"); LogDrop("Long header has invalid token length"); LogDropWithValue("Long header has token length larger than buffer length", TokenLengthVarInt); LogDrop("Long header has invalid payload length"); LogDropWithValue("Long header has length larger than buffer length", LengthVarInt); LogDropWithValue("Long Header doesn't have enough room for packet number", LogDrop("Invalid SH FixedBit bits values"); LogDrop("Non-initial packet not matched with a Connection"); LogDrop("Retry received after initial");

I feel that we don't need to list things in this level of detail in the qlog spec (the "trigger" field allows any text anyway). However, maybe some guidance text on this would be helpful and maybe a few more suggested triggers would be interesting.

rmarx avatar Oct 09 '19 09:10 rmarx

Wonder how many reasons Microsoft has for dropping in packets in 2022. Any input @nibanks? What are your top 20 used reasons?

I think we need to be judiscious with minimizing scope in the qlog document. Unless the triggers are common and self explanatory, we'll just risk confounding people with options or tying ourselves in knots with text to explain them.

Our escape valve is that this is free form text and probably not commonly used for anything special. Being frugal in the base spec is an advantage.

LPardue avatar Sep 07 '22 23:09 LPardue

fwiw, this is what quic-go has: https://github.com/lucas-clemente/quic-go/blob/62b82789c0cc8a9acf48ee516ce2dd303253f395/logging/types.go#L36-L59

It seems like many of @nibanks' reasons are just variations of a generic "header parse error", and could be logged as details to that event.

marten-seemann avatar Sep 08 '22 11:09 marten-seemann

In my experience, if you're debugging an incident, you need to know the exact line of code, the exact reason why a drop happened, because you need to figure out how to fix it right then. If you only log enough to narrow the scope down to a dozen different code places, that doesn't help.

I don't care if there is one field for all the info, or two, with one of them being the "detailed_info" or something, but all the details need to be captured to make it useful.

nibanks avatar Sep 08 '22 12:09 nibanks

I agree deployments would want the ability to log to that level.

Let's look at HTTP status codes for example. Proxies can fail requests with high level 4xx or 5xx status codes and then use the proxy-status header to include additional data https://datatracker.ietf.org/doc/html/rfc9209; namely the error parameter. HTTP also has problem details https://www.ietf.org/archive/id/draft-ietf-httpapi-rfc7807bis-04.html.

This is effectively a 3 layer model of coarseness - limited top level, broader mid level, very specific low layer. Each of these layers is also well structured, which makes interoperability easier - I.e. grepping out specific cases for alerting or investigation.

To map to qlog we could have a limited set of top-level triggers, an optional error type, and a field for detailed info. The last field is likely to be text but in theory you could indicate the format of that text for interop. E.g., if I want to encode JSON in event and singal that to consumers explicitly, using a media type could help. But is suspect text is probably enough for a lot of folks.

LPardue avatar Sep 08 '22 12:09 LPardue

just adding a datapoint in favor for being as specific as possible about the reason a packet got dropped on the floor in MVFST we have about 40 different reasons and as @nibanks says they are useful to identify the specific line of code that drops the packet https://github.com/facebookincubator/mvfst/blob/main/quic/state/QuicTransportStatsCallback.h#L30-L73

That being said encoding this information with a generic "Packet Dropped" event and an optional free-form reason/detailed_info field seems like the best course of action to me since it's not up to qlog specs to define which errors are useful and which ones are less useful.

lnicco avatar Sep 08 '22 14:09 lnicco

Looking at this, I would agree keeping the list in qlog itself quite short and providing a generic field where implementations can add any type of freeform info they desire.

For this, we need 2 things:

  1. Decide WHICH of these current triggers on packet_dropped we want to keep (or do we want to rename/exchange some maybe?). Probably best coming from someone with a large deployment (@LPardue @nibanks @lnicco):
        "key_unavailable" /
        "unknown_connection_id" /
        "header_parse_error" /
        "payload_decrypt_error" /
        "protocol_violation" /
        "dos_prevention" /
        "unsupported_version" /
        "unexpected_packet" /
        "unexpected_source_connection_id" /
        "unexpected_version" /
        "duplicate" /
        "invalid_initial"
  1. Add a details field to the event. I have it as a JSON object instead of a text but I don't mind either way.
TransportPacketDropped = {
    ; primarily packet_type should be filled here,
    ; as other fields might not be parseable
    ? header: PacketHeader

    ? raw: RawInfo
    ? datagram_id: uint32

    ? details: {* text => any}

    ; list of leftover triggers here:
    ; ...
}

I'd put the "error type" in the details field as well then @LPardue, feeling that the trigger can act as an "error category" indicator.

rmarx avatar Sep 28 '22 13:09 rmarx

Looking at our (currently 79) drop reasons in MsQuic, I think the following top-level categories would be reasonable:

  • Internal Error (not initialized, OOM, etc.)
  • Rejected (internal limits reached, ddos protection, unwilling to track more paths, duplicate)
  • Unsupported (unknown/unsupported version)
  • Invalid Packet (packet parsing/validation failed)
  • Connection Unknown (CID doesn't match)
  • Decryption Failure (decrypt failed or key unavailable)

nibanks avatar Sep 28 '22 14:09 nibanks

I like Nicks categories. All I might add is a "General" category for anything else, where details can be added or not at operator discretion

LPardue avatar Sep 28 '22 16:09 LPardue