go-libp2p
quic: require address validation when under load
Address validation makes sure that the client actually has the IP address that it claims to have. By validating the address before sending a lot of data (e.g. the certificate chain) to the client, we avoid being used for reflection attacks. However, this adds one roundtrip to connection setup.
We should come up with some heuristics. An easy method would be counting the number of connection attempts per fixed duration, and only requiring address validation once that count exceeds a threshold.
Adding a bit of context here:
- Opening a QUIC packet and starting a TLS handshake is expensive. One attack on a QUIC stack is to flood the server with ClientHello packets (potentially with spoofed UDP source addresses) to generate a lot of CPU load. To defend against this attack, QUIC introduced the Retry mechanism.
- Address validation is done using the `AcceptToken` callback. It will be useful to look at the `defaultAcceptToken` implementation in quic-go.
- In go-libp2p, this is wired up in https://github.com/libp2p/go-libp2p/blob/884028550c9b4d1d0b1090aaf18e087bff3fa8bf/p2p/transport/quic/transport.go#L46-L49.
- We can figure out the number of QUIC handshakes started by counting how many times `AcceptToken` was called with 1. no token or 2. a token that's not a Retry token.
- We can't count the number of failed handshakes (quic-go will swallow them), but we can count the number of successful (incoming) connections.
- There might be a slight overcounting if clients send multiple Initial packets, or if an Initial packet is duplicated. I don't think we need to worry about that.
- We should require address validation if the fraction of failed handshakes in the last `x` minutes is > 50% (configurable). When implementing this logic, we need to be careful to handle the cold-boot case: a node that just booted up is receiving a bunch of incoming connections, so we have a high incoming connection count, but because the QUIC handshake takes 1 RTT, none of them have completed yet.
Is it a percentage for each address, or only one global percentage to check?
Do we need to save the timestamp of each request so it can be discarded from the calculation after, for example, 5 minutes, or is it better to have fixed ranges (e.g. 16:00-16:05, clear the counter, 16:05-16:10, ...)? Maybe the second solution requires fewer resources, but it's less precise, because at 16:06 you only have data for the last minute.
Is there a reason not to enable the check for all incoming connections? Lower performance due to the interface assertion and method invocation?
> Is it a percentage for each address, or only one global percentage to check?
At this point during the handshake, we haven’t validated the address yet. The UDP source address could be spoofed! So it’s a global percentage (per transport).
> Do we need to save the timestamp of each request so it can be discarded from the calculation after, for example, 5 minutes, or is it better to have fixed ranges (e.g. 16:00-16:05, clear the counter, 16:05-16:10, ...)?
It doesn’t need to be precise. Saving an entry for every request would be an attack vector in itself. At 16:06, we should consider some more data though. One option would be to save the counts for each of the last 5 minutes, and use those to make the decision.
> Is there a reason not to enable the check for all incoming connections? Lower performance due to the interface assertion and method invocation?
Not sure I understand what you mean. Do you mean sending a Retry packet? It adds 1 RTT to the handshake latency, so you want to avoid it in the common case, and only do it when you’re under attack.
I missed a thing: is the callback function called before or after the Retry mechanism? Reading the function more closely, it seems to me that it's used in both cases.
For the implementation, yeah, we could use something like a circular buffer backed by an array, maintaining the needed historical data and overwriting one value every `y` seconds/minutes (with `y` smaller than `x`, the time window). For performance, `y` could be 1/3 or 1/5 of `x`, with a corresponding array length of 3 or 5. And for the cold boot, maybe in the first `x` minutes we fill the array elements (it will complete one array iteration) but don't apply the Retry policy yet.
Do both the minutes and the percentage have to be set by the caller?
> I missed a thing: is the callback function called before or after the Retry mechanism? Reading the function more closely, it seems to me that it's used in both cases.
I admit the API is a bit confusing. It's called every time the server processes a QUIC Initial packet that doesn't belong to any connection. So it is indeed called both before and after the Retry is sent (the first time, the `quic.Token` will be nil; the second time, it won't).
> Do both the minutes and the percentage have to be set by the caller?
We should probably provide reasonable defaults, and then provide a way for the user to overwrite those. If you want, you can ignore the configurability for your first pass, and add it later.
> Do both the minutes and the percentage have to be set by the caller?
>
> We should probably provide reasonable defaults, and then provide a way for the user to overwrite those. If you want, you can ignore the configurability for your first pass, and add it later.
I see that it's used in the `NewTransport` method. Maybe we can add some arguments to this function to set them, or do you plan to set them directly in a config file?
In the default implementation:

```go
if udpAddr, ok := clientAddr.(*net.UDPAddr); ok {
	sourceAddr = udpAddr.IP.String()
} else {
	sourceAddr = clientAddr.String()
}
```
Isn't QUIC implemented on top of UDP? Is this a check for possible future changes?
> I see that it's used in the `NewTransport` method. Maybe we can add some arguments to this function to set them, or do you plan to set them directly in a config file?
We should probably implement the option pattern, like we do in the TCP transport: https://github.com/libp2p/go-libp2p/blob/884028550c9b4d1d0b1090aaf18e087bff3fa8bf/p2p/transport/tcp/tcp.go#L129-L131
Then we can add a `WithRetry(fraction float64, period time.Duration)` option.
> In the default implementation: `if udpAddr, ok := clientAddr.(*net.UDPAddr); ok { sourceAddr = udpAddr.IP.String() } else { sourceAddr = clientAddr.String() }`
>
> Isn't QUIC implemented on top of UDP? Is this a check for possible future changes?
Correct, but we allow the user to pass us a `net.PacketConn` (which is an interface). In principle, this could return a `net.Addr` that's not a `*net.UDPAddr`. In fact, someone just did that: https://github.com/lucas-clemente/quic-go/issues/3445 (totally unrelated to this issue though).
Ok, we made an implementation in a fork. Do we need to test it and, if yes, how? Is there a standard test for a real usage case?
Can you create a PR? I’ll review later today.
> Do we need to test it and, if yes, how?
Yes, we require all of our changes to be tested. Testing this code can seem a bit tricky at first, but this should help:
- On the client, you can learn if a Retry packet was received by using `ReceivedRetry` on a tracer: https://pkg.go.dev/github.com/lucas-clemente/[email protected]/logging#ConnectionTracer
- The timing-dependent code can be tested by introducing a `WithClock()` option, like here: https://github.com/libp2p/go-libp2p/blob/884028550c9b4d1d0b1090aaf18e087bff3fa8bf/p2p/discovery/backoff/backoffcache.go#L85-L91. The clock can then be mocked in tests, like here: https://github.com/libp2p/go-libp2p/blob/master/p2p/discovery/backoff/backoffcache_test.go#L189