xds-relay
Implement retries on stream open failures from orchestrator -> origin server
Currently, we just error out. Implement configurable retry and back-off logic to re-attempt to open the stream on failures.
Currently we force-close all downstream connections, forcing downstream clients to issue a retry; we could fold that behavior into this issue as well.
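To make the requested knobs concrete, here is a minimal sketch of what a retry configuration could look like; the struct, field names, and defaults are illustrative assumptions, not part of the actual xds-relay bootstrap config.

```go
package upstream

import "time"

// RetryConfig is a hypothetical shape for upstream retry settings; none of
// these fields exist in xds-relay today.
type RetryConfig struct {
	// MaxAttempts bounds how many times we re-open the stream before giving
	// up and force-closing downstream connections (0 means retry forever).
	MaxAttempts int
	// InitialBackoff is the delay before the first retry.
	InitialBackoff time.Duration
	// MaxBackoff caps the exponential growth of the delay.
	MaxBackoff time.Duration
	// BackoffMultiplier scales the delay after each failed attempt.
	BackoffMultiplier float64
}
```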
Ideally the upstream.Recv() calls in xds-relay would be guided by retry policies set up like this: https://github.com/grpc/grpc-go/blob/master/examples/features/retry/client/main.go#L51. Since Recv calls are idempotent, the retry policy settings should help us with basic error handling.
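For reference, here is a minimal sketch of what that framework-level setup could look like when dialing the upstream management server, assuming a service config modeled on the linked grpc-go example; the address, service name, and policy values are placeholders rather than xds-relay's actual dial options.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
)

// Service config modeled on the grpc-go retry example. The ADS service name
// and policy values are placeholders, not settings xds-relay ships with.
const retryPolicy = `{
  "methodConfig": [{
    "name": [{"service": "envoy.service.discovery.v2.AggregatedDiscoveryService"}],
    "waitForReady": true,
    "retryPolicy": {
      "MaxAttempts": 4,
      "InitialBackoff": ".5s",
      "MaxBackoff": "10s",
      "BackoffMultiplier": 2.0,
      "RetryableStatusCodes": ["UNAVAILABLE"]
    }
  }]
}`

func main() {
	conn, err := grpc.Dial(
		"127.0.0.1:18000", // placeholder upstream address
		grpc.WithInsecure(),
		grpc.WithDefaultServiceConfig(retryPolicy),
	)
	if err != nil {
		log.Fatalf("failed to dial upstream: %v", err)
	}
	defer conn.Close()
}
```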
I spent some time clarifying how (and whether) we could rely on framework-level retries to handle this. After talking to one of the gRPC maintainers on the official Gitter channel, it turns out that the ability of bidi gRPC streams to use transparent retries is very limited.
Based on this, we have to implement retries at the application level.
I had a change in the works to handle retries (with exponential backoff) in the call to CreateWatch in the orchestrator, but I hadn't noticed that the bigger problem is actually failures in WatchUpstream. Can we, as suggested in https://github.com/envoyproxy/xds-relay/pull/53/files#r414893043, close all downstreams and restart the process? If so, how?
That's a great discussion. Even the grpc-middleware retry interceptor does not allow retries on client streams.
From the upstream client's perspective, an error that could not be transparently handled (here, here) always causes the stream to be cancelled. The application-level retry probably belongs here.
IMO there's a catch to application-level exponential backoff for RPC failures. A stream of incoming messages will cause stream-level flow control to kick in, and the gRPC send method will become blocking. We will also need a way to handle ordering. Given the complexity of retrying in place, my initial thought was to abandon the stream and open a new one on failures (subject to backoff), since stream creation is lightweight.
When I started investigating this I toyed with the idea of defining our own interceptor to handle retries, but abandoned that pursuit in favor of framework-level retries. Maybe it's time to seriously consider that again? This way we wouldn't have to worry about message order, right, @jyotimahapatra?
We should explore that option. I'm not sure whether we would have to build an application-level buffer to save messages, or whether an interceptor introduces any complexity around flow control; we should dig in to find out. We also need to understand what kinds of errors are retryable: connection errors, GOAWAYs, etc. warrant creating a whole new connection/stream, and I don't think we cover that today. What are your thoughts on the situation where retries are exhausted? These questions will probably lead us to add more basic connection/stream re-creation (subject to backoff), which will be necessary beyond RPC retries.
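As a starting point for the question of which errors are retryable, here is a sketch that classifies stream errors by gRPC status code; the helper name and the exact set of codes are assumptions to be refined, not an established policy.

```go
package upstream

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isRetryable is a hypothetical helper: it treats transport-level failures as
// candidates for re-opening the stream and everything else as fatal. Which
// codes belong here is exactly the open question in this thread.
func isRetryable(err error) bool {
	switch status.Code(err) {
	case codes.Unavailable, codes.ResourceExhausted, codes.Aborted:
		return true
	default:
		return false
	}
}
```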
From my reading of grpc-go, connection failures are handled transparently by gRPC. We should be OK with just implementing a retry interceptor for the bidi RPC.
I can confirm gRPC tries really hard to handle transient failures: it attempts to re-establish the connection with an exponential backoff retry policy. I simulated this by cranking up the gRPC logs (setting the GRPC_GO_LOG_VERBOSITY_LEVEL and GRPC_GO_LOG_SEVERITY_LEVEL environment variables to 99 and info, respectively) while killing the upstream management server the xds-relay instance was connected to and then bringing it up again. In the logs we can see this:
{"level":"info","ts":1593607979.965913,"logger":"upstream_client","msg":"Initiating upstream connection","address":"127.0.0.1:18000"}
INFO: 2020/07/01 05:52:59 parsed scheme: ""
INFO: 2020/07/01 05:52:59 scheme "" not registered, fallback to default scheme
INFO: 2020/07/01 05:52:59 ccResolverWrapper: sending update to cc: {[{127.0.0.1:18000 <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/07/01 05:52:59 ClientConn switching balancer to "pick_first"
INFO: 2020/07/01 05:52:59 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {CONNECTING <nil>}
WARNING: 2020/07/01 05:52:59 grpc: addrConn.createTransport failed to connect to {127.0.0.1:18000 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused". Reconnecting...
INFO: 2020/07/01 05:52:59 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused"}
INFO: 2020/07/01 05:53:00 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {CONNECTING <nil>}
WARNING: 2020/07/01 05:53:00 grpc: addrConn.createTransport failed to connect to {127.0.0.1:18000 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused". Reconnecting...
INFO: 2020/07/01 05:53:00 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused"}
INFO: 2020/07/01 05:53:02 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {CONNECTING <nil>}
WARNING: 2020/07/01 05:53:02 grpc: addrConn.createTransport failed to connect to {127.0.0.1:18000 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused". Reconnecting...
INFO: 2020/07/01 05:53:02 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused"}
INFO: 2020/07/01 05:53:04 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {CONNECTING <nil>}
WARNING: 2020/07/01 05:53:04 grpc: addrConn.createTransport failed to connect to {127.0.0.1:18000 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused". Reconnecting...
INFO: 2020/07/01 05:53:04 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused"}
INFO: 2020/07/01 05:53:06 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {CONNECTING <nil>}
WARNING: 2020/07/01 05:53:06 grpc: addrConn.createTransport failed to connect to {127.0.0.1:18000 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused". Reconnecting...
INFO: 2020/07/01 05:53:06 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused"}
INFO: 2020/07/01 05:53:08 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {CONNECTING <nil>}
WARNING: 2020/07/01 05:53:08 grpc: addrConn.createTransport failed to connect to {127.0.0.1:18000 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused". Reconnecting...
INFO: 2020/07/01 05:53:08 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused"}
INFO: 2020/07/01 05:53:10 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {CONNECTING <nil>}
INFO: 2020/07/01 05:53:10 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {READY <nil>}
{"level":"info","ts":1593607990.425799,"msg":"Initializing server","address":"127.0.0.1:9991"}
{"level":"info","ts":1593607990.4258611,"msg":"Starting admin server","address":"127.0.0.1:6070"}
INFO: 2020/07/01 05:53:22 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {CONNECTING <nil>}
INFO: 2020/07/01 05:53:22 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
WARNING: 2020/07/01 05:53:22 grpc: addrConn.createTransport failed to connect to {127.0.0.1:18000 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused". Reconnecting...
INFO: 2020/07/01 05:53:22 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused"}
INFO: 2020/07/01 05:53:23 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {CONNECTING <nil>}
WARNING: 2020/07/01 05:53:23 grpc: addrConn.createTransport failed to connect to {127.0.0.1:18000 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused". Reconnecting...
INFO: 2020/07/01 05:53:23 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused"}
INFO: 2020/07/01 05:53:25 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {CONNECTING <nil>}
WARNING: 2020/07/01 05:53:25 grpc: addrConn.createTransport failed to connect to {127.0.0.1:18000 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused". Reconnecting...
INFO: 2020/07/01 05:53:25 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused"}
INFO: 2020/07/01 05:53:27 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {CONNECTING <nil>}
WARNING: 2020/07/01 05:53:27 grpc: addrConn.createTransport failed to connect to {127.0.0.1:18000 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused". Reconnecting...
INFO: 2020/07/01 05:53:27 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused"}
INFO: 2020/07/01 05:53:29 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {CONNECTING <nil>}
WARNING: 2020/07/01 05:53:29 grpc: addrConn.createTransport failed to connect to {127.0.0.1:18000 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused". Reconnecting...
INFO: 2020/07/01 05:53:29 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused"}
INFO: 2020/07/01 05:53:30 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {CONNECTING <nil>}
WARNING: 2020/07/01 05:53:30 grpc: addrConn.createTransport failed to connect to {127.0.0.1:18000 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused". Reconnecting...
INFO: 2020/07/01 05:53:30 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused"}
INFO: 2020/07/01 05:53:33 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {CONNECTING <nil>}
WARNING: 2020/07/01 05:53:33 grpc: addrConn.createTransport failed to connect to {127.0.0.1:18000 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused". Reconnecting...
INFO: 2020/07/01 05:53:33 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:18000: connect: connection refused"}
INFO: 2020/07/01 05:53:35 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {CONNECTING <nil>}
INFO: 2020/07/01 05:53:35 pickfirstBalancer: HandleSubConnStateChange: 0xc0006ad490, {READY <nil>}
In the logs above, notice how xds-relay keeps trying to establish a connection to a nonexistent management server, then finally connects to it and resumes initialization, only to attempt to re-establish the connection upon failure (at that point I had killed the management server instance). It finally re-establishes the connection once I bring the management server instance back up.
Retrying on an already opened stream is not possible. I tried the approach of simply retrying in the case of Unavailable errors in calls to RecvMsg in https://github.com/envoyproxy/xds-relay/pull/103, and all subsequent retries failed with the same error. We have to open a new stream and then retry in cases like this. I'm going to update that PR to do that.
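For context, the direction I'm planning looks roughly like the sketch below: on a Recv failure, abandon the stream and open a new one after an exponentially growing delay. The openStream and handleResponse hooks are placeholders, not the actual xds-relay upstream client code.

```go
package upstream

import (
	"context"
	"time"
)

// adsStream is a minimal stand-in for the bidi ADS stream; only Recv matters
// for this sketch.
type adsStream interface {
	Recv() (interface{}, error)
}

// watchWithRetry abandons a failed stream and re-opens it after an
// exponentially growing delay. openStream and handleResponse are hypothetical
// hooks standing in for the real upstream client logic.
func watchWithRetry(
	ctx context.Context,
	openStream func(context.Context) (adsStream, error),
	handleResponse func(interface{}),
) {
	backoff := 500 * time.Millisecond
	const maxBackoff = 30 * time.Second

	for ctx.Err() == nil {
		stream, err := openStream(ctx)
		if err == nil {
			backoff = 500 * time.Millisecond // reset once the stream opens
			for {
				resp, recvErr := stream.Recv()
				if recvErr != nil {
					break // abandon this stream; re-open below after backing off
				}
				handleResponse(resp)
			}
		}
		// Wait before re-opening, doubling the delay up to a cap.
		select {
		case <-ctx.Done():
			return
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```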