xds-relay
Integration tests are exhibiting flaky behavior
Seems to be failing more frequently:
=== RUN TestXdsClientGetsIncrementalResponsesFromUpstreamServer
2020/05/12 20:31:22 management server listening on 19001
TestXdsClientGetsIncrementalResponsesFromUpstreamServer: upstream_client_test.go:47:
Error Trace: upstream_client_test.go:47
Error: Setup failed: %s
Test: TestXdsClientGetsIncrementalResponsesFromUpstreamServer
Messages: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:19001: connect: connection refused"
And,
=== RUN TestServerShutdownShouldCloseResponseChannel
2020/05/12 18:43:45 listen tcp :19001: bind: address already in use
FAIL github.com/envoyproxy/xds-relay/integration 7.554s
FAIL
make: *** [integration-tests] Error 1
As we mentioned during our sync meeting today, these are 2 separate errors:
- This is related to https://github.com/envoyproxy/xds-relay/blob/master/integration/upstream_client_test.go#L208-L218. As @LisaLudique pointed out, "connection refused" means that nothing is listening on that port, which can happen if the goroutine that starts the management server hasn't run by the time we try to connect to it. @jyotimahapatra created an issue to configure the gRPC options in that call to the management server, so we could experiment with configuring retries there. Fundamentally, though, the problem is that we have no way to signal that the management server is ready before we try to connect to it.
- This is caused by multiple tests trying to start the management server on the same port. We pass a context into the goroutine that starts the management server, which is used to gracefully shut down the server. We don't have much insight into how the Go runtime schedules the tests (besides the fact that tests in the same package run serially), but if I were to guess, at least on Linux we're not leaving enough time for the runtime and the OS to go through the usual workflow of closing the TCP connection. In the e2e tests we wait 1 second between tests; we might adopt a similar approach in the integration test suite (as a side note, 1s is too much).
We just hit this on macOS in CI: https://github.com/envoyproxy/xds-relay/pull/87/checks?check_run_id=687179666