xds-relay

Integration tests are exhibiting flaky behavior

Open jessicayuen opened this issue 4 years ago • 2 comments

Seems to be failing more frequently:

=== RUN   TestXdsClientGetsIncrementalResponsesFromUpstreamServer
2020/05/12 20:31:22 management server listening on 19001
    TestXdsClientGetsIncrementalResponsesFromUpstreamServer: upstream_client_test.go:47: 
        	Error Trace:	upstream_client_test.go:47
        	Error:      	Setup failed: %s
        	Test:       	TestXdsClientGetsIncrementalResponsesFromUpstreamServer
        	Messages:   	rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:19001: connect: connection refused"

And,

=== RUN   TestServerShutdownShouldCloseResponseChannel
2020/05/12 18:43:45 listen tcp :19001: bind: address already in use
FAIL	github.com/envoyproxy/xds-relay/integration	7.554s
FAIL
make: *** [integration-tests] Error 1

jessicayuen, May 12 '20 20:05

As we mentioned during our sync meeting today, these are 2 separate errors:

  1. This is related to https://github.com/envoyproxy/xds-relay/blob/master/integration/upstream_client_test.go#L208-L218. As @LisaLudique pointed out, "connection refused" means that nothing is listening on that port, which can happen if the goroutine that starts the management server hasn't run by the time we try to connect to it. @jyotimahapatra created an issue to configure the gRPC options in that call to the management server, so we could experiment with configuring retries there. Fundamentally, though, the problem is that we have no way to signal that the management server is ready before we try to connect to it.

  2. This is caused by multiple tests trying to start the management server on the same port. We pass a context to the goroutine that starts the management server, and that context is used to shut the server down gracefully. We don't have much insight into how the Go runtime schedules the tests (beyond the fact that tests in the same package run serially), but if I were to guess, at least on Linux we're not leaving enough time for the runtime and the OS to finish the usual workflow of closing the TCP connection. In the e2e tests we wait 1 second between tests; we might adopt a similar approach in the integration test suite (as a side note, 1s is too much).

eapolinario, May 14 '20 19:05

We just hit this on macOS in CI: https://github.com/envoyproxy/xds-relay/pull/87/checks?check_run_id=687179666

eapolinario, May 18 '20 23:05