Flaky in in_process_relay_test: "connection refused"
I observed the following flake:
$ bazel test //src/go/tests/relay:in_process_relay_test --runs_per_test=100 --test_output=errors
[...]
2025/03/20 15:22:04 ERROR BackendRequest ID=server_name:6ec6703aa3b52e4291e4ae3fbfa87bbc Message="Backend request failed with error: Get \"http://127.0.0.1:34391/\": dial tcp 127.0.0.1:34391: connect: connection refused"
[...]
--- FAIL: TestHttpErrorPropagation (0.04s)
--- FAIL: TestHttpErrorPropagation/Propagate_http.StatusHTTPVersionNotSupported (0.00s)
in_process_relay_test.go:195: Server responeded with an unexpected status code.
Expected: 505
Observed: 500
[...]
//src/go/tests/relay:in_process_relay_test FAILED in 3 out of 100 in 1.4s
I'll add flaky=true to mitigate this.
Here are my investigations
https://github.com/googlecloudrobotics/core/blob/acee658b211af5a592ec9eefb7f9de0e603eb99f/src/go/tests/relay/in_process_relay_test.go#L105-L107
It is possible for the request to come in before the backend is ready.
https://github.com/googlecloudrobotics/core/blob/acee658b211af5a592ec9eefb7f9de0e603eb99f/src/go/tests/relay/in_process_relay_test.go#L27-L39
The tests listens on port 0, letting the kernel pick a free port, immediately close the listener, then use the same port to start the server. But closing the listener does not immediately free up the port, it is possible for the address to still be bounded when starting the server.
Some ideas to fix:
- Use unix sockets, but I don't think the current code supports it.
- Allow to the relay server with a bounded listener, this is not currently supported as well.
- Retry until success, does not require changes in the code, but is more hacky and makes the test slower.
Normally this is solved by SO_REUSEADDR and some people wrote https://pkg.go.dev/github.com/portmapping/go-reuse for that
Can't we simplify pickUnusedPortOrDie()´ to skip ResolveTCPAddr? The docs say If the Port field of laddr is 0, a port number is automatically chosen.`.
func pickUnusedPortOrDie() int {
if list, err := net.Listen("tcp", "localhost:0"); err == nil {
defer list.Close()
return list.Addr().(*net.TCPAddr).Port
}
glog.Fatal("Failed to pick a free TCP port.")
return 0
}
I'll give this a try.
I think ResolveTCPAddr is just being used to map localhost to 127.0.0.1 there. I think if we do net.Listen() in the test process and then pass the listener into srv.Serve() we could avoid the need to close and reopen the port (which also races against someone else deciding to use the port) - but maybe there's a good reason we didn't do that in the first place?
Okay, the issue is that golang is not supporting SO_REUSEPORT. There are a ton of freeport libs that all one way or the other suffer from the same issue:
- https://github.com/monogon/monogon/blob/main/osbase/test/freeport/freeport.go
- https://github.com/projectdiscovery/freeport/blob/main/freeport.go
- https://github.com/ansidev/freeport/blob/main/freeport.go
The only workaround I could find would be https://github.com/portmapping/go-reuse/tree/master or directly using the approach in https://gist.github.com/joliver/4ccd58605e07e8edf71904b172d95513 (but then we need golang.org/x/sys/unix).