core icon indicating copy to clipboard operation
core copied to clipboard

Flaky in in_process_relay_test: "connection refused"

Open drigz opened this issue 9 months ago • 5 comments

I observed the following flake:

$ bazel test //src/go/tests/relay:in_process_relay_test --runs_per_test=100 --test_output=errors
[...]
2025/03/20 15:22:04 ERROR BackendRequest ID=server_name:6ec6703aa3b52e4291e4ae3fbfa87bbc Message="Backend request failed with error: Get \"http://127.0.0.1:34391/\": dial tcp 127.0.0.1:34391: connect: connection refused"
[...]
--- FAIL: TestHttpErrorPropagation (0.04s)
    --- FAIL: TestHttpErrorPropagation/Propagate_http.StatusHTTPVersionNotSupported (0.00s)
        in_process_relay_test.go:195: Server responeded with an unexpected status code.
                Expected: 505
                Observed: 500
[...]
//src/go/tests/relay:in_process_relay_test                               FAILED in 3 out of 100 in 1.4s

I'll add flaky=true to mitigate this.

drigz avatar Mar 20 '25 15:03 drigz

Here are my investigations

https://github.com/googlecloudrobotics/core/blob/acee658b211af5a592ec9eefb7f9de0e603eb99f/src/go/tests/relay/in_process_relay_test.go#L105-L107

It is possible for the request to come in before the backend is ready.

https://github.com/googlecloudrobotics/core/blob/acee658b211af5a592ec9eefb7f9de0e603eb99f/src/go/tests/relay/in_process_relay_test.go#L27-L39

The tests listens on port 0, letting the kernel pick a free port, immediately close the listener, then use the same port to start the server. But closing the listener does not immediately free up the port, it is possible for the address to still be bounded when starting the server.

Some ideas to fix:

  • Use unix sockets, but I don't think the current code supports it.
  • Allow to the relay server with a bounded listener, this is not currently supported as well.
  • Retry until success, does not require changes in the code, but is more hacky and makes the test slower.

koonpeng avatar Mar 25 '25 05:03 koonpeng

Normally this is solved by SO_REUSEADDR and some people wrote https://pkg.go.dev/github.com/portmapping/go-reuse for that

ensonic avatar Oct 20 '25 14:10 ensonic

Can't we simplify pickUnusedPortOrDie()´ to skip ResolveTCPAddr? The docs say If the Port field of laddr is 0, a port number is automatically chosen.`.

func pickUnusedPortOrDie() int { 
	if list, err := net.Listen("tcp", "localhost:0"); err == nil { 
		defer list.Close() 
		return list.Addr().(*net.TCPAddr).Port 
	} 
 	glog.Fatal("Failed to pick a free TCP port.") 
 	return 0 
} 

I'll give this a try.

ensonic avatar Oct 20 '25 14:10 ensonic

I think ResolveTCPAddr is just being used to map localhost to 127.0.0.1 there. I think if we do net.Listen() in the test process and then pass the listener into srv.Serve() we could avoid the need to close and reopen the port (which also races against someone else deciding to use the port) - but maybe there's a good reason we didn't do that in the first place?

drigz avatar Oct 20 '25 14:10 drigz

Okay, the issue is that golang is not supporting SO_REUSEPORT. There are a ton of freeport libs that all one way or the other suffer from the same issue:

  • https://github.com/monogon/monogon/blob/main/osbase/test/freeport/freeport.go
  • https://github.com/projectdiscovery/freeport/blob/main/freeport.go
  • https://github.com/ansidev/freeport/blob/main/freeport.go

The only workaround I could find would be https://github.com/portmapping/go-reuse/tree/master or directly using the approach in https://gist.github.com/joliver/4ccd58605e07e8edf71904b172d95513 (but then we need golang.org/x/sys/unix).

ensonic avatar Oct 21 '25 07:10 ensonic