Dial i/o timeouts while connecting

Open adamconnelly opened this issue 3 years ago • 3 comments

I'm periodically seeing connection failures when trying to connect to an Aurora Serverless instance. Most connection attempts are successful, but occasionally we get errors, making me think that there's some underlying network / database issue causing the problems. The error messages look something like this:

failed to connect to `host=xyz user=xyz database=xyz`: dial error (dial tcp x.x.x.x:5432: i/o timeout)

We're using v1.7.0 of pgconn and v4.9.0 of pgx. I know these aren't the latest versions, so we can definitely look at updating if there's anything that's likely to help with this issue.

The connection attempt times out after 60 seconds, which makes sense because of this line, and the error message is coming from here.

While investigating this, I noticed there's connection retry logic in the Go sql package, for example here. It automatically retries connecting if driver.ErrBadConn is returned. I guess what I'm wondering is would it make sense to return ErrBadConn when a dial timeout happens? Obviously this doesn't solve the underlying issue, but it might mitigate the problem assuming it's transient.

I'm happy to experiment with this, but I just wanted to ask first since I'm not mega familiar with Go SQL drivers.

Thanks in advance!

adamconnelly avatar Apr 22 '22 13:04 adamconnelly

> Most connection attempts are successful, but occasionally we get errors, making me think that there's some underlying network / database issue causing the problems.

That error message does seem to indicate a network or server issue.

> We're using v1.7.0 of pgconn and v4.9.0 of pgx. I know these aren't the latest versions, so we can definitely look at updating if there's anything that's likely to help with this issue.

That is pretty old, but off the top of my head I don't recall any changes that would affect this.

> While investigating this, I noticed there's connection retry logic in the Go sql package, for example here. It automatically retries connecting if driver.ErrBadConn is returned. I guess what I'm wondering is would it make sense to return ErrBadConn when a dial timeout happens? Obviously this doesn't solve the underlying issue, but it might mitigate the problem assuming it's transient.

I'm not absolutely sure but I think the ErrBadConn logic only applies when you have an existing connection. I don't see how that would be effective in the dialing process.

jackc avatar Apr 22 '22 20:04 jackc

Thanks for the reply - I'll post an update if I find out anything interesting through testing.

adamconnelly avatar Apr 26 '22 08:04 adamconnelly

@jackc I did some more investigation, and I'm pretty certain that the ErrBadConn approach would work. It certainly resulted in retries for dial failures.

However, in the end I've actually taken the approach of replacing the DialFunc using something like this:

wrappedDial := config.DialFunc
config.DialFunc = func(ctx context.Context, network, addr string) (net.Conn, error) {
	var conn net.Conn
	var err error
	for i := 0; i < pgMaxDialAttempts; i++ {
		ok := func() bool {
			// We're manually enforcing a dial timeout here rather than relying on connect_timeout
			// in the connection string because the connect_timeout applies to the full connection
			// process, meaning that any dial retries would fail because the context has already expired.
			ctx, cancel := context.WithTimeout(ctx, time.Second*5)
			defer cancel()
			conn, err = wrappedDial(ctx, network, addr)

			return err == nil
		}()

		if ok {
			break
		}
	}

	return conn, err
}

That seems to have worked (in that an initial dial times out after 5 seconds, but a subsequent dial succeeds), although unfortunately since implementing it I've only seen one example of the failure, so it's difficult to be certain.

I'm happy to close the issue if you want since I've got a workaround now, but just figured the info could be useful.

adamconnelly avatar May 13 '22 16:05 adamconnelly