pgx icon indicating copy to clipboard operation
pgx copied to clipboard

unexpected EOF error

Open picollomartin opened this issue 2 years ago • 12 comments

Hi 👋

First of all thank you for the amazing work here 🙌

Recently we migrate from pq driver to pgx and faced one issue that I would like to share with you

We started some random error in our logs that matchs with io.ErrUnexpectedEOF which we track down and it comes from here: https://github.com/jackc/pgx/blob/e58381ac9409172a674f76c1801c3c606ac41b42/pgproto3/frontend.go#L210-L215

This previously was hidden from us because pq does this:

if v == io.EOF || v.Error() == "remote error: handshake failure" {
	*err = driver.ErrBadConn
} else {
	*err = v
}

That is retried by database/sql (getting/opening another connection):

for i := 0; i < maxBadConnRetries; i++ {
  stmt, err = db.prepare(ctx, query, cachedOrNewConn)
  isBadConn = errors.Is(err, driver.ErrBadConn)
  if !isBadConn {
	  break
  }
}
if isBadConn {
   return db.prepare(ctx, query, alwaysNewConn)
}

Makes sense wrapping the io.EOF into a driver.ErrBadConn ? This has the advantage that is out-of-the-box retried by the standard database/sql but maybe i'm missing another use case for that error

Another approach could be some migration guide (from pq to pgx) with a warning about this error

picollomartin avatar Dec 16 '22 18:12 picollomartin

Makes sense wrapping the io.EOF into a driver.ErrBadConn ? This has the advantage that is out-of-the-box retried by the standard database/sql but maybe i'm missing another use case for that error

I think that's reasonable. It would need to be done in the stdlib package instead of pgproto3 to avoid adding the database/sql dependency to that package. That would be tedious though. That error could happen nearly anywhere, so every error handling path would need to add the error wrapping / translation.

Another approach could be some migration guide (from pq to pgx) with a warning about this error

I'm sure some people would find a pq to pgx migration guide useful, but it's not something I can personally do. I've never used pq in any significant way.

jackc avatar Dec 17 '22 15:12 jackc

I think that's reasonable. It would need to be done in the stdlib package instead of pgproto3 to avoid adding the database/sql dependency to that package. That would be tedious though. That error could happen nearly anywhere, so every error handling path would need to add the error wrapping / translation.

Ok, I'm going to try this and make it as painless as possible 🙃

I'm sure some people would find a pq to pgx migration guide useful, but it's not something I can personally do. I've never used pq in any significant way.

Without this error, if you're using database/sql is pretty straightforward

picollomartin avatar Dec 19 '22 19:12 picollomartin

@picollomartin please give a try to this (untested) patch

0001-proof-of-concept-for-1435.zip

ExecContext and QueryContext already have a SafeToRetry check, so I didn't add the error translation there

drakkan avatar Dec 25 '22 11:12 drakkan

Hey @drakkan ! 👋 Sorry for the delay

Thank you for that patch 🙌

I tested with some local script but got the same error exactly in QueryContext method, SafeToRetry does not include this error because unexpected EOF does not implement the following interface:

// SafeToRetry checks if the err is guaranteed to have occurred before sending any data to the server.
func SafeToRetry(err error) bool {
	if e, ok := err.(interface{ SafeToRetry() bool }); ok {
		return e.SafeToRetry()
	}
	return false
}

Like @jackc mentioned earlier i think that this translation of error should be nearly everywhere including QueryContext/ExecContext

picollomartin avatar Dec 28 '22 02:12 picollomartin

Hello @picollomartin

ok, please give a try to this variant

0001-proof-of-concept-for-1435.zip

if you confirm it works I'll open a proper PR, thanks

drakkan avatar Dec 28 '22 08:12 drakkan

Not sure if I have to translate also this error:

https://github.com/jackc/pgx/blob/74f9b9f0a483f95513c621364f2c3912181ee360/stdlib/sql.go#L730

and maybe BeforeConnect

https://github.com/jackc/pgx/blob/74f9b9f0a483f95513c621364f2c3912181ee360/stdlib/sql.go#L200

and AfterConnect

https://github.com/jackc/pgx/blob/74f9b9f0a483f95513c621364f2c3912181ee360/stdlib/sql.go#L208

should not be translated (the current patch checks also these errors)

drakkan avatar Dec 28 '22 08:12 drakkan

@jackc based on the lib/pq code here, it seems we have to store ErrBadConn and add a logic like this. What do you think about?

drakkan avatar Dec 28 '22 19:12 drakkan

@drakkan Not really sure about the context. It'd be easier to review as a draft PR.

jackc avatar Dec 28 '22 19:12 jackc

@drakkan Sorry, my original response was inaccurate / incomplete. Your mention of lib/pq just reminded me of why pgx is very careful about returning ErrBadConn.

From the database/sql/driver package docs:

ErrBadConn should be returned by a driver to signal to the sql package that a driver.Conn is in a bad state (such as the server having earlier closed the connection) and the sql package should retry on a new connection.

To prevent duplicate operations, ErrBadConn should NOT be returned if there's a possibility that the database server might have performed the operation. Even if the server sends back an error, you shouldn't return ErrBadConn.

lib/pq violates the spec and risks duplicate operations. See https://github.com/lib/pq/issues/939.

I think stdlib is already returning ErrBadConn where everywhere it can, but if there are any other sites you think should be then we need to be very sure that it only happens when there is no change the server performed the operation.

jackc avatar Dec 28 '22 23:12 jackc

@drakkan your last patch works for me locally

Taking you last comment @jackc i saw the issue of pq and is a pretty risky approach, idk if io.ErrUnexpectedEOF is in the same group of possibly of duplicating operations if retried, if that's the case i agree in not to do any risky translation like pq does and maybe just clear that up in docs (network errors (like io.ErrUnexpectedEOF) are risky to retry because can lead to duplication of some operations, if you see this error frequently check for network issues between the driver and the database (e.g client timeout in a proxy))

That maybe helps to close other issues related to this error like #966, #1238

picollomartin avatar Dec 29 '22 02:12 picollomartin

@picollomartin thanks for your tests.

@jackc you are right. my patch can trigger duplicate inserts, this test case

func TestNetworErrors(t *testing.T) {
	testWithAllQueryExecModes(t, func(t *testing.T, db *sql.DB) {
		_, err := db.Exec("drop table if exists t")
		require.NoError(t, err)

		_, err = db.Exec("create table t(id serial)")
		require.NoError(t, err)

		_, err = db.Exec(`
			begin;
			insert into t(id) values (default);
			commit;
			select pg_terminate_backend(pg_backend_pid()) where 1 = (select count(*) from t);
		`)
		require.Error(t, err)

		var numInserts int
		err = db.QueryRow("select count(*) from t").Scan(&numInserts)
		require.NoError(t, err)
		require.Equal(t, 1, numInserts)
	})
}

works as expected but if I change transalteError to also catch SQLSTATE 57P01 I get a duplicate insert. Thanks for your work!

drakkan avatar Dec 29 '22 07:12 drakkan

Can y'all please link the patch in this issue? I see it mentioned a few times, but no link to it so we can follow when this fix is or has been merged...

sicoyle avatar Apr 05 '24 15:04 sicoyle