
pr-451 with additional tests and some documentation editing

Open jvshahid opened this issue 7 years ago • 11 comments

Description

This is another attempt to get the changes in #451 merged. We have preserved the author of the original commit, made the suggested documentation changes, and added a test case. I will attempt to explain what this change is trying to fix:

We have a MariaDB cluster with a proxy routing traffic to one node in the cluster. During rolling updates, the active node will send an ER_CONNECTION_CLOSED error packet and close the connection. This causes the driver to return ErrPktSync to the client. Because the driver does not return ErrBadConn, Go will neither retry the query nor close the connection. This pushes the responsibility of retrying onto the client, which is acceptable in principle. However, if there are a few hundred connections open (in our case 300), any retry logic with a reasonable number of retries (in our case 3) isn't sufficient: Go will keep picking other dead connections from the pool, and the client will keep getting ErrPktSync.

I'm happy to talk more about different solutions to the problem we are having.

Checklist

  • [x] Code compiles correctly
  • [x] Created tests which fail without the change (if possible)
  • [x] All tests passing
  • [ ] Extended the README / documentation, if necessary
  • [x] Added myself / the copyright holder to the AUTHORS file

jvshahid avatar Jun 30 '17 16:06 jvshahid

I'm -1 to retry. It's not safe.

We have a MariaDB cluster with a proxy routing traffic to one node in the cluster. During rolling updates,

DB.SetConnMaxLifetime() solves many problems, including this one. For example:

  • Call DB.SetConnMaxLifetime(time.Second * 10).
  • Stop proxying new connections to the retiring DB.
  • Wait 20 seconds. (If you have very long transactions, you should wait 10 sec + the maximum transaction duration.)
  • Check that all connections to the retiring DB are closed.
  • Shut down the retiring DB.

In this way, you can safely perform rolling updates of the DBs behind your proxy.

methane avatar Jul 01 '17 05:07 methane

I'm -1 to retry. It's not safe.

Can you elaborate on why you think it is not safe, and define exactly which action you consider unsafe?

/cc @arnehormann for context on #451

jvshahid avatar Jul 01 '17 18:07 jvshahid

See #302.

methane avatar Jul 01 '17 23:07 methane

Or does the MariaDB documentation guarantee that an already-sent query has no effect when ER_CONNECTION_KILLED is returned?

There are many special cases (e.g. queries which aren't transactional, multi-statement queries, stored procedures, or anything else I don't know about yet).

methane avatar Jul 02 '17 00:07 methane

One example I can think of is a query killed during a filesort (I believe MariaDB supports killing at that stage):

UPDATE t1 SET x=x+1 WHERE id=42;
SELECT * FROM t2 ORDER BY c3;  /* can be killed while filesort */

Even if this case is safe to retry, I think we need a documented guarantee that it is really safe to retry implicitly.

methane avatar Jul 02 '17 00:07 methane

@methane I had a chat with salle on the MariaDB IRC channel, and they had more examples of why a retry may not be safe. Some of the examples are:

  1. Session variables and/or user variables are lost on reconnect but can affect query results.
  2. As you mentioned before, depending on the engine, some DML may not be transactional.

Does it make sense to make this an option in the connection string? In our use case the queries are idempotent and won't cause issues on retry.

jvshahid avatar Jul 03 '17 14:07 jvshahid

Does it make sense to make this an option in the connection string? In our use case the queries are idempotent and won't cause issues on retry.

I don't want to add options for only one specific use case.

If your queries are idempotent, why not retry at the application layer? The application knows the query is idempotent, but the driver doesn't.

Additionally, I recommend DB.SetConnMaxLifetime(time.Second * 10) in general. Doesn't it solve your case?

methane avatar Jul 08 '17 00:07 methane

If your queries are idempotent, why not retry at the application layer? The application knows the query is idempotent, but the driver doesn't.

As I explained previously, retrying the query on the client side is not sufficient. We have a pool of 300 connections. Retrying N times (where N < 300) will not work, since Go will pick another connection from the pool and the query will fail with ErrPktSync again. The reason ErrBadConn fixes the issue is that Go will only try twice to run the query, and if both attempts fail with ErrBadConn a new connection will be established. This has the effect of eventually closing all connections in the old pool while keeping the queries from failing.

Additionally, I recommend DB.SetConnMaxLifetime(time.Second * 10) in general. Doesn't it solve your case?

Can you elaborate on how we can use this function to achieve smooth query execution after a server restart?

jvshahid avatar Jul 10 '17 14:07 jvshahid

If you have a short enough MaxLifetime and a long enough sleep between retries, there is no need for 300 retries. Or you can call DB.SetMaxIdleConns(0) before retrying.

Otherwise, you end up needing up to 300 retries.

methane avatar Jul 10 '17 15:07 methane

Can you elaborate on how we can use this function to achieve smooth query execution after a server restart?

It closes all idle connections that were created more than 10 seconds ago. So if you have 300 connections, none of them will be reused after 10 seconds, and there is no need to send a query just to receive an error.

If the application and the MySQL server are on the same LAN (~5 ms latency), even 1 second is efficient enough.

methane avatar Jul 10 '17 15:07 methane

As was already illustrated, handling this on the driver side with ErrBadConn is not safe and will not be added.

The question is what the alternatives are. The best I can come up with right now is to return a distinct error type from the driver, which can then be used on the client side to react appropriately, e.g. with an attempt to flush the connection pool (setting db.SetMaxIdleConns(0) temporarily might work).

julienschmidt avatar May 26 '18 08:05 julienschmidt