Add a default timeout for Connector.Connect
Juju has occasion to change the bind address of the initial Dqlite node upon entering HA.
When this happens, existing connections keep trying to connect, with our logs repeatedly showing output like this:
2023-03-09 08:57:14 WARNING juju.worker.dbaccessor connector.go:77 attempt 2507: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:14 WARNING juju.worker.dbaccessor connector.go:77 attempt 2507: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:15 WARNING juju.worker.dbaccessor connector.go:77 attempt 2508: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:15 WARNING juju.worker.dbaccessor connector.go:77 attempt 2508: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
An existing call to Exec is locked up at this point:
# 0x46a8d4 time.Sleep+0x134 /usr/local/go/src/runtime/time.go:195
# 0x1c14f28 github.com/canonical/go-dqlite/internal/protocol.makeRetryStrategies.func1+0x48 /home/joseph/go/pkg/mod/github.com/canonical/[email protected]/internal/protocol/connector.go:321
# 0x1c12799 github.com/Rican7/retry.shouldAttempt+0x39 /home/joseph/go/pkg/mod/github.com/!rican7/[email protected]/retry.go:32
# 0x1c126f6 github.com/Rican7/retry.Retry+0x96 /home/joseph/go/pkg/mod/github.com/!rican7/[email protected]/retry.go:19
# 0x1c131c4 github.com/canonical/go-dqlite/internal/protocol.(*Connector).Connect+0xa4 /home/joseph/go/pkg/mod/github.com/canonical/[email protected]/internal/protocol/connector.go:74
# 0x1c42628 github.com/canonical/go-dqlite/driver.(*Connector).Connect+0x268 /home/joseph/go/pkg/mod/github.com/canonical/[email protected]/driver/driver.go:261
# 0x1c05d22 database/sql.(*DB).conn+0x762 /usr/local/go/src/database/sql/sql.go:1393
# 0x1c0ce52 database/sql.(*Stmt).connStmt+0x252 /usr/local/go/src/database/sql/sql.go:2724
# 0x1c0c364 database/sql.(*Stmt).ExecContext+0x184 /usr/local/go/src/database/sql/sql.go:2628
# 0x46e7204 database/sql.(*Stmt).Exec+0x44 /usr/local/go/src/database/sql/sql.go:2651
This is problematic given Juju's architecture, because our worker loops can be blocked from actually shutting down.
Particularly bad is that this happens with db.Ping, which we use for health checks.
It looks like the app module doesn't support tweaking the driver parameters that control that retry loop. We can fix that easily enough. You're not seeing the affected connection retry forever, right? With the default parameters it should be capped at 1 second:
https://github.com/canonical/go-dqlite/blob/ab30b33a86583ff5b1c1245a6f1f9b844c1a8f1f/app/app.go#L215
https://github.com/canonical/go-dqlite/blob/ab30b33a86583ff5b1c1245a6f1f9b844c1a8f1f/internal/protocol/connector.go#L50-L52
It might be prudent to wrap all the contexts in the driver with a timeout as well. Unfortunately, we can't provide our own contexts in those locations, so perhaps a context wrapped with WithTimeout of 30 seconds or greater would do. If a commit can't complete within 30 seconds, it's probably never going to succeed and we should give up at that point.
https://github.com/canonical/go-dqlite/blob/fd457b7f08f8c466640e940d4d8fa26e8ee8ad1c/driver/driver.go#L517-L525
I don't mind providing a patch to the driver package for this.
You can do something like this to set a default timeout if not specified by the passed context:
if _, ok := ctx.Deadline(); !ok {
	// Set a default timeout of 30s if the supplied context has no deadline.
	var cancel context.CancelFunc
	ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
}
That way if someone does want a longer timeout you're not forcing a particular policy in the driver.
FWIW LXD sets a default timeout on the transaction rather than each query:
https://github.com/lxc/lxd/blob/master/lxd/db/query/transaction.go#L15-L16
This has been useful while we transition to passing a context into each query too.
It also avoids us accidentally keeping transactions open too long and blocking writes to the DB.
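For illustration, a minimal sketch of that transaction-level default timeout (not LXD's actual helper; the package name, helper name, and 10-second value are placeholders):
package dbutil

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// withTimeoutTx applies a default deadline at the transaction level, so every
// statement run through the returned *sql.Tx inherits the same deadline.
func withTimeoutTx(db *sql.DB, f func(context.Context, *sql.Tx) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return fmt.Errorf("begin transaction: %w", err)
	}
	if err := f(ctx, tx); err != nil {
		_ = tx.Rollback()
		return err
	}
	return tx.Commit()
}
Because database/sql rolls the transaction back once the context's deadline expires, statements run via the *sql.Tx stop blocking at that point as well.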
@tomponline The problem is that dqlite hangs forever when the database is unreachable. If this happens during an Exec or ExecContext then it will block. The problem I'm showing is that this is still a possibility, because the driver does a COMMIT (there are many more such calls) with a context that isn't bound to a timeout, so it can hang.
You can do something like this to set a default timeout if not specified by the passed context:
Except the context isn't available via the *Tx, and while the conn does have a context, it's deprecated, so we shouldn't attempt to use it.
FWIW LXD sets a default timeout on the transaction rather than each query:
We do the same thing in Juju.
https://github.com/juju/juju/blob/09db847e363a757089dcb569158d7ea0dc3a3a7a/database/txn/transaction.go#L172-L173
Does cancelling the transaction not end the query too? I've not observed LXD getting stuck on Execs when the transaction is cancelled.
Except the context isn't available via the *Tx, and while the conn does have a context, it's deprecated, so we shouldn't attempt to use it.
Agreed. I suppose if using Exec and not ExecContext one would not expect there to be a default timeout in the driver and indefinite blocking is a possibility (in any driver, not just dqlite).
The reason I added the 10s timeout to BeginTx in LXD was that I found it did cause tx.Exec() to stop if dqlite was blocking for some reason. We did see some indefinite blocking before that, and we don't see it anymore.
Maybe the non-tx-related functions, such as plain Exec, Ping, etc., could have a default timeout added, and functions that accept a context, like ExecContext, BeginTx, etc., could add one if the supplied context doesn't have a deadline?
That way folks won't find themselves blocked indefinitely, but also the driver is never overriding an externally requested timeout period.
If the context has a deadline attached, then the following is enacted. For callers that don't supply a context with a deadline, we could set one there also (sketched below).
https://github.com/canonical/go-dqlite/blob/fd457b7f08f8c466640e940d4d8fa26e8ee8ad1c/internal/protocol/protocol.go#L58-L62
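To make that concrete, a sketch of the kind of helper such context-accepting methods could use (purely illustrative; ensureDeadline is not an existing go-dqlite function, and the package name is a placeholder):
package dqlitedriver

import (
	"context"
	"time"
)

// ensureDeadline adds a default deadline only when the caller's context does
// not already carry one, so an externally requested timeout is never overridden.
func ensureDeadline(ctx context.Context, d time.Duration) (context.Context, context.CancelFunc) {
	if _, ok := ctx.Deadline(); ok {
		return ctx, func() {}
	}
	return context.WithTimeout(ctx, d)
}
A method like ExecContext or BeginTx could then call ctx, cancel := ensureDeadline(ctx, 30*time.Second); defer cancel() before driving the protocol, leaving any caller-supplied deadline untouched.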
That way folks won't find themselves blocked indefinitely, but also the driver is never overriding an externally requested timeout period.
Sounds like a plan to me.
If the context has a deadline attached, then the following is enacted. For callers that don't supply a context with a deadline, we could set one there also.
https://github.com/canonical/go-dqlite/blob/fd457b7f08f8c466640e940d4d8fa26e8ee8ad1c/internal/protocol/protocol.go#L58-L62
Am I correct that the only place where a user can't provide a context is tx.Commit()?
Not sure why the go/sql package has chosen to not support contexts for tx.Commit() (and I don't see any newly introduced extension that changes that), perhaps because that is kind of dangerous and leads to undefined behavior in general (has the tx succeeded or not?). In any case a COMMIT query in dqlite can't hang forever, because at some point (relatively quickly with default values, a few seconds) the leader will step down and stop trying to commit if it can't reach a majority.
@SimonRichardson are you positively seeing a tx.Commit() call that hangs?
The logs attached in the very first message of this issue seem to show a retry loop for getting a driver.Conn, which I would say is caused by Connector.Connect() not being passed a ctx with a deadline. I didn't check Go's code, but I'd expect the ctx passed to Connector.Connect() to be the same as whatever the user has passed to tx.ExecContext and friends.
If tx.Exec() is being used instead of tx.ExecContext(), that might explain it, and we may want to add a default timeout to our Connector.Connect() implementation, as also suggested by the Go documentation for Connector.Connect:
// The provided context.Context is for dialing purposes only
// (see net.DialContext) and should not be stored or used for
// other purposes. A default timeout should still be used
// when dialing as a connection pool may call Connect
// asynchronously to any query.
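A minimal sketch of that advice applied to a driver.Connector implementation (illustrative only; the type, its dial field, and the 10-second value are not go-dqlite's actual code):
package dqlitedriver

import (
	"context"
	"database/sql/driver"
	"time"
)

// Connector is a hypothetical driver.Connector used to illustrate the advice
// quoted above: apply a default dial timeout even when the connection pool
// calls Connect with a context that has no deadline.
type Connector struct {
	dial func(ctx context.Context) (driver.Conn, error) // illustrative dial hook
}

// Connect bounds the dial with a default timeout when the supplied context
// has no deadline, so an unreachable server cannot block the pool forever.
func (c *Connector) Connect(ctx context.Context) (driver.Conn, error) {
	if _, ok := ctx.Deadline(); !ok {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, 10*time.Second)
		defer cancel()
	}
	return c.dial(ctx)
}

// Driver is required by driver.Connector; a real implementation would return
// the parent driver rather than nil.
func (c *Connector) Driver() driver.Driver { return nil }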
I'm not totally convinced that a default timeout should be added to protocol.Call() though, as proposed here, since, as said, tx.Commit() should not hang, and everything else can be controlled by the user, unless I'm missing something.
Maybe the non-tx-related functions, such as plain Exec, Ping, etc., could have a default timeout added, and functions that accept a context, like ExecContext, BeginTx, etc., could add one if the supplied context doesn't have a deadline? That way folks won't find themselves blocked indefinitely, but also the driver is never overriding an externally requested timeout period.
As mentioned, as far as I understand it, the only place where we would need to add a default timeout to achieve the behavior above is protocol.Connector.Connect().
are you positively seeing a tx.Commit() call that hangs?
Not 100% positive yet.
FWIW I kind of see why you wouldn't want tx.Commit() to get a context parameter. Essentially it seems that what go/sql supports is effectively sugar around transaction rollback (you just cancel the context you pass to the various tx methods), which is a normal database operation. On the other hand, once a COMMIT has been issued there's usually no way to roll back, and it either succeeds or fails.
If Juju's main use case for cancellation/rollback is graceful shutdown (e.g. of a worker), then I'd say that a top-level context that you can cancel would be the way to go (as opposed to timeouts), and I presume that's already the case in Juju's code. Assuming that's true, as long as all juju's database calls get passed a context linked to that top-level one, then all transactions should already immediately abort/rollback as soon as the top-level context is cancelled (except for tx.Commit() which will finish in at most a few seconds, possibly returning an error due to cluster unavailability). Relying on timeouts for graceful shutdown would actually be sub-optimal. If juju's database calls are not already all making use of context.Context, it'd be nice to change that, in order to support a more streamlined graceful shutdown sequence.
and I presume that's already the case in Juju's code
This is correct.
If juju's database calls are not already all making use of context.Context, it'd be nice to change that
We're using context.
Ok, so assuming that the only hang observed is the one originally pasted at the very beginning of this issue (the connect loop of protocol.Connector.Connect()), we'd need to investigate why the cancellation of Juju's top-level context is not propagated down to that loop.
Let me get back to you on this. I've not been able to replicate it.
First thing to check on the go-dqlite side would be that cancelling a context does actually, if indirectly, interrupt that loop. Roughly something like the following (sketched in code below):
- Run some successful queries against an in-memory cluster of dqlite.Node instances.
- Stop the dqlite cluster by calling dqlite.Node.Close() against all nodes, but leave the db object open.
- Spawn a goroutine that calls db.ExecContext(ctx, "...").
- Cancel the ctx and check that it stops the connect loop that the go/sql code should have triggered by calling protocol.Connector.Connect() in order to get a working connection.
- The goroutine should terminate immediately.
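A rough sketch of that check, assuming db is a *sql.DB backed by the dqlite driver and the cluster has already been stopped as described (the helper name, table name, and timings are placeholders):
package dqlitetest

import (
	"context"
	"database/sql"
	"testing"
	"time"
)

// checkCancelUnblocksConnect verifies that cancelling the context unblocks
// the ExecContext call (and therefore the connect retry loop behind it).
func checkCancelUnblocksConnect(t *testing.T, db *sql.DB) {
	ctx, cancel := context.WithCancel(context.Background())

	done := make(chan error, 1)
	go func() {
		// This should block in the connect retry loop, since no node
		// is reachable any more.
		_, err := db.ExecContext(ctx, "INSERT INTO test (n) VALUES (1)")
		done <- err
	}()

	// Give the goroutine a moment to reach the connect loop, then cancel.
	time.Sleep(100 * time.Millisecond)
	cancel()

	select {
	case err := <-done:
		if err == nil {
			t.Fatal("expected an error once the context was cancelled")
		}
	case <-time.After(5 * time.Second):
		t.Fatal("ExecContext did not return after the context was cancelled")
	}
}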
Sorry, I'm just now starting to look at this. For my understanding:
# 0x46e7204 database/sql.(*Stmt).Exec+0x44 /usr/local/go/src/database/sql/sql.go:2651
Isn't it expected that you hang forever if you don't use ExecContext? Can't you replace existing Exec calls with ExecContext? Same with Ping, can't you use PingContext?
I agree though that it's prudent to add a default context with a sane timeout to protocol.Connector.Connect.
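For example, a bounded health check along those lines might look like this (illustrative; the package name and 5-second value are placeholders):
package dbutil

import (
	"context"
	"database/sql"
	"time"
)

// healthCheck replaces a bare db.Ping() with PingContext bound to a deadline,
// so the health check cannot block indefinitely if no node is reachable.
func healthCheck(db *sql.DB) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return db.PingContext(ctx)
}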
Yes, we are using <thing>Context exclusively now.
I leave it to the discretion of the Dqlite team as to the specific mitigation, but I feel this behaviour would violate the assumptions of consumers.