Add a default timeout for Connector.Connect
Juju has occasion to change the bind address of the initial Dqlite node upon entering HA.
When this happens, existing connections keep trying to connect, with our logs repeatedly showing output like this:
2023-03-09 08:57:14 WARNING juju.worker.dbaccessor connector.go:77 attempt 2507: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:14 WARNING juju.worker.dbaccessor connector.go:77 attempt 2507: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:15 WARNING juju.worker.dbaccessor connector.go:77 attempt 2508: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:15 WARNING juju.worker.dbaccessor connector.go:77 attempt 2508: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
An existing call to Exec is locked up at this point:
# 0x46a8d4 time.Sleep+0x134 /usr/local/go/src/runtime/time.go:195
# 0x1c14f28 github.com/canonical/go-dqlite/internal/protocol.makeRetryStrategies.func1+0x48 /home/joseph/go/pkg/mod/github.com/canonical/[email protected]/internal/protocol/connector.go:321
# 0x1c12799 github.com/Rican7/retry.shouldAttempt+0x39 /home/joseph/go/pkg/mod/github.com/!rican7/[email protected]/retry.go:32
# 0x1c126f6 github.com/Rican7/retry.Retry+0x96 /home/joseph/go/pkg/mod/github.com/!rican7/[email protected]/retry.go:19
# 0x1c131c4 github.com/canonical/go-dqlite/internal/protocol.(*Connector).Connect+0xa4 /home/joseph/go/pkg/mod/github.com/canonical/[email protected]/internal/protocol/connector.go:74
# 0x1c42628 github.com/canonical/go-dqlite/driver.(*Connector).Connect+0x268 /home/joseph/go/pkg/mod/github.com/canonical/[email protected]/driver/driver.go:261
# 0x1c05d22 database/sql.(*DB).conn+0x762 /usr/local/go/src/database/sql/sql.go:1393
# 0x1c0ce52 database/sql.(*Stmt).connStmt+0x252 /usr/local/go/src/database/sql/sql.go:2724
# 0x1c0c364 database/sql.(*Stmt).ExecContext+0x184 /usr/local/go/src/database/sql/sql.go:2628
# 0x46e7204 database/sql.(*Stmt).Exec+0x44 /usr/local/go/src/database/sql/sql.go:2651
This is problematic given Juju's architecture, because our worker loops can be blocked from actually shutting down.
Particularly bad is that this happens with db.Ping, which we use for health checks.
It looks like the app module doesn't support tweaking the driver parameters that control that retry loop. We can fix that easily enough. You're not seeing the affected connection retry forever, right? With the default parameters it should be capped at 1 second:
https://github.com/canonical/go-dqlite/blob/ab30b33a86583ff5b1c1245a6f1f9b844c1a8f1f/app/app.go#L215
https://github.com/canonical/go-dqlite/blob/ab30b33a86583ff5b1c1245a6f1f9b844c1a8f1f/internal/protocol/connector.go#L50-L52
It might be prudent to wrap all the contexts in the driver with a timeout as well. Unfortunately, we can't provide our own contexts in those locations, so perhaps a context wrapped with WithTimeout of 30 seconds or greater would do. If a commit can't complete within 30 seconds, it's probably never going to succeed and we should give up at that point.
https://github.com/canonical/go-dqlite/blob/fd457b7f08f8c466640e940d4d8fa26e8ee8ad1c/driver/driver.go#L517-L525
I don't mind providing a patch to the driver package for this.
You can do something like this to set a default timeout if not specified by the passed context:
if _, ok := ctx.Deadline(); !ok {
	// Set a default timeout of 30s if the supplied context has no deadline.
	var cancel context.CancelFunc
	ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
}
That way if someone does want a longer timeout you're not forcing a particular policy in the driver.
FWIW LXD sets a default timeout on the transaction rather than each query:
https://github.com/lxc/lxd/blob/master/lxd/db/query/transaction.go#L15-L16
This has been useful while we transition to passing a context into each query too.
It also avoids us accidentally keeping transactions open too long and blocking writes to the DB.
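For illustration, a minimal sketch of that transaction-level default timeout (not LXD's actual helper; the package name, helper name, and 10-second value are placeholders):
package dbutil

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// withTimeoutTx applies a default deadline at the transaction level, so every
// statement run through the returned *sql.Tx inherits the same deadline.
func withTimeoutTx(db *sql.DB, f func(context.Context, *sql.Tx) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return fmt.Errorf("begin transaction: %w", err)
	}
	if err := f(ctx, tx); err != nil {
		_ = tx.Rollback()
		return err
	}
	return tx.Commit()
}
Because database/sql rolls the transaction back once the context's deadline expires, statements run via the *sql.Tx stop blocking at that point as well.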
@tomponline The problem is that dqlite hangs forever when the database is unreachable. If this happens during an Exec or ExecContext then it will block. The problem I'm showing is that this is still a possibility, because the driver does a COMMIT (there are many more such calls) with a context that isn't bound to a timeout, so it can hang.
You can do something like this to set a default timeout if not specified by the passed context:
Except the context isn't available via the *Tx, and while the conn does have a context, it's deprecated, so we shouldn't attempt to use it.
FWIW LXD sets a default timeout on the transaction rather than each query:
We do the same thing in Juju.
https://github.com/juju/juju/blob/09db847e363a757089dcb569158d7ea0dc3a3a7a/database/txn/transaction.go#L172-L173
Does cancelling the transaction not end the query too? I've not observed LXD getting stuck on Execs when the transaction is cancelled.
Except the context isn't available via the *Tx, and while the conn does have a context, it's deprecated, so we shouldn't attempt to use it.
Agreed. I suppose if using Exec and not ExecContext one would not expect there to be a default timeout in the driver and indefinite blocking is a possibility (in any driver, not just dqlite).
The reason I added the 10s timeout to BeginTx in LXD was that I found it did cause tx.Exec() to stop if dqlite was blocking for some reason. We did see some indefinite blocking before that, and we don't see it anymore.
Maybe the non-tx-related functions, such as plain Exec, Ping, etc., could have a default timeout added, and functions that accept a context, like ExecContext, BeginTx, etc., could add one if the supplied context doesn't have a deadline?
That way folks won't find themselves blocked indefinitely, but also the driver is never overriding an externally requested timeout period.
If the context has a deadline attached, then the following is enacted. For callers that don't supply a context with a deadline, we could set one there also (sketched below).
https://github.com/canonical/go-dqlite/blob/fd457b7f08f8c466640e940d4d8fa26e8ee8ad1c/internal/protocol/protocol.go#L58-L62
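To make that concrete, a sketch of the kind of helper such context-accepting methods could use (purely illustrative; ensureDeadline is not an existing go-dqlite function, and the package name is a placeholder):
package dqlitedriver

import (
	"context"
	"time"
)

// ensureDeadline adds a default deadline only when the caller's context does
// not already carry one, so an externally requested timeout is never overridden.
func ensureDeadline(ctx context.Context, d time.Duration) (context.Context, context.CancelFunc) {
	if _, ok := ctx.Deadline(); ok {
		return ctx, func() {}
	}
	return context.WithTimeout(ctx, d)
}
A method like ExecContext or BeginTx could then call ctx, cancel := ensureDeadline(ctx, 30*time.Second); defer cancel() before driving the protocol, leaving any caller-supplied deadline untouched.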
That way folks won't find themselves blocked indefinitely, but also the driver is never overriding an externally requested timeout period.
Sounds like a plan to me.
If the context has a deadline attached, then the following is enacted. For callers that don't supply a context with a deadline, we could set one there also.
https://github.com/canonical/go-dqlite/blob/fd457b7f08f8c466640e940d4d8fa26e8ee8ad1c/internal/protocol/protocol.go#L58-L62
Am I correct that the only place where a user can't provide a context is tx.Commit()?
Not sure why the go/sql package has chosen to not support contexts for tx.Commit() (and I don't see any newly introduced extension that changes that), perhaps because that is kind of dangerous and leads to undefined behavior in general (has the tx succeeded or not?). In any case a COMMIT query in dqlite can't hang forever, because at some point (relatively quickly with default values, a few seconds) the leader will step down and stop trying to commit if it can't reach a majority.
@SimonRichardson are you positively seeing a tx.Commit() call that hangs?
The logs attached in the very first message of this issue seem to show a retry loop for getting a driver.Conn, which I would say is caused by Connector.Connect() not being passed a ctx with a deadline. I didn't check Go's code, but I'd expect the ctx passed to Connector.Connect() to be the same as whatever the user has passed to tx.ExecContext and friends.
If tx.Exec() is being used instead of tx.ExecContext(), that might explain it, and we may want to add a default timeout to our Connector.Connect() implementation, as also suggested by the Go documentation for Connector.Connect:
// The provided context.Context is for dialing purposes only
// (see net.DialContext) and should not be stored or used for
// other purposes. A default timeout should still be used
// when dialing as a connection pool may call Connect
// asynchronously to any query.
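A minimal sketch of that advice applied to a driver.Connector implementation (illustrative only; the type, its dial field, and the 10-second value are not go-dqlite's actual code):
package dqlitedriver

import (
	"context"
	"database/sql/driver"
	"time"
)

// Connector is a hypothetical driver.Connector used to illustrate the advice
// quoted above: apply a default dial timeout even when the connection pool
// calls Connect with a context that has no deadline.
type Connector struct {
	dial func(ctx context.Context) (driver.Conn, error) // illustrative dial hook
}

// Connect bounds the dial with a default timeout when the supplied context
// has no deadline, so an unreachable server cannot block the pool forever.
func (c *Connector) Connect(ctx context.Context) (driver.Conn, error) {
	if _, ok := ctx.Deadline(); !ok {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, 10*time.Second)
		defer cancel()
	}
	return c.dial(ctx)
}

// Driver is required by driver.Connector; a real implementation would return
// the parent driver rather than nil.
func (c *Connector) Driver() driver.Driver { return nil }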
I'm not totally convinced that a default timeout should be added to protocol.Call() though, as proposed here, since, as said, tx.Commit() should not hang, and everything else can be controlled by the user, unless I'm missing something.
Maybe the non-tx-related functions, such as plain Exec, Ping, etc., could have a default timeout added, and functions that accept a context, like ExecContext, BeginTx, etc., could add one if the supplied context doesn't have a deadline? That way folks won't find themselves blocked indefinitely, but also the driver is never overriding an externally requested timeout period.
As mentioned, as far as I understand it, the only place where we would need to add a default timeout to achieve the behavior above is protocol.Connector.Connect().
are you positively seeing a tx.Commit() call that hangs?
Not 100% positive yet.
FWIW I kind of see why you wouldn't want tx.Commit() to get a context parameter. Essentially it seems that what go/sql supports is effectively sugar around transaction rollback (you just cancel the context you pass to the various tx methods), which is a normal database operation. On the other hand, once a COMMIT has been issued there's usually no way to roll back, and it either succeeds or fails.
If Juju's main use case for cancellation/rollback is graceful shutdown (e.g. of a worker), then I'd say that a top-level context that you can cancel would be the way to go (as opposed to timeouts), and I presume that's already the case in Juju's code. Assuming that's true, as long as all juju's database calls get passed a context linked to that top-level one, then all transactions should already immediately abort/rollback as soon as the top-level context is cancelled (except for tx.Commit() which will finish in at most a few seconds, possibly returning an error due to cluster unavailability). Relying on timeouts for graceful shutdown would actually be sub-optimal. If juju's database calls are not already all making use of context.Context, it'd be nice to change that, in order to support a more streamlined graceful shutdown sequence.
and I presume that's already the case in Juju's code
This is correct.
If juju's database calls are not already all making use of context.Context, it'd be nice to change that
We're using context.
Ok, so assuming that the only hang observed is the one originally pasted at the very beginning of this issue (the connect loop of protocol.Connector.Connect()), we'd need to investigate why the cancellation of Juju's top-level context is not propagated down to that loop.
Let me get back to you on this. I've not been able to replicate it.
First thing to check on the go-dqlite side would be that cancelling a context does actually, if indirectly, interrupt that loop. Roughly something like the following (sketched in code below):
- Run some successful queries against an in-memory cluster of dqlite.Node instances.
- Stop the dqlite cluster by calling dqlite.Node.Close() against all nodes, but leave the db object open.
- Spawn a goroutine that calls db.ExecContext(ctx, "...").
- Cancel the ctx and check that it stops the connect loop that the go/sql code should have triggered by calling protocol.Connector.Connect() in order to get a working connection.
- The goroutine should terminate immediately.
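A rough sketch of that check, assuming db is a *sql.DB backed by the dqlite driver and the cluster has already been stopped as described (the helper name, table name, and timings are placeholders):
package dqlitetest

import (
	"context"
	"database/sql"
	"testing"
	"time"
)

// checkCancelUnblocksConnect verifies that cancelling the context unblocks
// the ExecContext call (and therefore the connect retry loop behind it).
func checkCancelUnblocksConnect(t *testing.T, db *sql.DB) {
	ctx, cancel := context.WithCancel(context.Background())

	done := make(chan error, 1)
	go func() {
		// This should block in the connect retry loop, since no node
		// is reachable any more.
		_, err := db.ExecContext(ctx, "INSERT INTO test (n) VALUES (1)")
		done <- err
	}()

	// Give the goroutine a moment to reach the connect loop, then cancel.
	time.Sleep(100 * time.Millisecond)
	cancel()

	select {
	case err := <-done:
		if err == nil {
			t.Fatal("expected an error once the context was cancelled")
		}
	case <-time.After(5 * time.Second):
		t.Fatal("ExecContext did not return after the context was cancelled")
	}
}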
Sorry, I'm just now starting to look at this. For my understanding:
# 0x46e7204 database/sql.(*Stmt).Exec+0x44 /usr/local/go/src/database/sql/sql.go:2651
Isn't it expected that you hang forever if you don't use ExecContext? Can't you replace existing Exec calls with ExecContext? Same with Ping, can't you use PingContext?
I agree though that it's prudent to add a default context with a sane timeout to protocol.Connector.Connect.
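For example, a bounded health check along those lines might look like this (illustrative; the package name and 5-second value are placeholders):
package dbutil

import (
	"context"
	"database/sql"
	"time"
)

// healthCheck replaces a bare db.Ping() with PingContext bound to a deadline,
// so the health check cannot block indefinitely if no node is reachable.
func healthCheck(db *sql.DB) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return db.PingContext(ctx)
}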
Yes, we are using <thing>Context exclusively now.
I leave it to the discretion of the Dqlite team as to the specific mitigation, but I feel this behaviour would violate the assumptions of consumers.