
driver: Detect EOF in driverError

MathieuBordere opened this pull request 2 years ago • 27 comments

Fixes https://github.com/canonical/go-dqlite/issues/182

Signed-off-by: Mathieu Borderé [email protected]

MathieuBordere avatar Apr 26 '22 10:04 MathieuBordere

@MathieuBordere once this is merged, please can you create a new release version and tag for this so we can update LXD's dependencies ready for LXD 5.1?

tomponline avatar Apr 26 '22 10:04 tomponline

With this change you would be able to re-merge the dqlite change that was reverted, right?

freeekanayaka avatar Apr 26 '22 10:04 freeekanayaka

With this change you would be able to re-merge the dqlite change that was reverted, right?

Yes indeed.

tomponline avatar Apr 26 '22 10:04 tomponline

But maybe I should test for EOF at the end of the other error handling? I'm just a bit confused by err being reassigned a couple of times e.g. do we get the original err back here? https://github.com/canonical/go-dqlite/blob/5724311830ff770590d89ae48b2ea40c6566dc95/driver/driver.go#L775

MathieuBordere avatar Apr 26 '22 10:04 MathieuBordere

I think it is fine as it is, since it will find EOF anywhere in the error chain (it calls Unwrap internally).
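
For example, a minimal sketch of that behaviour with the stdlib errors package (an illustration, not go-dqlite code):

package main

import (
	"errors"
	"fmt"
	"io"
)

func main() {
	// An EOF wrapped with the %w verb, as the stdlib may do for TLS
	// connections since Go 1.13.
	err := fmt.Errorf("remote error: %w", io.EOF)

	// errors.Is unwraps the chain internally, so it finds io.EOF
	// even when it is not the outermost error.
	fmt.Println(errors.Is(err, io.EOF)) // true
}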

tomponline avatar Apr 26 '22 10:04 tomponline

@MathieuBordere can you fix up the test golint issues please (looks like just comment annotation issues)

https://github.com/canonical/go-dqlite/pull/186/files

tomponline avatar Apr 26 '22 10:04 tomponline

@MathieuBordere can you fix up the test golint issues please (looks like just comment annotation issues)

https://github.com/canonical/go-dqlite/pull/186/files

Done. The `expected type, found '='` error on `type NodeInfo = client.NodeInfo` doesn't show up when running golint locally and is maybe related to https://github.com/golang/lint/issues/335

MathieuBordere avatar Apr 26 '22 11:04 MathieuBordere

But maybe I should test for EOF at the end of the other error handling? I'm just a bit confused by err being reassigned a couple of times e.g. do we get the original err back here?

https://github.com/canonical/go-dqlite/blob/5724311830ff770590d89ae48b2ea40c6566dc95/driver/driver.go#L775

You have a good point. Yes, I do think it's a good idea to place this at the end of the chain, in the default statement, and yes you should be fine with the err object returned by root.Unwrap().

So hopefully something like:

default:
		// When using a TLS connection, the underlying error might get
		// wrapped by the stdlib itself with the new errors wrapping
		// conventions available since go 1.13. In that case we check
		// the underlying error with Unwrap() instead of Cause().
		if root, ok := err.(unwrappable); ok {
			err = root.Unwrap()
		}
		// Note: io.EOF is a sentinel error value, not a type, so a
		// type switch (e.g. `case *io.EOF:`) can't match it; compare
		// it directly instead.
		if _, ok := err.(*net.OpError); ok || err == io.EOF {
			log(client.LogDebug, "network connection lost: %v", err)
			return driver.ErrBadConn
		}

should work.

freeekanayaka avatar Apr 26 '22 11:04 freeekanayaka

One thing that concerns me is that I believe the proxy (both the LXD one and the go-dqlite/app one) might actually depend on getting back an io.EOF for graceful stopping and cleanup. It might just be speculation, but the whole mechanism seems to have gotten a bit wild and might need some attention: designing and implementing a good system for leadership and network-related error handling and propagation, as we were saying in the other PR too.
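
As a sketch of that dependency (assuming the proxy loop boils down to an io.Copy, which is an assumption about its shape rather than the actual go-dqlite/LXD code):

package main

import (
	"fmt"
	"io"
	"strings"
)

// proxy illustrates why a copy loop may rely on io.EOF: io.Copy treats
// EOF from the reader as a clean end of stream and returns nil, so only
// other errors surface as real failures.
func proxy(dst io.Writer, src io.Reader) error {
	if _, err := io.Copy(dst, src); err != nil {
		return fmt.Errorf("proxy: %w", err)
	}
	return nil // graceful stop
}

func main() {
	var sb strings.Builder
	fmt.Println(proxy(&sb, strings.NewReader("hello"))) // <nil>
}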

freeekanayaka avatar Apr 26 '22 11:04 freeekanayaka

I suppose if the lxd tests pass OK then it should be fine

tomponline avatar Apr 26 '22 11:04 tomponline

I suppose if the lxd tests pass OK then it should be fine

suite is running now, will report back.

MathieuBordere avatar Apr 26 '22 12:04 MathieuBordere

There are also tests in this repo that check that the result set iterators return io.EOF at the end of the rows, so if they pass that will be encouraging. But moving the check to the default: section shouldn't hurt either, since it gives the more specific error handlers an opportunity to check first.
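
That contract comes from database/sql/driver; a minimal sketch (hypothetical rows type, not the actual go-dqlite implementation) of what those tests exercise:

package main

import (
	"database/sql/driver"
	"fmt"
	"io"
)

// rows is a hypothetical driver.Rows implementation: Next must return
// io.EOF once the result set is exhausted, which is what the result
// set iterator tests check for.
type rows struct {
	values [][]driver.Value
	i      int
}

func (r *rows) Columns() []string { return []string{"n"} }
func (r *rows) Close() error      { return nil }

func (r *rows) Next(dest []driver.Value) error {
	if r.i >= len(r.values) {
		return io.EOF // end of rows, not a broken connection
	}
	copy(dest, r.values[r.i])
	r.i++
	return nil
}

func main() {
	r := &rows{values: [][]driver.Value{{int64(1)}}}
	dest := make([]driver.Value, 1)
	fmt.Println(r.Next(dest))           // <nil>
	fmt.Println(r.Next(dest) == io.EOF) // true
}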

tomponline avatar Apr 26 '22 12:04 tomponline

I will mark this as draft for now and investigate further before quickly patching up the issue uncovered by that dqlite PR; I don't want to add to the mess.

I will handle the dqlite leadership loss in a different way than closing the connection.

MathieuBordere avatar Apr 26 '22 12:04 MathieuBordere

I will handle the dqlite leadership loss in a different way than closing the connection.

Which way are you thinking?

tomponline avatar Apr 26 '22 12:04 tomponline

I will handle the dqlite leadership loss in a different way than closing the connection.

Which way are you thinking?

It will close the SQLite leader connection to the database, eliminating the assert you see in the LXD test suite. It will finish the ongoing SQL request with LEADERSHIP_LOST, and all future DB queries on that connection will return LEADERSHIP_LOST, causing the driver to mark the connection as bad; see https://github.com/canonical/go-dqlite/blob/5724311830ff770590d89ae48b2ea40c6566dc95/driver/driver.go#L752
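
A minimal sketch of that driver-side mapping (the sentinel error names below are assumptions for illustration; the real driver decodes numeric error codes from the wire):

package main

import (
	"database/sql/driver"
	"errors"
	"fmt"
)

// Assumed stand-ins for dqlite's NOT_LEADER and LEADERSHIP_LOST responses.
var (
	errNotLeader      = errors.New("not leader")
	errLeadershipLost = errors.New("leadership lost")
)

// toDriverError maps both errors to driver.ErrBadConn, telling
// database/sql to discard the connection and retry on a fresh one.
func toDriverError(err error) error {
	if errors.Is(err, errNotLeader) || errors.Is(err, errLeadershipLost) {
		return driver.ErrBadConn
	}
	return err
}

func main() {
	fmt.Println(toDriverError(errLeadershipLost) == driver.ErrBadConn) // true
}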

MathieuBordere avatar Apr 26 '22 12:04 MathieuBordere

It will close the SQLite leader connection to the database, eliminating the assert you see in the LXD test suite. It will finish the ongoing SQL request with LEADERSHIP_LOST, and all future DB queries on that connection will return LEADERSHIP_LOST, causing the driver to mark the connection as bad; see

https://github.com/canonical/go-dqlite/blob/5724311830ff770590d89ae48b2ea40c6566dc95/driver/driver.go#L752

Looking again at the description of #354, I feel there might be a bit of confusion about what LEADERSHIP_LOST is supposed to mean/do.

Essentially, LEADERSHIP_LOST should be returned when the client asks for a commit and that commit can't be executed because the node lost leadership while waiting for a quorum on the frames command log entry it had produced as soon as it received the commit request from the client.

In that case the client can't make any assumption about whether the commit was (or will be) actually successful, because there's no way to know if a quorum of nodes received the entry. This is basically the situation described in section 6.3 ("Implementing linearizable semantics") of the Raft dissertation, and the solution indicated there is to implement client sessions and request IDs in the FSM. With that in place, the client would retry the commit request as soon as it finds a new leader; using sessions and request IDs, the new leader would know whether it is a brand new commit request or actually a duplicated one (since the original one got committed even though the client received a LEADERSHIP_LOST error). This design is also described in the dqlite docs (see the "Client sessions" section at the end of the page), but it is still to be implemented.
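
A minimal sketch (assumed names, not dqlite's actual FSM) of that session/request-ID deduplication idea:

package main

import "fmt"

// fsm records the last request ID applied per client session, so a
// commit retried after a LEADERSHIP_LOST error is recognized as a
// duplicate instead of being applied twice.
type fsm struct {
	lastApplied map[uint64]uint64 // session ID -> last applied request ID
}

// apply runs commit only if this (session, request) pair is new.
func (f *fsm) apply(session, request uint64, commit func()) bool {
	if f.lastApplied[session] >= request {
		return false // duplicate retry: already committed
	}
	commit()
	f.lastApplied[session] = request
	return true
}

func main() {
	f := &fsm{lastApplied: map[uint64]uint64{}}
	fmt.Println(f.apply(1, 1, func() {})) // true: first attempt applies
	fmt.Println(f.apply(1, 1, func() {})) // false: retry is deduplicated
}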

Now, after having described what LEADERSHIP_LOST was meant to be, let's look at the problematic sequence of events described in #354 :

  • Node 1 is leader, Node 2 is follower
  • Node 1 receives BEGIN - QUERY XXXX - ... (notice that it doesn't receive a COMMIT message)
  • Node 1 transfers leadership to Node 2, or loses leadership in some way

^^^ at this point what I believe should happen is:

  • For the client: as soon as it tries to send the COMMIT message, it will get back a NOT_LEADER error, since the gateway will notice that it's not the leader anymore. The Go driver should already handle that and propagate a driver.ErrBadConn, so the connection gets closed by the client itself (both the network connection and the associated internal SQLite object/connection), with no need for the server to do anything special.
  • For the server: as soon as it detects that the node is no longer the leader (using the monitor_cb), it rolls back any uncommitted transactions in the VFS, using VfsAbort().

And then, proceeding in the problematic sequence of events:

  • Node 2 writes some data to the db
  • Node 1 receives the frames from Node 2, tries to apply them to the DB, and triggers an assert due to the pending transaction in the leader connection from step 2 that is holding locks.

^^^ at this point the assertion shouldn't be triggered since the server had rolled back that pending transaction.

What I described should result in relatively small changes in the monitor_cb to issue the relevant VfsAbort() calls, and nothing else. In the future we might want to tackle the much larger problem of proper LEADERSHIP_LOST handling using client sessions and request IDs.

freeekanayaka avatar Apr 26 '22 13:04 freeekanayaka

Looking again at the description of #354, I feel there might be a bit of confusion about what LEADERSHIP_LOST is supposed to mean/do. [...] What I described should result in relatively small changes in the monitor_cb to issue the relevant VfsAbort() calls, and nothing else.

I understand your reasoning, but VfsAbort doesn't get rid of the assert: it's a read lock that is still being held in the test, as it's just a read transaction.

integration-test: src/vfs.c:2372: vfsInvalidateWalIndexHeader: Assertion 'shm->shared[4] == 0' failed.

while VfsAbort basically just releases an exclusive lock: https://github.com/canonical/dqlite/blob/bc1cb2d51cb7e0c3c72d3233c3dfa5c1453d33a8/src/vfs.c#L2449

I thought that closing the SQLite connection would be a foolproof way to get rid of all the state introduced by the start of the transaction.

MathieuBordere avatar Apr 26 '22 13:04 MathieuBordere

I understand your reasoning, but VfsAbort doesn't get rid of the assert [...] I thought that closing the SQLite connection would be a foolproof way to get rid of all the state introduced by the start of the transaction.

Yeah, I was going to follow up, since I just realized that. Indeed, VfsAbort should be used in the future for the LEADERSHIP_LOST error, because in that case there will be a write lock.

In this case, instead of using VfsAbort, I'd suggest using a normal "ROLLBACK" SQL statement, e.g. sqlite3_exec(conn, "ROLLBACK", NULL, NULL, NULL). That should do it.

freeekanayaka avatar Apr 26 '22 14:04 freeekanayaka

Closing the connection altogether probably works too, but we'd end up in a bit of a mixed state in the gateway, which feels potentially fragile, or will always need some care (i.e. we have to know that there are cases where the leader object or SQLite object is not there because it was closed elsewhere). Having a single path for releasing resources usually simplifies things.

freeekanayaka avatar Apr 26 '22 14:04 freeekanayaka

Yeah, I was going to follow up, since I just realized that. Indeed, VfsAbort should be used in the future for the LEADERSHIP_LOST error, because in that case there will be a write lock.

In this case, instead of using VfsAbort, I'd suggest using a normal "ROLLBACK" SQL statement, e.g. sqlite3_exec(conn, "ROLLBACK", NULL, NULL, NULL). That should do it.

Yeah, I thought about this too, but there is a case where this can lead to inconsistent data, I think:

  • Client issues BEGIN
  • Client issues some queries
  • Node to which the client is connected loses leadership
  • We abort the transaction with ROLLBACK
  • SQLite is now in autocommit mode
  • Node to which the client is connected regains leadership
  • Client continues issuing queries (these will autocommit, while the queries from earlier will not be committed)
  • Client issues COMMIT (this will fail, as there is no open BEGIN).

MathieuBordere avatar Apr 26 '22 14:04 MathieuBordere

Yeah, I thought about this too, but there is a case where this can lead to inconsistent data, I think: [...] Client issues COMMIT (this will fail, as there is no open BEGIN).

Okay, thanks for pointing this out. I actually had another idea, which I felt was better but a bit more subtle; your point makes me feel that perhaps it's the best approach.

Basically, we would not do anything in the monitor_cb when the raft state changes, because there's always the chance that the node regains leadership and everything can proceed just as it was (like in your example).

Instead, we should turn the assert(shm->shared[i] == 0) in vfsInvalidateWalIndexHeader() into a check: "if there are pending read transactions, roll them back, assuming they were started when this node was the leader and we now have proof that there is another leader that has sent us new frames".

The only two things that I'm not totally sure about are:

  1. Do we have a way to turn that "assuming they were started when this node was the leader" part into something we can actually check instead of assume? That would avoid masking genuine cases where the assertion is valid and we should blow up because the read transaction was not started when the node was the leader.
  2. What does it mean to roll back here? Probably releasing the read lock using vfsShmUnlock() should be enough, but I didn't check.

Point 1. is kind of minor, but something that would be nice to have.

freeekanayaka avatar Apr 26 '22 14:04 freeekanayaka

One possible way to solve point 1. would be to use some sort of flag that gets set in the monitor_cb call when converting from leader to follower: if the node regains leadership, the flag would be unset (by monitor_cb, when noticing that the new state is leader); if the node receives frames from a new leader, the VFS would see that flag and positively know that it has pending state to roll back.

freeekanayaka avatar Apr 26 '22 14:04 freeekanayaka

I don't really feel too comfortable with that; it feels a bit brittle, as I don't fully understand it. I'd rather have SQLite itself do the cleanup for us of whatever state it introduced. I think a case can be made that a connection to a leader node that has at some point lost leadership is "busted" and should be cleaned up, not reused.

I would propose to close the SQLite leader connection and add extra checks in the functions that now assume that gateway->leader can never be NULL.

MathieuBordere avatar Apr 26 '22 14:04 MathieuBordere

If you close the SQLite connection only (and I guess destroy the struct leader object too?), and not the network one, how do you communicate that to the client? Assuming that no information gets sent to the client within the context of the monitor_cb (but correct me if that's not what you think), then the client will eventually try to issue a COMMIT statement, and what happens in that case if the node has regained leadership in the meantime? (as per your example).

freeekanayaka avatar Apr 26 '22 14:04 freeekanayaka

If you close the SQLite connection only (and I guess destroy the struct leader object too?), and not the network one, how do you communicate that to the client? Assuming that no information gets sent to the client within the context of the monitor_cb (but correct me if that's not what you think), then the client will eventually try to issue a COMMIT statement, and what happens in that case if the node has regained leadership in the meantime? (as per your example).

I would always reply "LEADERSHIP_LOST" in that case (or some other error message if you prefer); basically, the connection will be unusable and should be interpreted that way by the client/driver.

So yes, it can happen that a leader node reports "LEADERSHIP_LOST", but that just means the driver should try a new connection; it's a bit dirty.

MathieuBordere avatar Apr 26 '22 14:04 MathieuBordere

I would always reply "LEADERSHIP_LOST" in that case (or some other error message if you prefer); basically, the connection will be unusable and should be interpreted that way by the client/driver.

Okay, so essentially you would set a flag on the gateway object (like the leadership_lost flag in your very first PR)? I would suggest not using LEADERSHIP_LOST then (since it has different semantics, i.e. "retry with new leader"), but something else instead, e.g. SQLITE_NOTFOUND or anything more appropriate (e.g. one of the SQLITE_IOERR_XXX codes).

Or maybe no flag is needed, since the LOOKUP_DB(ID) call in gateway.c already checks for req->gateway->leader == NULL and returns SQLITE_NOTFOUND in that case.

freeekanayaka avatar Apr 26 '22 15:04 freeekanayaka

Okay, so essentially you would set a flag on the gateway object (like the leadership_lost flag in your very first PR)? I would suggest not using LEADERSHIP_LOST then (since it has different semantics, i.e. "retry with new leader"), but something else instead, e.g. SQLITE_NOTFOUND or anything more appropriate (e.g. one of the SQLITE_IOERR_XXX codes).

Or maybe no flag is needed, since the LOOKUP_DB(ID) call in gateway.c already checks for req->gateway->leader == NULL and returns SQLITE_NOTFOUND in that case.

Ok, I'll open a PR for that.

MathieuBordere avatar Apr 26 '22 15:04 MathieuBordere

I'll give this another go; if the LXD test suite is happy, I'll probably merge it.

MathieuBordere avatar May 11 '23 08:05 MathieuBordere

Thanks! :)

tomponline avatar May 11 '23 08:05 tomponline

Possibly related, possibly to be implemented in conjunction with this change.

https://pkg.go.dev/database/sql/driver

All Conn implementations should implement the following interfaces: Pinger, SessionResetter, and Validator.

type SessionResetter interface {
	// ResetSession is called prior to executing a query on the connection
	// if the connection has been used before. If the driver returns ErrBadConn
	// the connection is discarded.
	ResetSession(ctx context.Context) error
}

type Validator interface {
	// IsValid is called prior to placing the connection into the
	// connection pool. The connection will be discarded if false is returned.
	IsValid() bool
}
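
A minimal sketch (hypothetical conn type and bad flag, not the actual go-dqlite driver) of what implementing both interfaces could look like:

package main

import (
	"context"
	"database/sql/driver"
	"fmt"
)

// conn is a hypothetical driver connection; bad is assumed to be set
// whenever a fatal protocol or network error was observed (e.g. the
// server reported that leadership was lost).
type conn struct {
	bad bool
}

// IsValid implements driver.Validator: returning false makes
// database/sql discard the connection instead of pooling it.
func (c *conn) IsValid() bool {
	return !c.bad
}

// ResetSession implements driver.SessionResetter: returning
// driver.ErrBadConn before reuse also discards the connection.
func (c *conn) ResetSession(ctx context.Context) error {
	if c.bad {
		return driver.ErrBadConn
	}
	return nil
}

func main() {
	c := &conn{bad: true}
	fmt.Println(c.IsValid())                          // false
	fmt.Println(c.ResetSession(context.Background())) // driver: bad connection
}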

MathieuBordere avatar May 11 '23 15:05 MathieuBordere