
Fix long timeout when API & Schema updates happen at once

Open masnax opened this issue 1 year ago • 10 comments

Because the API extension system is contingent on the presence of a particular schema update, we fall into a temporary deadlock when applying the first API extension. This PR addresses that by splitting up the notifications for an upgraded schema and an upgraded API.

To give an example, imagine 3 nodes.

  1. node01 runs snap refresh and detects that its expected schema version is ahead of the other cluster members, so it waits.

  2. node02 runs snap refresh and detects the same thing, but is only blocked on node03.

  3. node03 runs snap refresh, and since the previous two nodes have already updated their expected schema versions, this node can proceed with committing the changes to the schema.

  4. node03 now progresses to comparing API versions, since its updated schema supports this. It detects that it has the highest expected API version, because node01 and node02 did not yet have the necessary schema updates in the earlier steps to record their expected API versions. node03 therefore waits for those nodes to notice that the schema has been committed, so that they can record their expected API versions for node03 to compare against.

So you get into a situation where all 3 nodes are waiting for each other. After 30s the loop repeats, so node01 detects that its schema version matches the other nodes' and then waits only on node02 to update its API version. Node02 finally does this, and the database opens for access.

So, because the schema update that introduces the concept of API extensions arrives in the same upgrade that also increments the number of API extensions, the upgrade process takes at least 30s in this case.


To fix this, we can split up how we notify other nodes that an upgrade can be performed. In the above case at bullet 4, node03 would instead immediately report to node01 and node02 when it completes the schema upgrade. Those nodes will unblock and be free to record their local API extensions to the database, repeating steps 1, 2, and 3, but for API upgrades instead of schema upgrades, which have already completed.

To accomplish this, we need to split up the implementation of db.Open. Normally, db.Open sets the status of the database to available by closing its openCanceller. Instead, we will now have two functions:

  • Init - starts dqlite, verifies and applies schema upgrades (or blocks), prepares statements.

  • Open - verifies API upgrades, and finally sets the database to be available by closing the openCanceller.

When bootstrapping, we will run Init and Open in sequence. When joining or restarting, we will first run Init and then, if we are not blocked, send a notification to the rest of the cluster that they may unblock on schema upgrades. Finally, we run Open and send another notification that nodes may unblock on API upgrades.
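Roughly, the split could look something like this (just a sketch; the startDqlite, waitForSchemaUpgrade, waitForAPIUpgrade and notifyUpgrade helpers are placeholders for the behaviour described above, not the final API):

```go
package db

import "context"

// Sketch only: the struct and helpers below are placeholders, not
// microcluster's real types.
type DB struct {
	openCanceller chan struct{} // closed once the database is ready for use
}

// Init starts dqlite, verifies and applies schema upgrades (or blocks until
// the other members have recorded a matching schema version), and prepares
// statements. It does not yet mark the database as available.
func (db *DB) Init(ctx context.Context) error {
	if err := db.startDqlite(ctx); err != nil {
		return err
	}

	if err := db.waitForSchemaUpgrade(ctx); err != nil {
		return err
	}

	return db.prepareStatements()
}

// Open verifies API upgrades and finally sets the database to be available
// by closing the openCanceller.
func (db *DB) Open(ctx context.Context) error {
	if err := db.waitForAPIUpgrade(ctx); err != nil {
		return err
	}

	close(db.openCanceller)
	return nil
}

// startOrJoin is the join/restart path: notify the cluster after each phase
// so other members can unblock on schema upgrades first, then on API upgrades.
func (db *DB) startOrJoin(ctx context.Context) error {
	if err := db.Init(ctx); err != nil {
		return err
	}

	if err := db.notifyUpgrade(ctx, "schema"); err != nil {
		return err
	}

	if err := db.Open(ctx); err != nil {
		return err
	}

	return db.notifyUpgrade(ctx, "api")
}

// Placeholder stubs for the real implementations.
func (db *DB) startDqlite(ctx context.Context) error                { return nil }
func (db *DB) waitForSchemaUpgrade(ctx context.Context) error       { return nil }
func (db *DB) waitForAPIUpgrade(ctx context.Context) error          { return nil }
func (db *DB) prepareStatements() error                             { return nil }
func (db *DB) notifyUpgrade(ctx context.Context, kind string) error { return nil }
```

The key point is that closing openCanceller moves from the end of Init to the end of Open, so the database is only reported as available once both the schema and API checks have passed.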

We will distinguish between API/Schema upgrade notifications using a query parameter on the /internal/database endpoint.
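On the receiving side, the handler could just branch on that parameter (again a sketch; the upgrade-type parameter name and the unblock helpers are placeholders, not the final wire format):

```go
package internalapi

import "net/http"

// Sketch only: handles a notification on /internal/database and distinguishes
// the two upgrade phases via a hypothetical "upgrade-type" query parameter.
func databaseHandler(w http.ResponseWriter, r *http.Request) {
	switch r.URL.Query().Get("upgrade-type") {
	case "schema":
		// The sender has committed schema upgrades; members blocked on the
		// schema phase can now record and compare their API extensions.
		unblockSchemaWaiters()
	case "api":
		// The sender has recorded its API extensions; members blocked on the
		// API phase can re-compare and open the database.
		unblockAPIWaiters()
	default:
		http.Error(w, "unknown upgrade type", http.StatusBadRequest)
	}
}

// Placeholder stubs.
func unblockSchemaWaiters() {}
func unblockAPIWaiters()    {}
```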

masnax · May 03 '24 21:05

@masnax how is this sort of thing achieved in lxd?

tomponline · May 08 '24 10:05

Shouldn't schema updates happen first on the leader (once dqlite is established), and then the nodes move on to per-node patches? I believe this is the order it's done in lxd, but worth checking.

tomponline · May 08 '24 10:05

@masnax how is this sort of thing achieved in lxd?

LXD does it exactly the way it's currently done in microcluster. But since we introduced the API extension component after the fact, we can't check both at the same time, because the schema update needs to run first.

Shouldn't schema updates happen first on the leader (once dqlite is established), and then the nodes move on to per-node patches? I believe this is the order it's done in lxd, but worth checking.

Wouldn't they always implicitly happen on the leader, as only the leader POSTs to the database?

masnax · May 08 '24 14:05

Wouldn't they always implicitly happen on the leader, as only the leader POSTs to the database?

I don't follow you here; the leader surely has direct access to the DB and doesn't need to POST via HTTP?

tomponline · May 08 '24 16:05

Wouldn't they always implicitly happen on the leader, as only the leader POSTs to the database?

I don't follow you here; the leader surely has direct access to the DB and doesn't need to POST via HTTP?

Sorry, POST was a poor choice of words. I recall from our discussions on global locking in LXD that when we write to the dqlite database, the write is always performed by the leader, while reads can be served by any member.

masnax · May 08 '24 16:05

That's correct. That's why I don't understand how the deadlock you're describing is occurring and why there are notifications going on.

Also I'm struggling to follow what you mean by an "API update" - do you mean a per-member patch? If not, what is an API update if not a schema update?

tomponline · May 08 '24 16:05

That's correct. That's why I don't understand how the deadlock you're describing is occurring and why there are notifications going on.

Imagine 3 nodes have schema version 1 and no concept of API extensions. Imagine schema version 2 introduces the concept of API extensions, and the first API extension is introduced at the same time, so the list is 1 item long.

  • You sequentially run the update on all 3 nodes.
  • Node01 updates its row in the internal_cluster_members table to record that it wants schema version 2 (which introduces API extensions), but detects from dqlite that no other node is at version 2 yet, so it waits.
  • Node02 updates its row to schema version 2 as well, and detects that one node (node03) is still at version 1, so it waits.
  • Node03 updates its row to schema version 2, and detects that all other nodes are at least at the same version. It commits all schema updates, so now finally the schema itself has changed across the cluster. Then, since the column was just created, it updates its row in the internal_cluster_members table to record that it wants 1 API extension. Node03 then checks API extensions in the database and sees all other nodes are at 0, so it waits.

All 3 nodes are waiting, so for 30 seconds nothing happens until each node hits a timeout, at which point they loop back and check the schema versions of the other nodes. Since all nodes now expect schema version 2, every node can proceed to the API extension verification step that node03 initiated.

This 30s timeout scenario will trigger every single time we add a new schema update and a new API extension at the same time.

The "wait" step is waiting for an API hit from whoever committed the schema updates, which unblocks a channel and allows the daemon startup process to continue. LXD does the same thing here: https://github.com/canonical/lxd/blob/ee205c8df469fff25d4030c8e720e86dad3a991e/lxd/daemon.go#L1262
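For illustration, the wait in question looks roughly like this (a sketch; versionsMatch and the channel wiring are placeholders for the real checks against the recorded versions):

```go
package db

import (
	"context"
	"time"
)

// Sketch only: upgradeCh is the channel unblocked by a hit on
// /internal/database, and versionsMatch stands in for comparing our expected
// version against what the other members have recorded.
func waitForClusterUpgrade(ctx context.Context, upgradeCh <-chan struct{}) error {
	for {
		// Compare our expected version with the other members' records.
		done, err := versionsMatch(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}

		// We are ahead of at least one member: wait for a notification from
		// whoever commits the upgrade, or give up after 30s and re-check.
		// The timeout path is what every member ends up taking in the
		// deadlock described above.
		select {
		case <-upgradeCh:
		case <-time.After(30 * time.Second):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// Placeholder stub.
func versionsMatch(ctx context.Context) (bool, error) { return true, nil }
```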

Also I'm struggling to follow what you mean by an "API update" - do you mean a per-member patch? If not, what is an API update if not a schema update?

By the nodes recording their "API updates", I mean the nodes committing their locally defined set of API extensions (in LXD we store the size of the list) to the dqlite database, so that the next node can compare its own record to the others'.

LXD takes it for granted that the schema includes a column somewhere for this data. In microcluster, we can't do this, so we need to fully reconcile all schema updates before we can compare and commit API extensions for each node.
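As a rough sketch of what recording and comparing the counts means here (the api_extensions column name and both helpers are placeholders, since the real column only exists once the schema update introducing it has run):

```go
package db

import (
	"context"
	"database/sql"
)

// recordAPIExtensions commits this member's local API extension count (LXD
// stores the size of the extension list) so that other members can compare
// against it.
func recordAPIExtensions(ctx context.Context, tx *sql.Tx, name string, count int) error {
	_, err := tx.ExecContext(ctx,
		`UPDATE internal_cluster_members SET api_extensions = ? WHERE name = ?`,
		count, name)
	return err
}

// apiExtensionsInSync reports whether every member has recorded at least our
// extension count, i.e. nobody is still behind us.
func apiExtensionsInSync(ctx context.Context, tx *sql.Tx, count int) (bool, error) {
	var behind int
	err := tx.QueryRowContext(ctx,
		`SELECT count(*) FROM internal_cluster_members WHERE api_extensions < ?`,
		count).Scan(&behind)
	if err != nil {
		return false, err
	}

	return behind == 0, nil
}
```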

masnax · May 08 '24 17:05

This 30s timeout scenario will trigger every single time we add a new schema update and a new API extension at the same time.

The reason is that if we detect a new schema update, we can't be sure whether this update is the one that introduces the concept of API extensions, and as such we can't pre-emptively compare our API extension count to the other nodes'. So we have to wait for some cluster member to see that all nodes expect the same schema version, so that the leader can actually apply the schema updates, thus ensuring that a column for API extensions exists in the database. But whichever node gets to this point will just immediately move on to comparing its API extension count to the rest of the cluster's, before notifying the other members that they are able to do the same.

masnax · May 09 '24 19:05

The reason is that if we detect a new schema update, we can't be sure whether this update is the one that introduces the concept of API extensions

I believe in LXD we used one-off patch logic for schema changes that couldn't be done in the schema patch system and where we needed the nodes to be in sync; see https://github.com/canonical/lxd/blob/93d1d05c2f6374a571eae6a157d4928a2aeefcc9/lxd/patches.go#L390

This way we didn't need to complicate the normal DB patch process, but instead handled these one-off (or at least infrequent) schema changes in the local-node patch system.

tomponline · May 10 '24 08:05

The reason is that if we detect a new schema update, we can't be sure whether this update is the one that introduces the concept of API extensions

I believe in LXD we used one-off patch logic for schema changes that couldn't be done in the schema patch system and where we needed the nodes to be in sync; see https://github.com/canonical/lxd/blob/93d1d05c2f6374a571eae6a157d4928a2aeefcc9/lxd/patches.go#L390

This way we didn't need to complicate the normal DB patch process, but instead handled these one-off (or at least infrequent) schema changes in the local-node patch system.

Looks like LXD handles this by storing the previously run patches in the local database. Microcluster doesn't have a local database, so we would have to write this to a file in the state directory or use the global database as a third schema version. The difference in LXD's case is that we don't send a notification to the rest of the cluster when we have a new patch to run.

It's actually a bit concerning to me that LXD doesn't communicate to the rest of the cluster when a particular node runs a patch that affects the cluster's whole database. What happens if a node is updated with a patch, but without a corresponding schema or API update, while the rest of the cluster has not been updated? Wouldn't every other node be left in a broken state?

masnax · May 13 '24 14:05