microcluster Queries return `Error: Daemon not yet initialized` while waiting for cluster member upgrades

If one member of the cluster gets upgraded to a version that introduces a new schema upgrade, queries made from the upgraded member result in a "Daemon not yet initialized" error, which is a bit misleading. Example:

$ microovn cluster list
Error: Daemon not yet initialized

I'm using microovn as an example because I have it at hand, but the error comes verbatim from calling microcluster.internal.rest.client.Client.GetClusterMembers().

Looking at the logs, I get a more informative message:

level=warning msg="Waiting for other cluster members to upgrade their versions" address="https://10.5.3.129:6443"

This state persists until all cluster members are upgraded, at which point everything goes back to normal. Is it possible to get a more precise error message, like the one in the logs, when calling the API via the MicroCluster client? Alternatively, are there any checks I could perform before making requests that would give me this message? I tried MicroCluster.Ready() and MicroCluster.Status(), but both returned only a generic "cluster is not ready" message.

Jun 16 '23 12:06 mkalcok

The reason you're getting that error is that, after the upgrade, the daemon has genuinely stalled its on-start initialization process and is waiting for the other peers to be available before opening the database.

As of now it's not possible to discern the reason that the database is offline, only whether or not it is. Out of curiosity, what's your use case for needing to know whether the daemon is offline because of a version mismatch or some other reason?

Jun 27 '23 16:06 masnax

It's just a matter of conveying the right information to the user. A hypothetical situation:

There's a 3-node cluster.
The snap on one node gets automatically updated.
The rest of the cluster members are still running the old DB because the automatic snap refresh has not occurred yet.
A user comes along and tries to run some commands.

In this case the user would be told that the cluster is not yet initialized, which is misleading. At least to me, that sounds like I should bootstrap or join to get this node running, when in fact I just need to get the rest of the cluster updated.

Feel free to close this issue if the fix is not feasible. I just thought I'd raise this in case there's a relatively easy way to distinguish the reasons for the DB being offline.

Jun 27 '23 19:06 mkalcok

In this case, we would need to modify the db.IsOpen function as well as the db.Open function to discern between the different reasons that the database is offline.

Currently we just check if we have a database struct, and if so then whether its context is cancelled. I think we can try to instantiate the database in either case, and then have a status field on the struct that reports what stage of the startup we failed at.

It can be something like:

ready - the database is operational, so IsOpen returns no error
waiting - the database is non-operational because it's waiting for a cluster upgrade (or just an abnormally slow start), so IsOpen should return a descriptive error
database is nil - the database is totally uninitialized, so we should return an error saying as much.

This would mean changing the signature of IsOpen to return an error instead of a bool.
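A rough sketch of that idea, not microcluster's actual code: the DB struct, the Status type, the sentinel errors and the NewDB, SetStatus and Close helpers are all hypothetical names made up for illustration. It assumes the struct is instantiated early and starts out in the waiting state.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Status is a hypothetical startup stage recorded on the database struct.
type Status int

const (
	StatusWaiting Status = iota // not operational yet: cluster upgrade pending or slow start
	StatusReady                 // operational
)

// Hypothetical sentinel errors, one per reason the database can be offline.
var (
	ErrNotInitialized = errors.New("database is not yet initialized")
	ErrWaitingUpgrade = errors.New("database is waiting for other cluster members to upgrade")
	ErrClosed         = errors.New("database has been closed")
)

// DB is a stand-in for the real database struct; locking is omitted for brevity.
type DB struct {
	ctx    context.Context
	cancel context.CancelFunc
	status Status
}

// NewDB instantiates the struct before startup completes, so it begins in StatusWaiting.
func NewDB() *DB {
	ctx, cancel := context.WithCancel(context.Background())
	return &DB{ctx: ctx, cancel: cancel, status: StatusWaiting}
}

// SetStatus records the stage the startup has reached.
func (db *DB) SetStatus(s Status) { db.status = s }

// Close cancels the database context.
func (db *DB) Close() { db.cancel() }

// IsOpen returns nil when the database is operational, or an error naming why it is not.
func (db *DB) IsOpen() error {
	if db == nil {
		return ErrNotInitialized
	}
	if db.ctx.Err() != nil {
		return ErrClosed
	}
	if db.status != StatusReady {
		return ErrWaitingUpgrade
	}
	return nil
}

func main() {
	var missing *DB
	fmt.Println(missing.IsOpen()) // nil struct: never initialized

	db := NewDB()
	fmt.Println(db.IsOpen()) // instantiated but still waiting

	db.SetStatus(StatusReady)
	fmt.Println(db.IsOpen()) // operational
}
```

Callers that only care whether the database is usable can keep comparing the result against nil, while callers that want the reason can match on the specific error.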

Feb 16 '24 20:02 masnax

Even better, we can utilize https://github.com/canonical/lxd/blob/6aca66d10de94cc4a9b22cae57d1619efafd8d01/shared/api/error.go#L10 here and specify an HTTP status error so that projects using microcluster don't have to actually parse the error message to discern why microcluster's database isn't open, and can instead just check the HTTP status code.
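To illustrate what checking the status code could look like on the consumer side, here is a minimal sketch. StatusError below is a local stand-in type, not the LXD helper linked above, and the dbStateError, listMembers and statusCode functions as well as the specific status codes (412 and 503) are assumptions chosen purely for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// StatusError is a local, hypothetical stand-in for an error carrying an HTTP status code.
type StatusError struct {
	Code int
	Msg  string
}

func (e StatusError) Error() string { return e.Msg }

// dbStateError maps a database state to a status-carrying error.
// The codes used here are illustrative, not what microcluster returns.
func dbStateError(initialized, waiting bool) error {
	switch {
	case !initialized:
		return StatusError{Code: http.StatusPreconditionFailed, Msg: "Daemon not yet initialized"}
	case waiting:
		return StatusError{Code: http.StatusServiceUnavailable, Msg: "Waiting for other cluster members to upgrade their versions"}
	}
	return nil
}

// listMembers mimics a client call that wraps the underlying error with context.
func listMembers(initialized, waiting bool) error {
	if err := dbStateError(initialized, waiting); err != nil {
		return fmt.Errorf("failed to list cluster members: %w", err)
	}
	return nil
}

// statusCode extracts the HTTP status code from err, if it carries one.
func statusCode(err error) (int, bool) {
	var se StatusError
	if errors.As(err, &se) {
		return se.Code, true
	}
	return 0, false
}

func main() {
	err := listMembers(true, true)
	// The caller branches on the code instead of parsing the message text.
	if code, ok := statusCode(err); ok && code == http.StatusServiceUnavailable {
		fmt.Println("cluster upgrade in progress:", err)
	}
}
```

Because the code travels with the error through wrapping, a project such as microovn could show its own user-facing message for the upgrade case without depending on the exact wording of the error string.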

Feb 16 '24 20:02 masnax

If there is no problem, I will assign this to myself.

Mar 21 '24 21:03 hamistao