
No useful output from lxc commands after downgrade

Open abentley opened this issue 2 years ago • 5 comments

Required information

  • Distribution: Ubuntu
  • Distribution version: 22.04
  • The output of "lxc info" or if that fails:
    • Kernel version: 5.15.0-1012-kvm
    • LXC version: 4.0
    • LXD version: 4.0
    • Storage backend in use: dir

Issue description

If the database has been initialized, even if lxd init has not been run, using "snap refresh" to downgrade will make most or all lxc commands (and lxd init) useless.

It seems like virtually any lxc command, not just lxd init, will initialize the database. Once it has been initialized, downgrading lxd will mean that the schema is invalid for that downgraded version. Having the wrong schema will prevent any communication on the socket. Commands will die with messages like Error: Get "http://unix.socket/1.0": EOF
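
A script that wraps lxc can at least recognise this failure and point at the daemon log instead of stopping at the bare EOF. This is a minimal sketch, assuming the client error looks like the message quoted above; the function name `classify_lxc_error` is hypothetical and not part of LXD:

```shell
#!/bin/sh
# Hypothetical helper: classify the stderr text of a failed lxc command.
# Assumes the failure resembles the message quoted in this issue.
classify_lxc_error() {
    case "$1" in
        *'unix.socket'*': EOF'*)
            # The daemon closed the socket without answering, so the
            # real cause has to be read from the daemon log.
            echo "daemon-unreachable"
            ;;
        *)
            echo "other"
            ;;
    esac
}
```

For example, `classify_lxc_error "$(sudo lxc ls 2>&1)"` could be used to decide whether to print a pointer to the daemon log.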

What I expect:

  • non-mutating operations like lxc list or lxd init --dump do not have any observable side effects. In particular, they do not break downgrading.
  • when the DB schema is invalid, lxc commands produce an error message saying that the DB schema is invalid. (This implies that lxd starts up in a very limited mode when the DB schema is invalid, instead of falling over.) Ideally they say which version of LXD is compatible with the DB.
  • when /usr/bin/snap refresh --channel=4.0/stable lxd would produce a broken configuration, it does nothing and exits with an error. Presumably there would be a --force option to override.
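
The last expectation can be approximated on the user side with a wrapper around snap. The sketch below is not something snap or LXD provides: `is_downgrade`, `guarded_refresh` and the `FORCE` variable are hypothetical names, and it assumes both channels have the form track/risk with a numeric track (for example 5.0/stable) and that GNU `sort -V` is available:

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when moving from channel $1 to
# channel $2 would lower the LXD track, e.g. 5.0/stable -> 4.0/stable.
# Relies on GNU `sort -V` for version ordering.
is_downgrade() {
    current=${1%%/*}   # "5.0/stable" -> "5.0"
    target=${2%%/*}
    [ "$current" = "$target" ] && return 1
    lowest=$(printf '%s\n%s\n' "$current" "$target" | sort -V | head -n 1)
    [ "$lowest" = "$target" ]
}

# Hypothetical wrapper: refuse a downgrade unless FORCE=1 is set.
guarded_refresh() {
    if is_downgrade "$1" "$2" && [ "${FORCE:-0}" != 1 ]; then
        echo "refusing to downgrade lxd from $1 to $2 (set FORCE=1 to override)" >&2
        return 1
    fi
    sudo /usr/bin/snap refresh --channel="$2" lxd
}
```

With this, `guarded_refresh 5.0/stable 4.0/stable` exits with an error and changes nothing, while `FORCE=1 guarded_refresh 5.0/stable 4.0/stable` performs the refresh.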

Steps to reproduce

(recommended to do this in a vm launched with lxc launch --vm ubuntu:22.04)

  1. sudo /usr/bin/snap install --channel=5.0/stable lxd
  2. sudo lxc ls
  3. sudo /usr/bin/snap refresh --channel=4.0/stable lxd
  4. sudo lxc ls

Information to attach

  • [ X ] Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)
  • [ X ] Output of the client with --debug

lxd.log debug-output.txt

abentley, Jun 22 '22 22:06

It would also be nice if there were a way to run lxd init to recover from this situation, e.g. lxd init --overwrite --preseed.

abentley, Jun 23 '22 12:06

I'm not sure there's a lot we can do about this. The reason is that:

  • There is no such thing as "initializing LXD", lxd init isn't special in any way, it just uses the normal REST API to set up storage, network, default profile, ...
  • The LXD database is needed for any of the API to function so it's automatically initialized on daemon startup
  • Similarly schema updates must be applied prior to any DB access, so they're applied extremely early on startup
  • LXD is socket activated on Ubuntu, so will start up when any lxc command is run or when anything else hits the unix socket
  • The lxc tool is just a REST API client, when the REST API isn't available because the daemon refused to start, it cannot connect and can't tell why

What we do to try and help with those situations is:

  • A clear downgrade error should be visible in the LXD log (/var/snap/lxd/common/lxd/logs/lxd.log)
  • A similar error should also be visible in journalctl -u snap.lxd.daemon
  • On DB upgrades, LXD makes a backup of the DB at /var/snap/lxd/common/lxd/database/global.bak, this can be restored should a downgrade be needed
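
The points above can be sketched as a pair of shell functions. This is an illustration, not an official recovery procedure: the function names and the `DRY_RUN` switch are hypothetical, and it assumes that the live database directory sits next to the backup as database/global and that `snap stop lxd` is enough to keep the daemon from restarting while the files are moved.

```shell
#!/bin/sh
# Base path quoted in the comment above.
LXD_DIR=${LXD_DIR:-/var/snap/lxd/common/lxd}

# Print each command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Show the two places where the downgrade error should be visible.
show_daemon_errors() {
    run sudo tail -n 50 "$LXD_DIR/logs/lxd.log"
    run sudo journalctl -u snap.lxd.daemon -n 50 --no-pager
}

# Assumed procedure: stop the daemon, keep the current database aside,
# then put the pre-upgrade backup in its place and start again.
restore_global_backup() {
    db="$LXD_DIR/database"
    run sudo snap stop lxd &&
    run sudo mv "$db/global" "$db/global.newer" &&
    run sudo cp -a "$db/global.bak" "$db/global" &&
    run sudo snap start lxd
}
```

Running `DRY_RUN=1 restore_global_backup` first prints the commands without touching anything, which makes it easy to check the paths before doing it for real.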

@ru-fu we probably ought to add a doc page on the upgrade behavior that would more directly cover this

stgraber, Jun 23 '22 14:06

  • There is no such thing as "initializing LXD", lxd init isn't special in any way, it just uses the normal REST API to set up storage, network, default profile, ...

Is there a reason why it can't be special? There's nothing to stop lxd init --overwrite from deleting the database, is there? Since it's not an lxc command, it seems perfectly reasonable to support local-only operations.

  • The LXD database is needed for any of the API to function so it's automatically initialized on daemon startup

Until there's mutation, you have a choice about whether to preserve the database after an operation.

  • The lxc tool is just a REST API client, when the REST API isn't available because the daemon refused to start, it cannot connect and can't tell why

As I mentioned, supporting a sane error message implies that the daemon doesn't just fall over when the schema is invalid. You have the option of allowing the daemon to run with an invalid schema, and on every connection report that the schema is invalid.

abentley, Jun 23 '22 15:06

Is there a reason why it can't be special? There's nothing to stop lxd init --overwrite from deleting the database, is there? Since it's not an lxc command, it seems perfectly reasonable to support local-only operations.

lxd init is run as an unprivileged user and so doesn't have the write access needed to wipe the database, nor does it know what init system you're using and how to restart LXD afterwards. So this kind of thing is actually better done by the user.

To properly reset LXD, the procedure usually is:

  • rm -Rf /var/snap/lxd/common/lxd
  • reboot

The reboot part also takes care of wiping any kernel state, disks, networks, ... which may be in place as merely getting rid of the database doesn't handle that.

stgraber, Jun 23 '22 16:06

As I mentioned, supporting a sane error message implies that the daemon doesn't just fall over when the schema is invalid. You have the option of allowing the daemon to run with an invalid schema, and on every connection report that the schema is invalid.

LXD requires database access to set up the network listeners and requires the daemon config to be read from the database to set up the API handlers. The easiest way we could do something like this would be to have a completely separate listener and API handler just for this one case.

stgraber, Jun 23 '22 16:06