
Storage: Add optimized volume refresh for Ceph RBD

Open roosterfish opened this issue 1 year ago • 11 comments

Overview

This PR adds optimized volume refresh for block volumes of the Ceph RBD storage driver using the export-diff/import-diff functions for transferring an incremental stream of snapshots.

At its core the functionality is already there for sparse local volume copies and the initial copy (migration) of volumes between remotes. See https://ceph.io/en/news/blog/2013/incremental-snapshots-with-rbd/ for some details on the methodology.

Local volume refreshes simply use the new approach, while for migration a new type is added.

Migration protocol change

In order to allow refreshing VMs that also have a filesystem volume, a new migration type RBD_AND_RSYNC is added. It allows refreshing the VM's root volume using the optimized approach and falls back to rsync for the corresponding filesystem volume. This new type lets the source and target exchange rsync features, as is done for the BLOCK_AND_RSYNC type.

Additionally, this new type replaces the RBD migration type, as RBD_AND_RSYNC can also be used for volumes with content type filesystem; those simply use only the rsync portion of the migration type. This also means that for migrations to and from older LXD servers, the negotiated migration type falls back to either RSYNC or BLOCK_AND_RSYNC for the initial copy, since RBD is no longer offered. For volumes with content type filesystem, the refresh continues to use rsync. For volumes with content type block, the new optimized approach is used if both source and target LXD have agreed on RBD_AND_RSYNC, which is the default when both ends support it.
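
The fallback described above boils down to first-match negotiation over ordered preference lists. Below is a minimal standalone sketch of that idea, assuming a simplified type in place of LXD's protobuf MigrationFSType enum; the helper and names are illustrative, not the actual migration API.

package main

import "fmt"

// fsType stands in for the protobuf MigrationFSType enum; the values are illustrative.
type fsType string

const (
	typeRBDAndRsync   fsType = "RBD_AND_RSYNC"
	typeRBD           fsType = "RBD"
	typeBlockAndRsync fsType = "BLOCK_AND_RSYNC"
	typeRsync         fsType = "RSYNC"
)

// negotiate returns the first type the local side supports that the peer also offers.
// The order of `supported` expresses local preference, so RBD_AND_RSYNC wins when both
// ends are new enough, while older peers fall back to RSYNC or BLOCK_AND_RSYNC.
func negotiate(supported, offered []fsType) (fsType, bool) {
	for _, s := range supported {
		for _, o := range offered {
			if s == o {
				return s, true
			}
		}
	}
	return "", false
}

func main() {
	newTarget := []fsType{typeRBDAndRsync, typeBlockAndRsync, typeRsync}
	oldSource := []fsType{typeRBD, typeBlockAndRsync, typeRsync}

	chosen, ok := negotiate(newTarget, oldSource)
	// Prints "BLOCK_AND_RSYNC true": the older peer falls back. Later in this thread,
	// RBD is kept as a second offer precisely to avoid this fallback for initial copies.
	fmt.Println(chosen, ok)
}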

RBD_AND_RSYNC behavior

First, the target gets reset to the last common snapshot between source and target. If there aren't any snapshots on the target, the volume gets recreated. The same happens when refreshing a volume that doesn't have any snapshots.

The incremental stream starts by determining the most recent common volume snapshot that is present on both the source and the target:

  1. If snapshots are missing on the target, a diff is generated between each missing snapshot and its predecessor: rbd export-diff vol@{snap} --from-snap {predecessor snap} --> rbd import-diff vol2. The snapshot gets created automatically on the target when the diff is imported. This process is repeated for every snapshot missing on the target (a sketch of this pipeline follows the list).
  2. If the number of snapshots is equal on both the source and the target, the entire diff since the last common snapshot is transferred: rbd export-diff vol --from-snap {last common} --> rbd import-diff vol2
  3. If the volume on the source doesn't have any snapshots, all of its contents are transferred to the target: rbd export-diff vol --> rbd import-diff vol2. In this case the diff contains all of the volume's data.
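
For illustration, here is a minimal standalone Go sketch of the export-diff/import-diff pipeline used in the cases above. The pool/image names are placeholders, and LXD's real driver streams the diff over the migration connection rather than piping two local rbd processes together.

package main

import (
	"fmt"
	"os"
	"os/exec"
)

// sendSnapshotDiff pipes "rbd export-diff" on the source image into "rbd import-diff" on
// the target image. fromSnap may be empty (full diff) and snap may be empty (diff up to
// the current volume state).
func sendSnapshotDiff(sourceImage, targetImage, snap, fromSnap string) error {
	spec := sourceImage
	if snap != "" {
		spec = fmt.Sprintf("%s@%s", sourceImage, snap)
	}

	exportArgs := []string{"export-diff"}
	if fromSnap != "" {
		exportArgs = append(exportArgs, "--from-snap", fromSnap)
	}
	// "-" tells export-diff to write the diff to stdout and import-diff to read it from stdin.
	exportArgs = append(exportArgs, spec, "-")

	exportCmd := exec.Command("rbd", exportArgs...)
	importCmd := exec.Command("rbd", "import-diff", "-", targetImage)

	r, w, err := os.Pipe()
	if err != nil {
		return err
	}
	exportCmd.Stdout = w
	importCmd.Stdin = r

	if err := exportCmd.Start(); err != nil {
		return err
	}
	if err := importCmd.Start(); err != nil {
		return err
	}

	// Close the parent's pipe ends so import-diff sees EOF once export-diff exits.
	_ = w.Close()
	_ = r.Close()

	if err := exportCmd.Wait(); err != nil {
		return err
	}

	// import-diff creates the end snapshot on the target automatically.
	return importCmd.Wait()
}

func main() {
	// Case 1 from the list above: replay a snapshot that is missing on the target.
	err := sendSnapshotDiff("pool/vol", "pool/vol2", "snap2", "snap1")
	if err != nil {
		fmt.Println("refresh failed:", err)
	}
}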

Speed improvement

On my test system, using a 10GiB root volume for each of the instances (including two snapshots that already exist on both source and target), the following speed improvements can be observed (without any changes to the root volume since the last snapshot):

  • VM (old approach): 31.3s
  • VM (new approach): 3.1s

Those values of course differ depending on the actual volume size and the amount of data that has changed since the last snapshot.

Fixes https://github.com/canonical/lxd/issues/12668 Fixes https://github.com/canonical/lxd/issues/12721

roosterfish avatar Jan 18 '24 15:01 roosterfish

Due to https://github.com/canonical/lxd/issues/12744 I wasn't yet able to test the filesystem UUID generation.

roosterfish avatar Jan 18 '24 15:01 roosterfish

@roosterfish can be rebased and tested now. Thanks

tomponline avatar Jan 19 '24 14:01 tomponline

@roosterfish as you're looking at refreshes currently, could you take a look at https://github.com/lxc/incus/pull/419 and see if we could do with cherry-picking that? Thanks

tomponline avatar Jan 23 '24 09:01 tomponline

@tomponline linting is hitting hard here, but besides this could you please have a first look? :) Both local and remote volume refresh are now in place. I have also updated the PR's description regarding support for older LXD versions when refreshing between remotes.

It looks like custom Btrfs and XFS block volumes (on Ceph?) don't work on current main. From what I can see, they are missing the filesystem right after creation. So this is still something to sort out before I can test the Btrfs/XFS FS regeneration.

roosterfish avatar Jan 31 '24 17:01 roosterfish

Try rebasing as @markylaing relaxed some rules

tomponline avatar Jan 31 '24 18:01 tomponline

It looks like custom Btrfs and XFS block volumes (on Ceph?) don't work on current main

Do you mean for custom filesystem and container volumes? We should add a test for that to this repo then.

tomponline avatar Jan 31 '24 18:01 tomponline

It looks like custom Btrfs and XFS block volumes (on Ceph?) don't work on current main

Do you mean for custom filesystem and container volumes? We should add a test for that to this repo then.

I got confused by --type block volumes and block.filesystem. Whilst the first indicates an actual block volume, the second applies to drivers whose block volumes get a filesystem on top. This doesn't affect containers and custom filesystem volumes.

But we should fix the fact that a --type block custom volume can have the block.filesystem config key. There is a check for this, but it only applies to the pool's config, not the config specified during creation (https://github.com/canonical/lxd/blob/main/lxd/storage/drivers/driver_ceph_volumes.go#L799), e.g. lxc storage volume create ceph vol --type block block.filesystem=btrfs.

For the LVM driver this isn't an issue because it checks for the right content type (https://github.com/canonical/lxd/blob/main/lxd/storage/drivers/driver_lvm_volumes.go#L1228). We have to add such a check to the Ceph driver as well (keeping backwards compatibility with existing custom volumes that already have this config key).
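
For illustration, a minimal sketch of the kind of validation this would need; the function and type names below are made up and do not reflect the actual Ceph driver code.

package main

import (
	"errors"
	"fmt"
)

type contentType string

const (
	contentTypeFS    contentType = "filesystem"
	contentTypeBlock contentType = "block"
)

// validateVolumeConfig sketches the check proposed above: block.filesystem only makes
// sense for volumes that get a filesystem on top, so reject it for --type block volumes,
// while tolerating existing volumes that already carry the key for backwards compatibility.
func validateVolumeConfig(content contentType, config map[string]string, existing bool) error {
	if _, ok := config["block.filesystem"]; ok && content == contentTypeBlock && !existing {
		return errors.New(`"block.filesystem" cannot be set for block content type volumes`)
	}
	return nil
}

func main() {
	// Would reject: lxc storage volume create ceph vol --type block block.filesystem=btrfs
	err := validateVolumeConfig(contentTypeBlock, map[string]string{"block.filesystem": "btrfs"}, false)
	fmt.Println(err)
}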

roosterfish avatar Feb 01 '24 08:02 roosterfish

@roosterfish are you continuing on with this one next?

tomponline avatar Feb 20 '24 16:02 tomponline

@roosterfish are you continuing on with this one next?

Yes, next one on the list.

roosterfish avatar Feb 20 '24 16:02 roosterfish

I was just able to reproduce the initial error again during testing:

lxc cp c1 c2 --refresh
Error: Failed to create file "/var/snap/lxd/common/lxd/containers/c2/backup.yaml": open /var/snap/lxd/common/lxd/containers/c2/backup.yaml: bad message

I am wondering if we write different backup.yaml files at the target in certain situations. I'll continue investigating this. Will move the PR back into draft state for now.

roosterfish avatar Mar 04 '24 08:03 roosterfish

@tomponline this is now ready for review again. I was fighting one missing bit regarding the multi-sync flag when copying containers, which is fixed with https://github.com/canonical/lxd/pull/12743/commits/39f272433a71c39628758aaa0deed09f229a1b41.

What has changed since the last time:

  • The migration type RBD is now deprecated in favor of RBD_AND_RSYNC. This also required some changes in places where previously only the types RSYNC and BLOCK_AND_RSYNC were handled.
  • The RBD migration type is now marked as deprecated in protobuf, as there is no longer any reference to its corresponding variable in Go.
  • The logic for finding the last common snapshot is now heavily reduced and covered by a unit test (much easier to read now; see the sketch after this list).
  • The docs now reflect when you can expect an optimized transfer/refresh
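
For illustration, a minimal sketch of what finding the last common snapshot can look like, assuming both sides list their snapshots oldest first and that a shared history means matching names at matching positions; this is not the PR's actual implementation.

package main

import "fmt"

// lastCommonSnapshot walks both snapshot lists (oldest first) in lockstep and remembers
// the most recent name they agree on. It returns "" when the volumes share no snapshot,
// in which case the target volume gets recreated instead of refreshed incrementally.
func lastCommonSnapshot(source, target []string) string {
	common := ""
	for i := 0; i < len(source) && i < len(target); i++ {
		if source[i] != target[i] {
			break
		}
		common = source[i]
	}
	return common
}

func main() {
	source := []string{"snap0", "snap1", "snap2"}
	target := []string{"snap0", "snap1"}
	// Prints "snap1": only the diff from snap1 to snap2 still has to be transferred.
	fmt.Println(lastCommonSnapshot(source, target))
}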

roosterfish avatar Mar 06 '24 16:03 roosterfish

I wonder if we could still support RBD mode only if offered from the source, so that migrating an instance from an older host to a newer one stays efficient and doesn't require a full block transfer.

So if the source is older and offers RBD, and the volume isn't available on the target (not a refresh), then the newer LXD target host would respond with RBD mode as well and expect RBD mode from the source.

For refreshes, or if the source is also newer and offers RBD_AND_RSYNC, the newer target would respond with RBD_AND_RSYNC, and this would cause older sources to fall back.

A newer source host targeting an older host would always offer RBD_AND_RSYNC and fall back.

tomponline avatar Mar 06 '24 21:03 tomponline

I wonder if we could still support RBD mode only if offered from the source, so that migrating an instance from an older host to a newer one stays efficient and doesn't require a full block transfer.

That's it, thanks for the suggestion! I was voting for deprecating RBD as I hadn't tested migrating the other way around. But I can confirm this works for an older LXD that tries to send to a newer LXD using RBD, as the newer LXD can simply fall back to RBD if we put it in the second position of the supported migration types list:

return []migration.Type{
	{
		// Preferred: optimized RBD diffs for block volumes, rsync for filesystem volumes.
		FSType:   migration.MigrationFSType_RBD_AND_RSYNC,
		Features: rsyncFeatures,
	},
	{
		// Kept in second position so older peers that only offer RBD still get an optimized initial copy.
		FSType:   migration.MigrationFSType_RBD,
		Features: rsyncFeatures,
	},
	{
		// Generic fallback for peers that support neither RBD type.
		FSType:   migration.MigrationFSType_BLOCK_AND_RSYNC,
		Features: rsyncFeatures,
	},
}

The same applies for the initial copy of containers.

roosterfish avatar Mar 07 '24 08:03 roosterfish

ready for rebase

tomponline avatar Mar 07 '24 12:03 tomponline

@tomponline Pipeline passed, ready for review.

roosterfish avatar Mar 07 '24 14:03 roosterfish

@roosterfish also, do we have tests in lxd-ci that check for VM and block custom volume refresh?

tomponline avatar Mar 07 '24 14:03 tomponline

LGTM!

One thing I would like you to check separately is what happens in a failure scenario where some of the snapshots have been synced but then a comms issue occurs and the migration is cancelled. Can a refresh be retried successfully, or will the migration complete but leave the volumes in an inconsistent state because the diffs between snapshots were incorrect due to the DB changes being applied but not undone?

That is an interesting thought. My gut feeling is that the error will be propagated back to the LXD pool backend, which would then revert the DB entries, so a subsequent migration refresh will start from scratch. However, there might be some snapshot leftovers in storage that could collide with the ones from the new refresh. Smells like this requires a reverter in the storage driver too.
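
For illustration, a minimal standalone sketch of that idea, mimicking the pattern of LXD's revert helper rather than importing it; the function names and callbacks below are made up.

package main

import "fmt"

// reverter mimics the pattern of LXD's revert helper: record undo steps as work
// progresses and run them in reverse order unless Success() was called.
type reverter struct {
	undo []func()
	done bool
}

func (r *reverter) Add(f func()) { r.undo = append(r.undo, f) }

func (r *reverter) Success() { r.done = true }

func (r *reverter) Fail() {
	if r.done {
		return
	}
	for i := len(r.undo) - 1; i >= 0; i-- {
		r.undo[i]()
	}
}

// refreshVolume sketches the idea from the comment above: every snapshot received on the
// target registers an undo step, so a refresh that fails halfway leaves no stray snapshots
// behind and can simply be retried from scratch.
func refreshVolume(missingSnaps []string, receive func(string) error, deleteSnap func(string)) error {
	r := &reverter{}
	defer r.Fail()

	for _, snap := range missingSnaps {
		if err := receive(snap); err != nil {
			return err // the deferred Fail() removes snapshots imported earlier in this run
		}

		snap := snap // capture for the closure
		r.Add(func() { deleteSnap(snap) })
	}

	r.Success()
	return nil
}

func main() {
	err := refreshVolume([]string{"snap1", "snap2"},
		func(s string) error {
			if s == "snap2" {
				return fmt.Errorf("comms issue while receiving %s", s)
			}
			return nil
		},
		func(s string) { fmt.Println("cleaning up partially transferred snapshot", s) },
	)
	fmt.Println(err)
}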

roosterfish avatar Mar 07 '24 14:03 roosterfish

@roosterfish also do we have tests in lxd-ci that check for VM and block custom volume refresh ?

Looks like there aren't any, I'll add some.

roosterfish avatar Mar 07 '24 15:03 roosterfish