
volume migration: persist the volume migration state in the VM status

Open · alicefr opened this pull request 1 year ago • 23 comments

What this PR does

Before this PR: The information about the volume migration was lost if the VMI was deleted for any reason or the VM was shut down

After this PR: Add the volume migration state in the VM status to persist the information
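
As a rough illustration of the idea (not the PR's exact generated API; all names here are assumptions), the persisted information could look something like this:

```go
// Illustrative sketch only: persisting the source/destination volume pairs
// of a volume migration in the VM status, so the information survives VMI
// deletion or a VM shutdown. Field and type names are assumptions.
type StorageMigratedVolumeInfo struct {
	VolumeName         string `json:"volumeName"`
	SourcePVCName      string `json:"sourcePVCName,omitempty"`
	DestinationPVCName string `json:"destinationPVCName,omitempty"`
}

type VirtualMachineStatus struct {
	// ...existing fields...

	// MigratedVolumes records the volumes involved in the current or most
	// recent volume migration.
	MigratedVolumes []StorageMigratedVolumeInfo `json:"migratedVolumes,omitempty"`
}
```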

Release note

Add the volume migration state in the VM status

alicefr avatar Jul 15 '24 15:07 alicefr

/cc @awels @mhenriks

alicefr avatar Jul 15 '24 15:07 alicefr

Can you talk more about how storing this info in the VM is useful to users? Are they expected to do something when a stopped VM has volume migrations in progress (like rolling back the configuration)? Otherwise, will the migration be resumed when the VM is restarted? Or will it be aborted? I don't see that the VM controller acts on this info when the VM is restarted.

mhenriks avatar Jul 15 '24 18:07 mhenriks

Can you talk more about how storing this info in the VM is useful to users? Are they expected to do something when a stopped VM has volume migrations in progress (like rolling back the configuration)? Otherwise, will the migration be resumed when the VM is restarted? Or will it be aborted? I don't see that the VM controller acts on this info when the VM is restarted.

Yes, the users need to decide what to do after a crashed volume migration since we aren't allowed to patch the VM spec with the old volumes. We have 2 options:

  1. Detect on VM start that the VM is in an inconsistent state because of the migrated volumes and refuse to start until the user reverts the VM spec to the old set of volumes
  2. Simply add the information in the VM status and let the user take action

With @awels we decided on the second option because the first is too restrictive. For example, the user could instead decide to perform an offline copy while the VM is off.

alicefr avatar Jul 16 '24 06:07 alicefr

Yes, the users need to decide what to do after a crashed volume migration since we aren't allowed to patch the VM spec with the old volumes.

Why can't the user roll back to the old volumes? And where is this enforced?

We have 2 options:

Detect on VM start that the VM is in an inconsistent state because of the migrated volumes and refuse to start until the user reverts the VM spec to the old set of volumes

Can you actually do this without adding something to the VM status?

Simply add the information in the VM status and let the user take action

With @awels we decided on the second option because the first is too restrictive. For example, the user could instead decide to perform an offline copy while the VM is off.

What happens if the user does not take action and then attempts to start the VM? Is it possible that the VM will start but be missing data? Also, won't the VM status.migratedVolumes field be cleared when the VM is restarted regardless of whether the VM actually starts or not? How will the user know that the VM is in a funky state because of an aborted migration?

To me, the first option makes more sense. If a user is going to go through the trouble of manually doing offline copies, they can also do something like update the VM spec by removing spec.UpdateVolumeStrategy.

mhenriks avatar Jul 16 '24 14:07 mhenriks

Yes, the users need to decide what to do after a crashed volume migration since we aren't allowed to patch the VM spec with the old volumes.

Why can't the user roll back to the old volumes? And where is this enforced?

Users should roll back to the old volumes, but we don't want to enforce it. What if the user decides to do a CSI clone while the VM is offline?

We have 2 options:

Detect on VM start that the VM is in an inconsistent state because of the migrated volumes and refuse to start until the user reverts the VM spec to the old set of volumes

Can you actually do this without adding something to the VM status?

I don't see how since the VMI is gone

Simply add the information in the VM status and let the user take action

With @awels we decided on the second option because the first is too restrictive. For example, the user could instead decide to perform an offline copy while the VM is off.

What happens if the user does not take action and then attempts to start the VM? Is it possible that the VM will start but be missing data?

Yes, this is possible; for example, the VM might not be bootable.

Also, won't the VM status.migratedVolumes field be cleared when the VM is restarted regardless of whether the VM actually starts or not? How will the user know that the VM is in a funky state because of an aborted migration?

The migratedVolumes aren't cleared on the VM, they will stay there until the next volume migration update.

To me, the first option makes more sense. If a user is going to go through the trouble of manually doing offline copies, they can also do something like update the VM spec by removing spec.UpdateVolumeStrategy.

I think both options are doable, but the second is easier to implement. @awels WDYT?

alicefr avatar Jul 16 '24 14:07 alicefr

The migratedVolumes aren't cleared on the VM, they will stay there until the next volume migration update.

Yes, the users need to decide what to do after a crashed volume migration since we aren't allowed to patch the VM spec with the old volumes.

Why can't the user roll back to the old volumes? And where is this enforced?

Users should roll back to the old volumes, but we don't want to enforce it. What if the user decides to do a CSI clone while the VM is offline?

Ah okay, I think I was confused: "we" in this case refers to virt-controller, not users.

We have 2 options:

Detect on VM start that the VM is in an inconsistent state because of the migrated volumes and refuse to start until the user reverts the VM spec to the old set of volumes

Can you actually do this without adding something to the VM status?

I don't see how since the VMI is gone

Just wanted to confirm that this change (add fields to status) is necessary regardless of which option we choose

Simply add the information in the VM status and let the user take action

With @awels we decided on the second option because the first is too restrictive. For example, the user could instead decide to perform an offline copy while the VM is off.

What happens if the user does not take action and then attempts to start the VM? Is it possible that the VM will start but be missing data?

Yes, this is possible; for example, the VM might not be bootable.

Also, won't the VM status.migratedVolumes field be cleared when the VM is restarted regardless of whether the VM actually starts or not? How will the user know that the VM is in a funky state because of an aborted migration?

The migratedVolumes aren't cleared on the VM, they will stay there until the next volume migration update.

Okay I assumed GenerateMigratedVolumes gets called whenever the VM reconciles and won't have "old" volume info

To me, the first option makes more sense. If a user is going to go through the trouble of manually doing offline copies, they can also do something like update the VM spec by removing spec.UpdateVolumeStrategy.

I think both options are doable, but the second is easier to implement. @awels WDYT?

mhenriks avatar Jul 16 '24 19:07 mhenriks

Just to add my 2 cents to this conversation: we want to store the names of the old volumes somewhere in a persistent fashion. The VMI is not suitable for this since it can disappear when the VM is stopped for whatever reason. The user may or may not know what the original values are, so having them stored somewhere is helpful.

In my migration controller I have the ability to keep track of the volume names, but I would prefer to read them from the real source, the VM. I feel option 2 is more flexible for the user: at this point the VM is probably down, and doing a different copy might be preferable.

awels avatar Jul 16 '24 19:07 awels

The migratedVolumes aren't cleared on the VM, they will stay there until the next volume migration update.

Yes, the users need to decide what to do after a crashed volume migration since we aren't allowed to patch the VM spec with the old volumes.

Why can't the user roll back to the old volumes? And where is this enforced?

Users should roll back to the old volumes, but we don't want to enforce it. What if the user decides to do a CSI clone while the VM is offline?

Ah okay, I think I was confused: "we" in this case refers to virt-controller, not users.

We have 2 options:

Detect on VM start that the VM is in an inconsistent state because of the migrated volumes and refuse to start until the user reverts the VM spec to the old set of volumes

Can you actually do this without adding something to the VM status?

I don't see how since the VMI is gone

Just wanted to confirm that this change (add fields to status) is necessary regardless of which option we choose

Yes, it is required since it is the only object which persists.

Simply add the information in the VM status and let the user take action

With @awels we decided on the second option because the first is too restrictive. For example, the user could instead decide to perform an offline copy while the VM is off.

What happens if the user does not take action and then attempts to start the VM? Is it possible that the VM will start but be missing data?

Yes, this is possible; for example, the VM might not be bootable.

Also, won't the VM status.migratedVolumes field be cleared when the VM is restarted regardless of whether the VM actually starts or not? How will the user know that the VM is in a funky state because of an aborted migration?

The migratedVolumes aren't cleared on the VM, they will stay there until the next volume migration update.

Okay I assumed GenerateMigratedVolumes gets called whenever the VM reconciles and won't have "old" volume info

It will be updated on the next volume update with the migration strategy. In the meantime, you can still see the original and destination volumes. The main reason for keeping this is that if there are any issues with the copy, like a non-bootable device, you can still revert to the previous volumes since the information is available. We cannot know from KubeVirt whether the VM boots correctly without checking inside the guest, for example by pinging the guest agent.
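
To make that recovery step concrete, here is a rough sketch of how a user or an external tool could use the persisted pairs to rebuild the pre-migration volume set (reusing the illustrative StorageMigratedVolumeInfo type sketched earlier; the Volume type and helper are hypothetical):

```go
// Hypothetical helper: given the spec volumes and the migrated-volume pairs
// persisted in the VM status, return a volume list that points every
// affected volume back at its source PVC so the spec can be reverted.
type Volume struct {
	Name    string // volume name in the VM spec
	PVCName string // PVC currently backing this volume
}

func revertVolumes(specVolumes []Volume, migrated []StorageMigratedVolumeInfo) []Volume {
	byName := make(map[string]StorageMigratedVolumeInfo, len(migrated))
	for _, m := range migrated {
		byName[m.VolumeName] = m
	}
	out := make([]Volume, len(specVolumes))
	copy(out, specVolumes)
	for i, v := range out {
		// Only rewrite volumes that were switched to the destination PVC.
		if m, ok := byName[v.Name]; ok && v.PVCName == m.DestinationPVCName {
			out[i].PVCName = m.SourcePVCName
		}
	}
	return out
}
```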

alicefr avatar Jul 17 '24 07:07 alicefr

It will be updated on the next volume update with the migration strategy. In the meantime, you can still see the original and destination volumes. The main reason for keeping this is that if there are any issues with the copy, like a non-bootable device, you can still revert to the previous volumes since the information is available. We cannot know from KubeVirt whether the VM boots correctly without checking inside the guest, for example by pinging the guest agent.

This is good; I just wanted to make sure this info doesn't get lost right away.

mhenriks avatar Jul 17 '24 14:07 mhenriks

It will be updated on the next volume update with the migration strategy. In the meantime, you can still see the original and destination volumes. The main reason for keeping this is that if there are any issues with the copy, like a non-bootable device, you can still revert to the previous volumes since the information is available. We cannot know from KubeVirt whether the VM boots correctly without checking inside the guest, for example by pinging the guest agent.

This is good; I just wanted to make sure this info doesn't get lost right away.

What happens to this new field when a migration completes successfully (happy path)? Does it get cleared?

mhenriks avatar Jul 18 '24 14:07 mhenriks

It will be updated on the next volume update with the migration strategy. In the meantime, you can still see the original and destination volumes. The main reason for keeping this is that if there are any issues with the copy, like a non-bootable device, you can still revert to the previous volumes since the information is available. We cannot know from KubeVirt whether the VM boots correctly without checking inside the guest, for example by pinging the guest agent.

This is good; I just wanted to make sure this info doesn't get lost right away.

What happens to this new field when a migration completes successfully (happy path)? Does it get cleared?

It is cleared on the VMI but not on the VM. On the next volume update, the migrated volumes list will be updated with the new change, but until then, we keep track that there has been a volume migration, together with the source/destination pairs.

EDIT: no difference between the happy path and a failure

alicefr avatar Jul 18 '24 14:07 alicefr

@mhenriks as we discussed offline, I have encapsulated the migrated volumes in a new struct in the VM status. Additionally, I added a bool to signal if the migration was successful or not. Unfortunately, this also required adding a new field in the VMI status. WDYT?
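
Roughly the shape being described, as an illustrative sketch (names are assumptions, not the PR's actual API; the success flag was later replaced by conditions, see below):

```go
// Illustrative sketch: wrapping the migrated volumes and a success flag in a
// dedicated struct on the VM status. Names are assumptions for discussion;
// StorageMigratedVolumeInfo is the type sketched earlier in this thread.
type VolumeMigrationState struct {
	// MigratedVolumes lists the source/destination pairs of the current or
	// most recent volume migration.
	MigratedVolumes []StorageMigratedVolumeInfo `json:"migratedVolumes,omitempty"`

	// Succeeded reports whether that migration completed successfully.
	Succeeded *bool `json:"succeeded,omitempty"`
}
```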

alicefr avatar Jul 23 '24 10:07 alicefr

/test pull-kubevirt-e2e-k8s-1.30-sig-storage

alicefr avatar Jul 23 '24 14:07 alicefr

@vladikr would you mind taking a look at this PR when you have a bit of time?

alicefr avatar Jul 23 '24 14:07 alicefr

I think this looks good. Just to be clear on semantics: vm.status.volumeMigration will always represent the status of the current or last migration. If a user wanted to know whether a migration is currently in progress, they would have to check the VMI. Or is there a VM condition?

mhenriks avatar Jul 26 '24 17:07 mhenriks

I think this looks good. Just to be clear on semantics: vm.status.volumeMigration will always represent the status of the current or last migration. If a user wanted to know whether a migration is currently in progress, they would have to check the VMI. Or is there a VM condition?

When the VolumeChange condition is set to true, the migration is still in progress. This condition is set on the VMI, but it will be synchronized on the VM as well. So, it should be enough to look at the VM object. I still want to add a bool/condition in the volume migration state to indicate whether the migration listed there finished successfully or not.
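
For instance, an external consumer could check for an in-progress migration on the VM object alone with something along these lines (the condition type name is taken from this comment; the Condition type and helper are simplified and hypothetical):

```go
// Simplified sketch: decide whether a volume migration is still in progress
// by looking for a VolumeChange condition with status True on the VM
// (mirrored there from the VMI, per the discussion above).
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
	Reason string
}

func volumeMigrationInProgress(conditions []Condition) bool {
	for _, c := range conditions {
		if c.Type == "VolumeChange" && c.Status == "True" {
			return true
		}
	}
	return false
}
```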

alicefr avatar Jul 29 '24 07:07 alicefr

@mhenriks I added the conditions instead of the bool; could you please take another look? The commit https://github.com/kubevirt/kubevirt/pull/12355/commits/d9c88cade1a0d947b08ee1110e28445625c5cae1 describes how the conditions are set.

alicefr avatar Jul 30 '24 13:07 alicefr

@vladikr @mhenriks after the discussion from yesterday, would you mind taking a look to check whether you agree on the API and volume migration state in f1aed242b883f5d400e420e4a098b5e5d3f12296?

alicefr avatar Aug 14 '24 13:08 alicefr

@vladikr @mhenriks I encapsulated the volume migration state in the volume update state. I'll ask tomorrow in the sig-api if the change is fine for them.
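
So the nesting under discussion would look roughly like this (again purely illustrative, reusing the VolumeMigrationState sketched above; this supersedes the earlier sketch that placed MigratedVolumes directly on the status):

```go
// Illustrative sketch: the volume migration state folded into a broader
// volume update state on the VM status. Names are assumptions.
type VolumeUpdateState struct {
	VolumeMigrationState *VolumeMigrationState `json:"volumeMigrationState,omitempty"`
}

type VirtualMachineStatus struct {
	// ...existing fields...
	VolumeUpdateState *VolumeUpdateState `json:"volumeUpdateState,omitempty"`
}
```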

alicefr avatar Aug 19 '24 10:08 alicefr

@mhenriks @vladikr are you happy with the current changes?

alicefr avatar Aug 22 '24 11:08 alicefr

@mhenriks PTAL

alicefr avatar Aug 26 '24 08:08 alicefr

@vladikr @mhenriks PTAL, I have updated the PR with the new API

alicefr avatar Sep 12 '24 11:09 alicefr

@vladikr @mhenriks PTAL, I have updated the PR with the new API

Thinking about this... Shouldn't we also prevent the VM from booting if ManualRecoveryRequired is set?

vladikr avatar Sep 13 '24 17:09 vladikr

@vladikr @mhenriks PTAL, I have updated the PR with the new API

Thinking about this... Shouldn't we also prevent the VM from booting if ManualRecoveryRequired is set?

If we do that, then the user will have to unset that bit when they recover the volumes, and that could be difficult since the field is in the status.

mhenriks avatar Sep 13 '24 21:09 mhenriks

@vladikr @mhenriks we can think of introducing some restriction for booting. However, I really would like to merge this if we agree on the API. I think we can improve KubeVirt's behavior later.

alicefr avatar Sep 16 '24 08:09 alicefr

The API looks fine to me now. I'm only worried about the functionality. I don't believe users will look at the VM status, and they will run into an unbootable state.

vladikr avatar Sep 16 '24 12:09 vladikr

The API looks fine to me now. I'm only worried about the functionality. I don't believe users will look at the VM status, and they will run into an unbootable state.

You are right. However, please consider that we don't expect this feature to be directly used by regular users but rather by overlaying tooling (migration plan tools or a UI), which could check for this status and advertise the mismatch in a more user-friendly way.
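
For example, such tooling could gate a start/restart request with a check along these lines (whether ManualRecoveryRequired surfaces as a condition type or a reason is an assumption here; the helper reuses the simplified Condition type from the earlier sketch):

```go
// Hypothetical tooling-side check: refuse (or warn about) starting a VM whose
// status signals that the last volume migration needs manual recovery.
// Assumes ManualRecoveryRequired is exposed via a condition on the VM.
func needsManualRecovery(conditions []Condition) bool {
	for _, c := range conditions {
		if c.Type == "ManualRecoveryRequired" || c.Reason == "ManualRecoveryRequired" {
			if c.Status == "True" {
				return true
			}
		}
	}
	return false
}
```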

alicefr avatar Sep 16 '24 12:09 alicefr

/approve

mhenriks avatar Sep 16 '24 13:09 mhenriks

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mhenriks

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

kubevirt-bot avatar Sep 16 '24 13:09 kubevirt-bot

The API looks fine to me now. I'm only worried about the functionality. I don't believe users will look at the VM status, and they will run into an unbootable state.

You are right. However, please consider that we don't expect this feature to be directly used by regular users but rather by overlaying tooling (migration plan tools or a UI), which could check for this status and advertise the mismatch in a more user-friendly way.

Yes, that's correct, but still, we allow this operation for VM users; therefore, each layer should be feature complete. We shouldn't allow users to run into such bad-state situations. It's fine to follow up on this in another PR, but I think it should be in the same release.

vladikr avatar Sep 16 '24 13:09 vladikr