kubevirt / kubevirt
volume migration: persist the volume migration state in the VM status
What this PR does
Before this PR: The information about the volume migration was lost if the VMI was deleted for any reason or the VM was shut down
After this PR: The volume migration state is added to the VM status so the information persists
Release note
Add the volume migration state to the VM status
/cc @awels @mhenriks
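For context, a rough sketch of the kind of state this PR persists in the VM status; the field and type names below are illustrative assumptions for the discussion, not the merged API:

```go
// Illustrative sketch only: a rough shape for the persisted volume migration
// state discussed in this PR. Names are assumptions, not the merged KubeVirt API.
package api

type VolumeMigrationState struct {
	// MigratedVolumes records the source/destination pairs of the current or
	// most recent volume migration; it survives VMI deletion and VM stop.
	MigratedVolumes []MigratedVolume `json:"migratedVolumes,omitempty"`
}

type MigratedVolume struct {
	// VolumeName is the name of the volume in the VM spec.
	VolumeName string `json:"volumeName"`
	// SourceClaim is the PVC the volume was migrated from.
	SourceClaim string `json:"sourceClaim,omitempty"`
	// DestinationClaim is the PVC the volume was migrated to.
	DestinationClaim string `json:"destinationClaim,omitempty"`
}
```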
Can you talk more about how storing this info in the VM is useful to users? Are they expected to do something when a stopped VM has volume migrations in progress (like rollback the configuration)? Otherwise will the migration be resumed when the VM is restarted? Or will it be aborted? I don't see that the VM controller acts on this info when the VM is restarted.
Yes, the users need to decide what to do about a crashed volume migration, since we aren't allowed to patch the VM spec with the old volumes. We have 2 options:
- Detect on VM start that the VM is in an inconsistent state because of the migrated volumes and refuse to start until the user reverts the VM spec to the old set of volumes
- Simply add the information to the VM status and let the user take action
With @awels we decided on the second option because the first is too restrictive. For example, the user could decide to instead perform an offline copy while the VM is off.
Yes, the users need to decide what to do about a crashed volume migration, since we aren't allowed to patch the VM spec with the old volumes.
Why can't the user rollback to the old volumes? And where is this enforced?
We have 2 options:
Detect on VM start that the VM is in an inconsistent state because of the migrated volumes and refuse to start until the user reverts the VM spec to the old set of volumes
Can you actually do this without adding something to the VM status?
Simply add the information to the VM status and let the user take action
With @awels we decided on the second option because the first is too restrictive. For example, the user could decide to instead perform an offline copy while the VM is off.
What happens if the user does not take action and then attempts to start the VM? Is it possible that the VM will start but be missing data? Also won't the VM status.migratedVolumes field be cleared when the VM is restarted regardless of whether the VM actually starts or not? How will the user know that the VM is in a funky state because of an aborted migration?
To me, the first option makes more sense. If a user is going to go through the trouble of manually doing offline copies, they can also do something like update the VM spec by removing spec.UpdateVolumeStrategy.
Yes, the users need to decide what to do about a crashed volume migration, since we aren't allowed to patch the VM spec with the old volumes.
Why can't the user rollback to the old volumes? And where is this enforced?
Users should roll back to the old volumes, but we don't want to enforce it. What if the user decides to do a CSI clone while the VM is offline?
We have 2 options:
Detect on VM start that the VM is in an inconsistent state because of the migrated volumes and refuse to start until the user reverts the VM spec to the old set of volumes
Can you actually do this without adding something to the VM status?
I don't see how since the VMI is gone
Simply add the information to the VM status and let the user take action
With @awels we decided on the second option because the first is too restrictive. For example, the user could decide to instead perform an offline copy while the VM is off.
What happens if the user does not take action and then attempts to start the VM? Is it possible that the VM will start but be missing data?
Yes, this is possible; for example, the VM might not be bootable.
Also won't the VM status.migratedVolumes field be cleared when the VM is restarted regardless of whether the VM actually starts or not? How will the user know that the VM is in a funky state because of an aborted migration?
The migratedVolumes aren't cleared on the VM; they will stay there until the next volume migration update.
To me, the first option makes more sense. If a user is going to go through the trouble of manually doing offline copies, they can also do something like update the VM spec by removing spec.UpdateVolumeStrategy.
I think both options are doable, but the second is easier to implement. @awels WDYT?
Yes, the users need to decide what to do about a crashed volume migration, since we aren't allowed to patch the VM spec with the old volumes.
Why can't the user rollback to the old volumes? And where is this enforced?
Users should roll back to the old volumes, but we don't want to enforce it. What if the user decides to do a CSI clone while the VM is offline?
Ah okay, I think I was confused: "we" in this case refers to virt-controller, not users
We have 2 options:
Detect on VM start that the VM is in an inconsistent state because of the migrated volumes and refuse to start until the user reverts the VM spec to the old set of volumes
Can you actually do this without adding something to the VM status?
I don't see how since the VMI is gone
Just wanted to confirm that this change (add fields to status) is necessary regardless of which option we choose
Simply add the information to the VM status and let the user take action
With @awels we decided on the second option because the first is too restrictive. For example, the user could decide to instead perform an offline copy while the VM is off.
What happens if the user does not take action and then attempts to start the VM? Is it possible that the VM will start but be missing data?
Yes, this is possible; for example, the VM might not be bootable.
Also won't the VM status.migratedVolumes field be cleared when the VM is restarted regardless of whether the VM actually starts or not? How will the user know that the VM is in a funky state because of an aborted migration?
The migratedVolumes aren't cleared on the VM; they will stay there until the next volume migration update.
Okay, I assumed GenerateMigratedVolumes gets called whenever the VM reconciles and won't have "old" volume info
To me, the first option makes more sense. If a user is going to go through the trouble of manually doing offline copies, they can also do something like update the VM spec by removing spec.UpdateVolumeStrategy.
I think both options are doable, but the second is easier to implement. @awels WDYT?
Just to add my 2 cents to this conversation: we want to store the names of the old volumes somewhere in a persistent fashion. The VMI is not suitable for this since it can disappear when the VM is stopped for whatever reason. The user may or may not know what the original values are, so having it stored somewhere is helpful.
In my migration controller I have the ability to keep track of the volume names, but I would prefer to read it from the real source, the VM. I feel option 2 is more flexible for the user, as at this point the VM is probably down and doing a different kind of copy might be preferable.
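A minimal sketch, assuming hypothetical types rather than the real KubeVirt client API, of what reading this from the VM status could look like for such an external migration controller:

```go
package main

import "fmt"

// Hypothetical, trimmed-down view of the VM status fields discussed in this thread.
type MigratedVolume struct {
	VolumeName       string
	SourceClaim      string
	DestinationClaim string
}

type VirtualMachineStatus struct {
	// MigratedVolumes is assumed to persist across VMI deletion and VM stop.
	MigratedVolumes []MigratedVolume
}

// reportMigratedVolumes lets external tooling show the recorded
// source/destination pairs instead of tracking the old volume names itself.
func reportMigratedVolumes(status VirtualMachineStatus) {
	if len(status.MigratedVolumes) == 0 {
		fmt.Println("no volume migration recorded on the VM")
		return
	}
	for _, mv := range status.MigratedVolumes {
		fmt.Printf("volume %q migrated: %s -> %s\n", mv.VolumeName, mv.SourceClaim, mv.DestinationClaim)
	}
}

func main() {
	reportMigratedVolumes(VirtualMachineStatus{
		MigratedVolumes: []MigratedVolume{
			{VolumeName: "rootdisk", SourceClaim: "vm-root-old", DestinationClaim: "vm-root-new"},
		},
	})
}
```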
Yes, the users need to decide what to do about a crashed volume migration, since we aren't allowed to patch the VM spec with the old volumes.
Why can't the user rollback to the old volumes? And where is this enforced?
Users should roll back to the old volumes, but we don't want to enforce it. What if the user decides to do a CSI clone while the VM is offline?
Ah okay, I think I was confused: "we" in this case refers to virt-controller, not users
We have 2 options:
Detect on VM start that the VM is in an inconsistent state because of the migrated volumes and refuse to start until the user reverts the VM spec to the old set of volumes
Can you actually do this without adding something to the VM status?
I don't see how since the VMI is gone
Just wanted to confirm that this change (add fields to status) is necessary regardless of which option we choose
Yes, it is required since it is the only object which persists.
Simply add the information to the VM status and let the user take action
With @awels we decided on the second option because the first is too restrictive. For example, the user could decide to instead perform an offline copy while the VM is off.
What happens if the user does not take action and then attempts to start the VM? Is it possible that the VM will start but be missing data?
Yes, this is possible; for example, the VM might not be bootable.
Also won't the VM status.migratedVolumes field be cleared when the VM is restarted regardless of whether the VM actually starts or not? How will the user know that the VM is in a funky state because of an aborted migration?
The migratedVolumes aren't cleared on the VM; they will stay there until the next volume migration update.
Okay, I assumed GenerateMigratedVolumes gets called whenever the VM reconciles and won't have "old" volume info
It will be updated on the next volume update with the migration strategy. In the meantime, you can still see the original and destination volumes. The main reason for keeping this is that, if there are any issues with the copy, like a non-bootable device, you can still revert to the previous volumes because the information is available. We cannot know from KubeVirt whether the VM boots correctly without checking inside the guest, for example by pinging the guest agent.
This is good, I just wanted to make sure this info doesn't get lost right away
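As a sketch of the manual recovery the recorded source/destination pairs enable, assuming hypothetical types and a hypothetical helper rather than actual KubeVirt code, a user or tool could compute the reverted claim names like this:

```go
package main

import "fmt"

// Hypothetical recorded pair, mirroring the persisted state discussed above.
type MigratedVolume struct {
	VolumeName       string
	SourceClaim      string
	DestinationClaim string
}

// revertedClaims maps each volume that still points at the (possibly broken)
// destination claim back to its original source claim, so the user can patch
// the VM spec back before starting the VM again.
func revertedClaims(current map[string]string, migrated []MigratedVolume) map[string]string {
	reverted := make(map[string]string, len(current))
	for vol, claim := range current {
		reverted[vol] = claim
	}
	for _, mv := range migrated {
		if reverted[mv.VolumeName] == mv.DestinationClaim {
			reverted[mv.VolumeName] = mv.SourceClaim
		}
	}
	return reverted
}

func main() {
	current := map[string]string{"rootdisk": "vm-root-new"}
	migrated := []MigratedVolume{{VolumeName: "rootdisk", SourceClaim: "vm-root-old", DestinationClaim: "vm-root-new"}}
	fmt.Println(revertedClaims(current, migrated)) // map[rootdisk:vm-root-old]
}
```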
What happens to this new field when a migration completes successfully (happy path)? Does it get cleared?
It is cleared on the VMI but not on the VM. On the next volume update, the migrated volumes list will be updated with the new change, but until then we keep track that there has been a volume migration, together with the source/destination pairs.
EDIT: no difference between the happy path and a failure
@mhenriks as we discussed offline, I have encapsulated the migrated volumes in a new struct in the VM status. Additionally, I added a bool to signal whether the migration was successful or not. Unfortunately, this also required adding a new field in the VMI status. WDYT?
/test pull-kubevirt-e2e-k8s-1.30-sig-storage
@vladikr would you mind taking a look at this PR when you have a bit of time?
I think this looks good. Just to be clear on semantics: vm.status.volumeMigration will always represent the status of the current or last migration. If a user wanted to know whether a migration is currently in progress, they would have to check the VMI. Or is there a VM condition?
When the VolumeChange condition is set to true, the migration is still in progress. This condition is set on the VMI, but it is synchronized to the VM as well, so it should be enough to look at the VM object. I still want to add a bool/condition in the volume migration state to indicate whether the migration listed there finished successfully or not.
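A minimal sketch of that check against the VM object alone, assuming the condition type string matches the VolumeChange name used above; names and types here are placeholders, not the KubeVirt API:

```go
package main

import "fmt"

// Hypothetical condition shape, mirroring Kubernetes-style conditions.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// volumeMigrationInProgress returns true when the VolumeChange condition,
// synchronized from the VMI to the VM, is set to True.
func volumeMigrationInProgress(conditions []Condition) bool {
	for _, c := range conditions {
		if c.Type == "VolumeChange" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	conds := []Condition{{Type: "VolumeChange", Status: "True"}}
	fmt.Println(volumeMigrationInProgress(conds)) // true while the copy is still running
}
```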
@mhenriks I added the conditions instead of the bool, could you please take another look? The commit https://github.com/kubevirt/kubevirt/pull/12355/commits/d9c88cade1a0d947b08ee1110e28445625c5cae1 describes how the conditions are set
@vladikr @mhenriks after the discussion from yesterday, do you mind taking a look to check whether you agree on the API and volume migration state in f1aed242b883f5d400e420e4a098b5e5d3f12296?
@vladikr @mhenriks I encapsulated the volume migration state in the volume update state. I'll ask tomorrow in the sig-api if the change is fine for them.
@mhenriks @vladikr are you happy with the current changes?
@mhenriks PTAL
@vladikr @mhenriks PTAL, I have updated the PR with the new API
Thinking about this... Shouldn't we also prevent the VM from booting if ManualRecoveryRequired is set?
If we do that, then the user will have to unset that bit when they recover the volumes, and that could be difficult since the field is in the status.
@vladikr @mhenriks we can think about introducing some restriction on booting. However, I really would like to merge this if we agree on the API. I think we can improve KubeVirt's behavior later.
The API looks fine to me now. I'm only worried about the functionality. I don't believe users will look at the VM status, and they will run into an unbootable state.
You are right. However, please consider that we don't expect this feature to be used directly by regular users but rather by overlaying tooling (migration plan tools or a UI), which could check for this status and advertise the mismatch in a more user-friendly way.
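For example, such tooling might gate its own start action on the persisted state along these lines; the ManualRecoveryRequired field name is taken from the discussion above and the types are assumptions, not the final API:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical slice of the VM status relevant to this discussion.
type VolumeMigrationState struct {
	ManualRecoveryRequired bool
}

// checkBeforeStart is what overlaying tooling could do: refuse (or warn)
// before starting a VM whose last volume migration needs manual recovery.
func checkBeforeStart(state *VolumeMigrationState) error {
	if state != nil && state.ManualRecoveryRequired {
		return errors.New("volume migration needs manual recovery: revert the VM spec volumes or finish the copy offline before starting")
	}
	return nil
}

func main() {
	if err := checkBeforeStart(&VolumeMigrationState{ManualRecoveryRequired: true}); err != nil {
		fmt.Println("refusing to start:", err)
	}
}
```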
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mhenriks
The API looks fine to me now. I'm only worried about the functionality. I don't believe users will look at the VM status, and they will run into an unbootable state.
You are right. However, please consider that we don't expect this feature to be used directly by regular users but rather by overlaying tooling (migration plan tools or a UI), which could check for this status and advertise the mismatch in a more user-friendly way.
Yes, that's correct, but still, we allow VM users to perform this operation, therefore each layer should be feature complete. We shouldn't allow users to run into such bad states. It's fine to follow up on this in another PR, but I think it should be in the same release.