instancetype: Support Live Updates
What this PR does
https://github.com/kubevirt/community/pull/268
Before this PR:
Any attempt to change an instance type of a running VM with LiveUpdates
would be rejected.
After this PR:
Attempts to change the instance type of a running VM with LiveUpdates now result in the same overall behaviour as changing the equivalent aspects of a vanilla VM.
The demo below enables the LiveUpdate rollout strategy, switches a running VM from u1.nano to u1.micro, and waits for the additional guest memory to become available after a live migration:
$ ./cluster-up/kubectl.sh apply -k https://github.com/kubevirt/common-instancetypes
$ ./cluster-up/kubectl.sh patch kv/kubevirt -n kubevirt --type merge -p '{"spec":{"workloadUpdateStrategy":{"workloadUpdateMethods":["LiveMigrate"]},"configuration":{"developerConfiguration":{"featureGates": ["VMLiveUpdateFeatures"]},"vmRolloutStrategy": "LiveUpdate", "liveUpdateConfiguration": {"maxGuest": "8Gi"}}}}'
$ ./cluster-up/kubectl.sh apply -f - <<EOF
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: alpine
spec:
  instancetype:
    name: u1.nano
  runStrategy: Always
  template:
    metadata:
      creationTimestamp: null
    spec:
      domain:
        devices:
          interfaces:
          - masquerade: {}
            name: default
        machine:
        resources: {}
      networks:
      - name: default
        pod: {}
      volumes:
      - containerDisk:
          image: registry:5000/kubevirt/alpine-container-disk-demo:devel
        name: alpine
      - cloudInitNoCloud:
          userData: |
            #!/bin/sh
            echo 'printed from cloud-init userdata'
        name: cloudinitdisk
EOF
$ ./cluster-up/kubectl.sh wait vms/alpine --for=condition=Ready
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.spec.domain.memory.guest}'="512Mi"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestAtBoot}'="512Mi"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestRequested}'="512Mi"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestCurrent}'="512Mi"
$ ./cluster-up/kubectl.sh patch vms/alpine --type merge -p '{"spec":{"instancetype":{"name":"u1.micro","revisionName":""}}}'
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=condition=HotMemoryChange
$ while ! ./cluster-up/kubectl.sh get virtualmachineinstancemigrations -l kubevirt.io/vmi-name=alpine; do
sleep 1
done
$ ./cluster-up/kubectl.sh wait virtualmachineinstancemigrations -l kubevirt.io/vmi-name=alpine --for=jsonpath='{.status.phase}'="Succeeded"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestAtBoot}'="512Mi"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestRequested}'="1Gi"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestCurrent}'="1Gi"
Fixes #
Why we need it and why it was done in this way
The following tradeoffs were made:
The following alternatives were considered:
Links to places where the discussion took place:
Special notes for your reviewer
Checklist
This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR. Approvers are expected to review this list.
- [x] Design: A design document was considered and is present (link) or not required
- [x] PR: The PR description is expressive enough and will help future contributors
- [x] Code: Write code that humans can understand and Keep it simple
- [ ] Refactor: You have left the code cleaner than you found it (Boy Scout Rule)
- [ ] Upgrade: Impact of this change on upgrade flows was considered and addressed if required
- [ ] Testing: New code requires new unit tests. New features and bug fixes require at least one e2e test
- [ ] Documentation: A user-guide update was considered and is present (link) or not required. You want a user-guide update if it's a user facing feature / API change.
- [ ] Community: Announcement to kubevirt-dev was considered
Release note
`LiveUpdates` of VMs using instance types are now supported with the same caveats as when making changes to a vanilla VM.
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
/area instancetype
Dumping my progress ahead of PTO: after removing a validation webhook check I can now get as far as the HotVCPUChange
condition appearing against the VM when switching instance types, but no migration is invoked. That might be an issue with my kubevirtci
based env. Anyway, I'm back from the 18th to continue hacking on this; comments are welcome in the meantime, but I will be rewriting/rebasing this on https://github.com/kubevirt/kubevirt/pull/11457 when I get back.
$ kustomize build https://github.com/kubevirt/common-instancetypes | ./cluster-up/kubectl.sh apply -f -
[..]
$ ./cluster-up/kubectl.sh apply -f - <<EOF
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  creationTimestamp: null
  name: fedora
spec:
  instancetype:
    name: u1.medium
  preference:
    name: fedora
  runStrategy: Always
  template:
    metadata:
      creationTimestamp: null
    spec:
      domain:
        devices:
          interfaces:
          - name: internal
            masquerade: {}
        resources: {}
      terminationGracePeriodSeconds: 180
      volumes:
      - containerDisk:
          image: quay.io/containerdisks/fedora:39
        name: fedora
status: {}
EOF
virtualmachine.kubevirt.io/fedora created
$ ./cluster-up/kubectl.sh get vmis/fedora -o json | jq .spec.domain.cpu.sockets
1
$ ./cluster-up/kubectl.sh get vmis/fedora -o json | jq .spec.domain.memory.guest
"4Gi"
$ ./cluster-up/kubectl.sh patch vms/fedora --type merge -p '{"spec":{"instancetype":{"revisionName":"", "name":"u1.large"}}}'
virtualmachine.kubevirt.io/fedora patched
$ ./cluster-up/kubectl.sh get vmis/fedora -o json | jq .spec.domain.memory.guest
"8Gi"
$ ./cluster-up/kubectl.sh get vmis/fedora -o json | jq .spec.domain.cpu.sockets
2
$ ./cluster-up/kubectl.sh get vmis/fedora -o json | jq .status.conditions
[
{
"lastProbeTime": null,
"lastTransitionTime": "2024-03-07T18:03:17Z",
"status": "True",
"type": "Ready"
},
{
"lastProbeTime": null,
"lastTransitionTime": null,
"status": "True",
"type": "LiveMigratable"
},
{
"lastProbeTime": "2024-03-07T18:03:27Z",
"lastTransitionTime": null,
"status": "True",
"type": "AgentConnected"
},
{
"lastProbeTime": null,
"lastTransitionTime": null,
"status": "True",
"type": "HotVCPUChange"
}
]
/hold
Just running this through CI before cleaning up the description and commit messages.
/hold cancel
/retest-required
This is exactly what I was looking for; hopefully the PR will get merged in the coming days!
/hold
Bad rebase, I'll fix now.
/approve
@lyarwood there's a commit marked WIP, probably needs fixing
@enp0s3 you might be interested in the couple of commits that move logic from the VMI controller to the webhook. It looks fine to me, since the code in question is all about applying default values, which is generally considered a good use of webhooks.
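For anyone skimming the thread, here's a rough sketch of the defaulting pattern that comment refers to. The struct, function, and default kind values below are illustrative assumptions, not the actual webhook code:

```go
package main

import "fmt"

// vmSpecMatchers is a simplified stand-in for the instancetype/preference
// matchers on the VM spec; the real webhook works on kubevirt.io/api types.
type vmSpecMatchers struct {
	InstancetypeKind string
	PreferenceKind   string
}

// applyDefaults illustrates why defaulting suits a mutating webhook: it is
// an idempotent transformation of the incoming object that can run once at
// admission time rather than on every pass of the VMI controller.
func applyDefaults(m *vmSpecMatchers) {
	// Illustrative default values only; the real defaults come from the
	// KubeVirt API defaulting code.
	if m.InstancetypeKind == "" {
		m.InstancetypeKind = "VirtualMachineClusterInstancetype"
	}
	if m.PreferenceKind == "" {
		m.PreferenceKind = "VirtualMachineClusterPreference"
	}
}

func main() {
	m := vmSpecMatchers{}
	applyDefaults(&m)
	fmt.Printf("%+v\n", m) // fields populated once, at admission
}
```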
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jean-edouard
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [jean-edouard]
Approvers can indicate their approval by writing /approve
in a comment
Approvers can cancel approval by writing /approve cancel
in a comment
> @lyarwood there's a commit marked WIP, probably needs fixing
@jean-edouard thanks and yeah see https://github.com/kubevirt/kubevirt/pull/11873 now that I've reproduced it locally against HEAD, hopefully this isn't caused by something I have cached locally.
Leaving the /hold in place while we wait on https://github.com/kubevirt/kubevirt/pull/11873, which addresses a failure others are now seeing in CI. I'll rebase one final time when that lands.
/hold cancel
https://github.com/kubevirt/kubevirt/pull/11873 has landed and I've rebased.
/hold
While I look into the following failure, not sure if it's related to the changes in this PR at first glance:
Tests Suite: [sig-compute][Serial]CPU Hotplug A VM with cpu.maxSockets set higher than cpu.sockets [test_id:10822] should successfully plug guaranteed vCPUs 2m3s
{ Failure tests/hotplug/cpu.go:189
Timed out after 60.002s.
Expected
<[]interface {} | len:5, cap:5>: [
<map[string]interface {} | len:4>{
"lastProbeTime": nil,
"lastTransitionTime": <string>"2024-05-10T09:35:08Z",
"type": <string>"Ready",
"status": <string>"True",
},
<map[string]interface {} | len:4>{
"type": <string>"LiveMigratable",
"status": <string>"True",
"lastProbeTime": nil,
"lastTransitionTime": nil,
},
<map[string]interface {} | len:4>{
"lastProbeTime": nil,
"lastTransitionTime": nil,
"type": <string>"HotVCPUChange",
"status": <string>"True",
},
<map[string]interface {} | len:6>{
"message": <string>"server error. command SyncVMI failed: \"not enough exclusive threads provided, could not fit 2 core(s)\"",
"type": <string>"Synchronized",
"status": <string>"False",
"lastProbeTime": nil,
"lastTransitionTime": <string>"2024-05-10T09:35:08Z",
"reason": <string>"Synchronizing with the Domain failed.",
},
<map[string]interface {} | len:4>{
"type": <string>"AgentConnected",
"status": <string>"True",
"lastProbeTime": <string>"2024-05-10T09:35:13Z",
"lastTransitionTime": nil,
},
]
to not find condition of type 'HotVCPUChange' with status 'True'
tests/hotplug/cpu.go:245}
/hold
> While I look into the following failure, not sure if it's related to the changes in this PR at first glance:
Status update is currently broken, @xpivarc is working on a fix.
/hold cancel
/retest-required
/hold
will need to rebase on https://github.com/kubevirt/kubevirt/pull/11836
/hold cancel
/cc @acardace
Rebased on #11655. Antonio, if you still have the context loaded, feel free to review this series as it touches the same area as your PR.
@xpivarc I can't reproduce this locally, smells like another race?
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/11455/pull-kubevirt-unit-test-arm64/1790046996891963392
• [FAILED] [0.114 seconds]
VirtualMachine One valid VirtualMachine controller given The RestartRequired condition [It] should appear when VM doesn't specify maxSockets and sockets go above cluster-wide maxSockets
pkg/virt-controller/watch/vm_test.go:6275
Timeline >>
STEP: Creating a VM with CPU sockets set to the cluster maxiumum @ 05/13/24 15:55:43.583
STEP: Creating a VMI with cluster max @ 05/13/24 15:55:43.595
STEP: Bumping the VM sockets above the cluster maximum @ 05/13/24 15:55:43.696
STEP: Executing the controller expecting the RestartRequired condition to appear @ 05/13/24 15:55:43.696
{"component":"","level":"info","msg":"Updating cluster config from KubeVirt to resource version '985bxnx4xb'","pos":"configuration.go:362","timestamp":"2024-05-13T15:55:43.696413Z"}
[FAILED] in [It] - pkg/virt-controller/watch/vm_test.go:6298 @ 05/13/24 15:55:43.696
<< Timeline
[FAILED] restart required
Expected
<[]v1.VirtualMachineCondition | len:2, cap:2>: [
{
Type: "Ready",
Status: "False",
LastProbeTime: {
Time: 2024-05-13T15:55:43.696507153Z,
},
LastTransitionTime: {
Time: 2024-05-13T15:55:43.696507153Z,
},
Reason: "VMIConditionMissing",
Message: "VMI is missing the Ready condition",
},
{
Type: "Initialized",
Status: "True",
LastProbeTime: {
Time: 0001-01-01T00:00:00Z,
},
LastTransitionTime: {
Time: 0001-01-01T00:00:00Z,
},
Reason: "",
Message: "",
},
]
to contain element matching
<*gstruct.FieldsMatcher | 0x4002c2a480>: {
Fields: {
"Type": <*matchers.EqualMatcher | 0x4001926ea0>{
Expected: <v1.VirtualMachineConditionType>"RestartRequired",
},
"Status": <*matchers.EqualMatcher | 0x4001926ec0>{
Expected: <v1.ConditionStatus>"True",
},
},
IgnoreExtras: true,
IgnoreMissing: false,
failures: [
<*errors.NestedError | 0x4002882340>{
Path: ".Type",
Err: <*errors.errorString | 0x40019272e0>{
s: "Expected\n <v1.VirtualMachineConditionType>: Initialized\nto equal\n <v1.VirtualMachineConditionType>: RestartRequired",
},
},
],
}
In [It] at: pkg/virt-controller/watch/vm_test.go:6298 @ 05/13/24 15:55:43.696
/hold
Required labels detected, running phase 2 presubmits:
/test pull-kubevirt-e2e-windows2016
/test pull-kubevirt-e2e-kind-sriov
/test pull-kubevirt-e2e-k8s-1.30-ipv6-sig-network
/test pull-kubevirt-e2e-k8s-1.28-sig-network
/test pull-kubevirt-e2e-k8s-1.28-sig-storage
/test pull-kubevirt-e2e-k8s-1.28-sig-compute
/test pull-kubevirt-e2e-k8s-1.28-sig-operator
/test pull-kubevirt-e2e-k8s-1.29-sig-network
/test pull-kubevirt-e2e-k8s-1.29-sig-storage
/test pull-kubevirt-e2e-k8s-1.29-sig-compute
/test pull-kubevirt-e2e-k8s-1.29-sig-operator
New changes are detected. LGTM label has been removed.
@lyarwood thanks for the rebase! Is the hold still necessary?
> @lyarwood thanks for the rebase! Is the hold still necessary?
Yeah I wanted to see if we still needed https://github.com/kubevirt/kubevirt/pull/11904 or https://github.com/kubevirt/kubevirt/pull/11836 and to work with folks on these fixes if we do.
@lyarwood: The following tests failed, say /retest
to rerun all failed tests or /retest-required
to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command |
---|---|---|---|---|
pull-kubevirt-e2e-k8s-1.27-sig-performance | 991035b7ee4e446be50773e92aeaa5b5bcd9824d | link | true | /test pull-kubevirt-e2e-k8s-1.27-sig-performance |
pull-kubevirt-e2e-kind-1.27-vgpu | a7c77b3f9695c04e39a645778aca7a33c00651a8 | link | false | /test pull-kubevirt-e2e-kind-1.27-vgpu |
pull-kubevirt-e2e-k8s-1.29-sig-storage | b0c88438f2f7b633ca3cdae8eee9d68df6ce203e | link | true | /test pull-kubevirt-e2e-k8s-1.29-sig-storage |
pull-kubevirt-e2e-k8s-1.29-sig-compute | b0c88438f2f7b633ca3cdae8eee9d68df6ce203e | link | true | /test pull-kubevirt-e2e-k8s-1.29-sig-compute |
pull-kubevirt-e2e-k8s-1.29-sig-operator | b0c88438f2f7b633ca3cdae8eee9d68df6ce203e | link | true | /test pull-kubevirt-e2e-k8s-1.29-sig-operator |
pull-kubevirt-check-tests-for-flakes | ab14750147e4c8a1cee368d8c7f7c53f46a18ca8 | link | false | /test pull-kubevirt-check-tests-for-flakes |
pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations | ab14750147e4c8a1cee368d8c7f7c53f46a18ca8 | link | true | /test pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations |
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@lyarwood TODO: create a docs update PR once this one and the DP are merged
The tests fail because of a change introduced in #11962:
- The Instancetype matcher is changed here (this PR): https://github.com/kubevirt/kubevirt/blob/736662f1b7734c00006bb7da92e28bcc8e503508/tests/hotplug/instancetype.go#L139
- This always adds a RestartRequired condition if a matcher was changed (added in #11962) https://github.com/kubevirt/kubevirt/blob/736662f1b7734c00006bb7da92e28bcc8e503508/pkg/virt-controller/watch/vm.go#L3030
IIUC the block just above that check (linked below) already ensures there are only hotpluggable changes, so we should be fine dropping the check introduced in #11962 again:
https://github.com/kubevirt/kubevirt/blob/736662f1b7734c00006bb7da92e28bcc8e503508/pkg/virt-controller/watch/vm.go#L3019-L3028
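To make that argument concrete, here's a minimal sketch of the condition logic being discussed, with hypothetical types and a hypothetical helper rather than the actual vm.go code:

```go
package main

import "fmt"

// change is a simplified stand-in for a pending difference between the VM
// template and the running VMI; the real code diffs the API objects.
type change struct {
	field        string
	hotPluggable bool
}

// needsRestart captures the intent described above: RestartRequired should
// only be set when at least one pending change cannot be applied live. An
// unconditional "the instancetype matcher changed, so restart" check (as
// added in #11962) short-circuits this even when the resulting changes are
// all hot-pluggable CPU/memory updates.
func needsRestart(changes []change) bool {
	for _, c := range changes {
		if !c.hotPluggable {
			return true
		}
	}
	return false
}

func main() {
	// An instance type switch that only bumps sockets and guest memory:
	fmt.Println(needsRestart([]change{
		{field: "spec.domain.cpu.sockets", hotPluggable: true},
		{field: "spec.domain.memory.guest", hotPluggable: true},
	})) // false: no RestartRequired, the change can be hotplugged
}
```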
Updated to fix a nil-pointer access in handleMemoryHotplugRequest.
See logs for panic: https://storage.googleapis.com/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/11455/pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations/1795451827265736704/artifacts/k8s-reporter/pods/1_kubevirt_virt-controller-69b7c8c4b8-wpxbx-virt-controller.log
Missed another nil-pointer access in handleMemoryHotplugRequest, updated again.
See https://storage.googleapis.com/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/11455/pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations/1795490607427227648/artifacts/k8s-reporter/pods/1_kubevirt_virt-controller-5c6f74b5d5-6ht4x-virt-controller_previous.log
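For reference, the class of bug fixed here is a missing nil guard before dereferencing optional fields. A minimal sketch of the defensive pattern, using hypothetical types rather than the actual handleMemoryHotplugRequest code:

```go
package main

import "fmt"

// Simplified stand-ins for the optional VMI memory status fields; in the
// real kubevirt.io/api types these are all pointers.
type memoryStatus struct {
	GuestCurrent *string
}

type vmiStatus struct {
	Memory *memoryStatus
}

// guestCurrent checks every optional pointer on the path before
// dereferencing instead of assuming the status has been populated;
// skipping one of these checks is the kind of nil-pointer access that
// crashed virt-controller in the logs linked above.
func guestCurrent(status *vmiStatus) (string, bool) {
	if status == nil || status.Memory == nil || status.Memory.GuestCurrent == nil {
		return "", false
	}
	return *status.Memory.GuestCurrent, true
}

func main() {
	var status *vmiStatus // status not populated yet
	if v, ok := guestCurrent(status); ok {
		fmt.Println("current guest memory:", v)
	} else {
		fmt.Println("guest memory not reported yet")
	}
}
```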
/retest-required
/test pull-kubevirt-e2e-k8s-1.30-sig-compute