
instancetype: Support Live Updates

Open lyarwood opened this pull request 11 months ago • 27 comments

What this PR does

https://github.com/kubevirt/community/pull/268

Before this PR:

Any attempt to change an instance type of a running VM with LiveUpdates would be rejected.

After this PR:

Attempts to change the instance type of a running VM with LiveUpdates now result in the same overall behaviour as making the equivalent changes directly to a vanilla VM.

$ ./cluster-up/kubectl.sh apply -k https://github.com/kubevirt/common-instancetypes

$ ./cluster-up/kubectl.sh patch kv/kubevirt -n kubevirt --type merge -p '{"spec":{"workloadUpdateStrategy":{"workloadUpdateMethods":["LiveMigrate"]},"configuration":{"developerConfiguration":{"featureGates": ["VMLiveUpdateFeatures"]},"vmRolloutStrategy": "LiveUpdate", "liveUpdateConfiguration": {"maxGuest": "8Gi"}}}}'
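
For reference, a quick way to read the rollout strategy back after the patch (a minimal check, not part of the original flow; it should print LiveUpdate if the patch applied cleanly):

$ ./cluster-up/kubectl.sh get kv/kubevirt -n kubevirt -o jsonpath='{.spec.configuration.vmRolloutStrategy}'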

$ ./cluster-up/kubectl.sh apply -f - <<EOF
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: alpine
spec:
  instancetype:
   name: u1.nano
  runStrategy: Always
  template:
    metadata:
      creationTimestamp: null
    spec:
      domain:
        devices:
          interfaces:
          - masquerade: {}
            name: default
        machine:
        resources: {}
      networks:
      - name: default
        pod: {}
      volumes:
      - containerDisk:
          image: registry:5000/kubevirt/alpine-container-disk-demo:devel
        name: alpine
      - cloudInitNoCloud:
          userData: |
            #!/bin/sh
            echo 'printed from cloud-init userdata'
        name: cloudinitdisk
EOF

$ ./cluster-up/kubectl.sh wait vms/alpine --for=condition=Ready
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.spec.domain.memory.guest}'="512Mi"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestAtBoot}'="512Mi"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestRequested}'="512Mi"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestCurrent}'="512Mi"
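
The 512Mi value here comes from the u1.nano instance type itself. As an optional cross-check (assuming the common-instancetypes bundle installed u1.nano as a cluster-wide instance type), the definition can be read back directly:

$ ./cluster-up/kubectl.sh get virtualmachineclusterinstancetype u1.nano -o jsonpath='{.spec.memory.guest}'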

$ ./cluster-up/kubectl.sh patch vms/alpine --type merge -p '{"spec":{"instancetype":{"name":"u1.micro","revisionName":""}}}'

$ ./cluster-up/kubectl.sh wait vmis/alpine --for=condition=HotMemoryChange
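
At this point the VMI spec should already carry the new instance type values even though the live migration has not yet run; an optional check in the same style as above (assuming u1.micro resolves to 1Gi of guest memory, as the status checks below suggest):

$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.spec.domain.memory.guest}'="1Gi"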

$ while ! ./cluster-up/kubectl.sh get virtualmachineinstancemigrations -l kubevirt.io/vmi-name=alpine; do
  sleep 1
done 

$ ./cluster-up/kubectl.sh wait virtualmachineinstancemigrations -l kubevirt.io/vmi-name=alpine --for=jsonpath='{.status.phase}'="Succeeded"

$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestAtBoot}'="512Mi"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestRequested}'="1Gi"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestCurrent}'="1Gi"
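
Because the patch cleared spec.instancetype.revisionName, the controller is expected to capture the new instance type in a fresh ControllerRevision; an optional sketch of reading the repopulated reference back (field names as used in the patches above):

$ ./cluster-up/kubectl.sh get vms/alpine -o jsonpath='{.spec.instancetype.revisionName}'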

Fixes #

Why we need it and why it was done in this way

The following tradeoffs were made:

The following alternatives were considered:

Links to places where the discussion took place:

Special notes for your reviewer

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR. Approvers are expected to review this list.

Release note

`LiveUpdates` of VMs using instance types are now supported with the same caveats as when making changes to a vanilla VM.

lyarwood avatar Mar 06 '24 16:03 lyarwood

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

kubevirt-bot avatar Mar 06 '24 16:03 kubevirt-bot

/area instancetype

lyarwood avatar Mar 06 '24 18:03 lyarwood

Dumping my progress ahead of PTO. After removing a validation webhook check I can now get as far as the HotVCPUChange condition appearing against the VM when switching instance types, but no migration is invoked. That might be an issue with my kubevirtci-based env. Anyway, I'm back from the 18th to keep hacking on this; comments welcome in the meantime, but I will be rewriting/rebasing this on https://github.com/kubevirt/kubevirt/pull/11457 when I get back.

$ kustomize build https://github.com/kubevirt/common-instancetypes | ./cluster-up/kubectl.sh apply -f -
[..]
$ ./cluster-up/kubectl.sh apply -f - <<EOF
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  creationTimestamp: null
  name: fedora
spec:
  instancetype:
    name: u1.medium
  preference:
    name: fedora
  runStrategy: Always
  template:
    metadata:
      creationTimestamp: null
    spec:
      domain:
        devices:
          interfaces:
          - name: internal
            masquerade: {}
        resources: {}
      terminationGracePeriodSeconds: 180
      volumes:
      - containerDisk:
          image: quay.io/containerdisks/fedora:39
        name: fedora
status: {}
EOF
virtualmachine.kubevirt.io/fedora created
$ ./cluster-up/kubectl.sh get vmis/fedora -o json | jq .spec.domain.cpu.sockets
1
$ ./cluster-up/kubectl.sh get vmis/fedora -o json | jq .spec.domain.memory.guest
"4Gi"
$ ./cluster-up/kubectl.sh patch vms/fedora --type merge -p '{"spec":{"instancetype":{"revisionName":"", "name":"u1.large"}}}'
virtualmachine.kubevirt.io/fedora patched
$ ./cluster-up/kubectl.sh get vmis/fedora -o json | jq .spec.domain.memory.guest
"8Gi"
$ ./cluster-up/kubectl.sh get vmis/fedora -o json | jq .spec.domain.cpu.sockets
2
$ ./cluster-up/kubectl.sh get vmis/fedora -o json | jq .status.conditions
[
  {
    "lastProbeTime": null,
    "lastTransitionTime": "2024-03-07T18:03:17Z",
    "status": "True",
    "type": "Ready"
  },
  {
    "lastProbeTime": null,
    "lastTransitionTime": null,
    "status": "True",
    "type": "LiveMigratable"
  },
  {
    "lastProbeTime": "2024-03-07T18:03:27Z",
    "lastTransitionTime": null,
    "status": "True",
    "type": "AgentConnected"
  },
  {
    "lastProbeTime": null,
    "lastTransitionTime": null,
    "status": "True",
    "type": "HotVCPUChange"
  }
]
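
For completeness, the claim that no migration is invoked can be checked directly by listing migration objects for the VMI (same label selector as used in the PR description above; an empty result here illustrates the broken behaviour):

$ ./cluster-up/kubectl.sh get virtualmachineinstancemigrations -l kubevirt.io/vmi-name=fedora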

lyarwood avatar Mar 07 '24 18:03 lyarwood

/hold

Just running this through CI before cleaning up the description and commit messages.

lyarwood avatar Apr 29 '24 16:04 lyarwood

/hold cancel

lyarwood avatar Apr 30 '24 13:04 lyarwood

/retest-required

lyarwood avatar May 01 '24 08:05 lyarwood

Dumping my progress ahead of PTO. After removing a validation webhook check I can now get as far as the HotVCPUChange condition appearing against the VM when switching instance types, but no migration is invoked. That might be an issue with my kubevirtci-based env. Anyway, I'm back from the 18th to keep hacking on this; comments welcome in the meantime, but I will be rewriting/rebasing this on #11457 when I get back.

This is exactly what I was looking for. Hopefully the PR will get merged in the coming days!!

govindkailas avatar May 01 '24 16:05 govindkailas

/hold

Bad rebase, I'll fix now.

lyarwood avatar May 07 '24 12:05 lyarwood

/approve

@lyarwood there's a commit marked WIP, probably needs fixing

@enp0s3 you might be interested in the couple of commits that move logic from the VMI controller to the webhook. It looks fine to me, since the code in question is all about applying default values, which is generally considered a good use of webhooks.

jean-edouard avatar May 08 '24 13:05 jean-edouard

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jean-edouard

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

kubevirt-bot avatar May 08 '24 13:05 kubevirt-bot

@lyarwood there's a commit marked WIP, probably needs fixing

@jean-edouard thanks and yeah see https://github.com/kubevirt/kubevirt/pull/11873 now that I've reproduced it locally against HEAD, hopefully this isn't caused by something I have cached locally.

lyarwood avatar May 08 '24 13:05 lyarwood

Leaving the /hold in place while we wait on https://github.com/kubevirt/kubevirt/pull/11873 that others are now seeing in CI. I'll rebase one final time when that lands.

lyarwood avatar May 09 '24 09:05 lyarwood

/hold cancel

https://github.com/kubevirt/kubevirt/pull/11873 has landed and I've rebased.

lyarwood avatar May 10 '24 09:05 lyarwood

/hold

While I look into the following failure; at first glance I'm not sure whether it's related to the changes in this PR:

Tests Suite: [sig-compute][Serial]CPU Hotplug A VM with cpu.maxSockets set higher than cpu.sockets [test_id:10822] should successfully plug guaranteed vCPUs (2m3s)
{ Failure tests/hotplug/cpu.go:189
Timed out after 60.002s.
Expected
    <[]interface {} | len:5, cap:5>: [
        <map[string]interface {} | len:4>{
            "lastProbeTime": nil,
            "lastTransitionTime": <string>"2024-05-10T09:35:08Z",
            "type": <string>"Ready",
            "status": <string>"True",
        },
        <map[string]interface {} | len:4>{
            "type": <string>"LiveMigratable",
            "status": <string>"True",
            "lastProbeTime": nil,
            "lastTransitionTime": nil,
        },
        <map[string]interface {} | len:4>{
            "lastProbeTime": nil,
            "lastTransitionTime": nil,
            "type": <string>"HotVCPUChange",
            "status": <string>"True",
        },
        <map[string]interface {} | len:6>{
            "message": <string>"server error. command SyncVMI failed: \"not enough exclusive threads provided, could not fit 2 core(s)\"",
            "type": <string>"Synchronized",
            "status": <string>"False",
            "lastProbeTime": nil,
            "lastTransitionTime": <string>"2024-05-10T09:35:08Z",
            "reason": <string>"Synchronizing with the Domain failed.",
        },
        <map[string]interface {} | len:4>{
            "type": <string>"AgentConnected",
            "status": <string>"True",
            "lastProbeTime": <string>"2024-05-10T09:35:13Z",
            "lastTransitionTime": nil,
        },
    ]
to not find condition of type 'HotVCPUChange' with status 'True'
tests/hotplug/cpu.go:245}

lyarwood avatar May 10 '24 10:05 lyarwood

/hold

While I look into the following failure; at first glance I'm not sure whether it's related to the changes in this PR:

Status update is currently broken; @xpivarc is working on a fix.

jean-edouard avatar May 10 '24 11:05 jean-edouard

/hold cancel
/retest-required

lyarwood avatar May 10 '24 19:05 lyarwood

/hold

will need to rebase on https://github.com/kubevirt/kubevirt/pull/11836

lyarwood avatar May 13 '24 08:05 lyarwood

/hold cancel

lyarwood avatar May 13 '24 09:05 lyarwood

/cc @acardace

Rebased on #11655. Antonio, if you still have the context loaded, feel free to review this series as it's touching the same area as your PR.

lyarwood avatar May 13 '24 15:05 lyarwood

@xpivarc I can't reproduce this locally, smells like another race?

https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/11455/pull-kubevirt-unit-test-arm64/1790046996891963392

• [FAILED] [0.114 seconds]
VirtualMachine One valid VirtualMachine controller given The RestartRequired condition [It] should appear when VM doesn't specify maxSockets and sockets go above cluster-wide maxSockets
pkg/virt-controller/watch/vm_test.go:6275
  Timeline >>
  STEP: Creating a VM with CPU sockets set to the cluster maxiumum @ 05/13/24 15:55:43.583
  STEP: Creating a VMI with cluster max @ 05/13/24 15:55:43.595
  STEP: Bumping the VM sockets above the cluster maximum @ 05/13/24 15:55:43.696
  STEP: Executing the controller expecting the RestartRequired condition to appear @ 05/13/24 15:55:43.696
  {"component":"","level":"info","msg":"Updating cluster config from KubeVirt to resource version '985bxnx4xb'","pos":"configuration.go:362","timestamp":"2024-05-13T15:55:43.696413Z"}
  [FAILED] in [It] - pkg/virt-controller/watch/vm_test.go:6298 @ 05/13/24 15:55:43.696
  << Timeline

[FAILED] restart required
  Expected
      <[]v1.VirtualMachineCondition | len:2, cap:2>: [
          {
              Type: "Ready",
              Status: "False",
              LastProbeTime: {
                  Time: 2024-05-13T15:55:43.696507153Z,
              },
              LastTransitionTime: {
                  Time: 2024-05-13T15:55:43.696507153Z,
              },
              Reason: "VMIConditionMissing",
              Message: "VMI is missing the Ready condition",
          },
          {
              Type: "Initialized",
              Status: "True",
              LastProbeTime: {
                  Time: 0001-01-01T00:00:00Z,
              },
              LastTransitionTime: {
                  Time: 0001-01-01T00:00:00Z,
              },
              Reason: "",
              Message: "",
          },
      ]
  to contain element matching
      <*gstruct.FieldsMatcher | 0x4002c2a480>: {
          Fields: {
              "Type": <*matchers.EqualMatcher | 0x4001926ea0>{
                  Expected: <v1.VirtualMachineConditionType>"RestartRequired",
              },
              "Status": <*matchers.EqualMatcher | 0x4001926ec0>{
                  Expected: <v1.ConditionStatus>"True",
              },
          },
          IgnoreExtras: true,
          IgnoreMissing: false,
          failures: [
              <*errors.NestedError | 0x4002882340>{
                  Path: ".Type",
                  Err: <*errors.errorString | 0x40019272e0>{
                      s: "Expected\n    <v1.VirtualMachineConditionType>: Initialized\nto equal\n    <v1.VirtualMachineConditionType>: RestartRequired",
                  },
              },
          ],
      }
  In [It] at: pkg/virt-controller/watch/vm_test.go:6298 @ 05/13/24 15:55:43.696

lyarwood avatar May 13 '24 16:05 lyarwood

/hold

lyarwood avatar May 13 '24 16:05 lyarwood

Required labels detected, running phase 2 presubmits:
/test pull-kubevirt-e2e-windows2016
/test pull-kubevirt-e2e-kind-sriov
/test pull-kubevirt-e2e-k8s-1.30-ipv6-sig-network
/test pull-kubevirt-e2e-k8s-1.28-sig-network
/test pull-kubevirt-e2e-k8s-1.28-sig-storage
/test pull-kubevirt-e2e-k8s-1.28-sig-compute
/test pull-kubevirt-e2e-k8s-1.28-sig-operator
/test pull-kubevirt-e2e-k8s-1.29-sig-network
/test pull-kubevirt-e2e-k8s-1.29-sig-storage
/test pull-kubevirt-e2e-k8s-1.29-sig-compute
/test pull-kubevirt-e2e-k8s-1.29-sig-operator

kubevirt-commenter-bot avatar May 15 '24 12:05 kubevirt-commenter-bot

New changes are detected. LGTM label has been removed.

kubevirt-bot avatar May 20 '24 12:05 kubevirt-bot

@lyarwood thanks for the rebase! Is the hold still necessary?

jean-edouard avatar May 20 '24 12:05 jean-edouard

@lyarwood thanks for the rebase! Is the hold still necessary?

Yeah I wanted to see if we still needed https://github.com/kubevirt/kubevirt/pull/11904 or https://github.com/kubevirt/kubevirt/pull/11836 and to work with folks on these fixes if we do.

lyarwood avatar May 20 '24 12:05 lyarwood

@lyarwood: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubevirt-e2e-k8s-1.27-sig-performance 991035b7ee4e446be50773e92aeaa5b5bcd9824d link true /test pull-kubevirt-e2e-k8s-1.27-sig-performance
pull-kubevirt-e2e-kind-1.27-vgpu a7c77b3f9695c04e39a645778aca7a33c00651a8 link false /test pull-kubevirt-e2e-kind-1.27-vgpu
pull-kubevirt-e2e-k8s-1.29-sig-storage b0c88438f2f7b633ca3cdae8eee9d68df6ce203e link true /test pull-kubevirt-e2e-k8s-1.29-sig-storage
pull-kubevirt-e2e-k8s-1.29-sig-compute b0c88438f2f7b633ca3cdae8eee9d68df6ce203e link true /test pull-kubevirt-e2e-k8s-1.29-sig-compute
pull-kubevirt-e2e-k8s-1.29-sig-operator b0c88438f2f7b633ca3cdae8eee9d68df6ce203e link true /test pull-kubevirt-e2e-k8s-1.29-sig-operator
pull-kubevirt-check-tests-for-flakes ab14750147e4c8a1cee368d8c7f7c53f46a18ca8 link false /test pull-kubevirt-check-tests-for-flakes
pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations ab14750147e4c8a1cee368d8c7f7c53f46a18ca8 link true /test pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

kubevirt-bot avatar May 20 '24 14:05 kubevirt-bot

@lyarwood TODO: create a docs update PR once this one and the DP get merged

enp0s3 avatar May 28 '24 12:05 enp0s3

The tests fail because of a change introduced in #11962:

  1. The Instancetype matcher is changed here (this PR): https://github.com/kubevirt/kubevirt/blob/736662f1b7734c00006bb7da92e28bcc8e503508/tests/hotplug/instancetype.go#L139
  2. This always adds a RestartRequired condition if a matcher was changed (added in #11962) https://github.com/kubevirt/kubevirt/blob/736662f1b7734c00006bb7da92e28bcc8e503508/pkg/virt-controller/watch/vm.go#L3030

IIUC the block linked below already ensures there are only hotpluggable changes, so we should be fine dropping the check introduced in #11962 again.

https://github.com/kubevirt/kubevirt/blob/736662f1b7734c00006bb7da92e28bcc8e503508/pkg/virt-controller/watch/vm.go#L3019-L3028

0xFelix avatar May 28 '24 13:05 0xFelix

Updated to fix a nil-pointer access in handleMemoryHotplugRequest.

See logs for panic: https://storage.googleapis.com/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/11455/pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations/1795451827265736704/artifacts/k8s-reporter/pods/1_kubevirt_virt-controller-69b7c8c4b8-wpxbx-virt-controller.log

0xFelix avatar May 28 '24 16:05 0xFelix

Missed another nil-pointer access in handleMemoryHotplugRequest, updated again.

See https://storage.googleapis.com/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/11455/pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations/1795490607427227648/artifacts/k8s-reporter/pods/1_kubevirt_virt-controller-5c6f74b5d5-6ht4x-virt-controller_previous.log

0xFelix avatar May 29 '24 07:05 0xFelix

/retest-required

0xFelix avatar May 29 '24 13:05 0xFelix

/test pull-kubevirt-e2e-k8s-1.30-sig-compute

0xFelix avatar May 29 '24 15:05 0xFelix