
instancetype: Support Live Updates

Open lyarwood opened this pull request 11 months ago • 27 comments

What this PR does

https://github.com/kubevirt/community/pull/268

Before this PR:

Any attempt to change an instance type of a running VM with LiveUpdates would be rejected.

After this PR:

Attempts to change the instance type of a running VM with LiveUpdates now result in the same overall behaviour as making the equivalent changes directly to a vanilla VM.

$ ./cluster-up/kubectl.sh apply -k https://github.com/kubevirt/common-instancetypes

$ ./cluster-up/kubectl.sh patch kv/kubevirt -n kubevirt --type merge -p '{"spec":{"workloadUpdateStrategy":{"workloadUpdateMethods":["LiveMigrate"]},"configuration":{"developerConfiguration":{"featureGates": ["VMLiveUpdateFeatures"]},"vmRolloutStrategy": "LiveUpdate", "liveUpdateConfiguration": {"maxGuest": "8Gi"}}}}'
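
For reference, a quick way to read the rollout strategy back after the patch (a minimal check, not part of the original flow; it should print LiveUpdate if the patch applied cleanly):

$ ./cluster-up/kubectl.sh get kv/kubevirt -n kubevirt -o jsonpath='{.spec.configuration.vmRolloutStrategy}'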

$ ./cluster-up/kubectl.sh apply -f - <<EOF
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: alpine
spec:
  instancetype:
   name: u1.nano
  runStrategy: Always
  template:
    metadata:
      creationTimestamp: null
    spec:
      domain:
        devices:
          interfaces:
          - masquerade: {}
            name: default
        machine:
        resources: {}
      networks:
      - name: default
        pod: {}
      volumes:
      - containerDisk:
          image: registry:5000/kubevirt/alpine-container-disk-demo:devel
        name: alpine
      - cloudInitNoCloud:
          userData: |
            #!/bin/sh
            echo 'printed from cloud-init userdata'
        name: cloudinitdisk
EOF

$ ./cluster-up/kubectl.sh wait vms/alpine --for=condition=Ready
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.spec.domain.memory.guest}'="512Mi"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestAtBoot}'="512Mi"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestRequested}'="512Mi"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestCurrent}'="512Mi"
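
The 512Mi value here comes from the u1.nano instance type itself. As an optional cross-check (assuming the common-instancetypes bundle installed u1.nano as a cluster-wide instance type), the definition can be read back directly:

$ ./cluster-up/kubectl.sh get virtualmachineclusterinstancetype u1.nano -o jsonpath='{.spec.memory.guest}'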

$ ./cluster-up/kubectl.sh patch vms/alpine --type merge -p '{"spec":{"instancetype":{"name":"u1.micro","revisionName":""}}}'

$ ./cluster-up/kubectl.sh wait vmis/alpine --for=condition=HotMemoryChange
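
At this point the VMI spec should already carry the new instance type values even though the live migration has not yet run; an optional check in the same style as above (assuming u1.micro resolves to 1Gi of guest memory, as the status checks below suggest):

$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.spec.domain.memory.guest}'="1Gi"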

$ while ! ./cluster-up/kubectl.sh get virtualmachineinstancemigrations -l kubevirt.io/vmi-name=alpine; do
  sleep 1
done 

$ ./cluster-up/kubectl.sh wait virtualmachineinstancemigrations -l kubevirt.io/vmi-name=alpine --for=jsonpath='{.status.phase}'="Succeeded"

$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestAtBoot}'="512Mi"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestRequested}'="1Gi"
$ ./cluster-up/kubectl.sh wait vmis/alpine --for=jsonpath='{.status.memory.guestCurrent}'="1Gi"
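
Because the patch cleared spec.instancetype.revisionName, the controller is expected to capture the new instance type in a fresh ControllerRevision; an optional sketch of reading the repopulated reference back (field names as used in the patches above):

$ ./cluster-up/kubectl.sh get vms/alpine -o jsonpath='{.spec.instancetype.revisionName}'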

Fixes #

Why we need it and why it was done in this way

The following tradeoffs were made:

The following alternatives were considered:

Links to places where the discussion took place:

Special notes for your reviewer

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR. Approvers are expected to review this list.

Release note

`LiveUpdates` of VMs using instance types are now supported with the same caveats as when making changes to a vanilla VM.

lyarwood avatar Mar 06 '24 16:03 lyarwood

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

kubevirt-bot avatar Mar 06 '24 16:03 kubevirt-bot

/area instancetype

lyarwood avatar Mar 06 '24 18:03 lyarwood

Dumping my progress ahead of PTO. After removing a validation webhook check I can now get as far as the HotVCPUChange condition appearing against the VM when switching instance types, but no migration is invoked. That might be an issue with my kubevirtci-based env. Anyway, I'm back from the 18th to keep hacking on this; comments welcome in the meantime, but I will be rewriting/rebasing this on https://github.com/kubevirt/kubevirt/pull/11457 when I get back.

$ kustomize build https://github.com/kubevirt/common-instancetypes | ./cluster-up/kubectl.sh apply -f -
[..]
$ ./cluster-up/kubectl.sh apply -f - <<EOF
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  creationTimestamp: null
  name: fedora
spec:
  instancetype:
    name: u1.medium
  preference:
    name: fedora
  runStrategy: Always
  template:
    metadata:
      creationTimestamp: null
    spec:
      domain:
        devices:
          interfaces:
          - name: internal
            masquerade: {}
        resources: {}
      terminationGracePeriodSeconds: 180
      volumes:
      - containerDisk:
          image: quay.io/containerdisks/fedora:39
        name: fedora
status: {}
EOF
virtualmachine.kubevirt.io/fedora created
$ ./cluster-up/kubectl.sh get vmis/fedora -o json | jq .spec.domain.cpu.sockets
1
$ ./cluster-up/kubectl.sh get vmis/fedora -o json | jq .spec.domain.memory.guest
"4Gi"
$ ./cluster-up/kubectl.sh patch vms/fedora --type merge -p '{"spec":{"instancetype":{"revisionName":"", "name":"u1.large"}}}'
virtualmachine.kubevirt.io/fedora patched
$ ./cluster-up/kubectl.sh get vmis/fedora -o json | jq .spec.domain.memory.guest
"8Gi"
$ ./cluster-up/kubectl.sh get vmis/fedora -o json | jq .spec.domain.cpu.sockets
2
$ ./cluster-up/kubectl.sh get vmis/fedora -o json | jq .status.conditions
[
  {
    "lastProbeTime": null,
    "lastTransitionTime": "2024-03-07T18:03:17Z",
    "status": "True",
    "type": "Ready"
  },
  {
    "lastProbeTime": null,
    "lastTransitionTime": null,
    "status": "True",
    "type": "LiveMigratable"
  },
  {
    "lastProbeTime": "2024-03-07T18:03:27Z",
    "lastTransitionTime": null,
    "status": "True",
    "type": "AgentConnected"
  },
  {
    "lastProbeTime": null,
    "lastTransitionTime": null,
    "status": "True",
    "type": "HotVCPUChange"
  }
]
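
For completeness, the claim that no migration is invoked can be checked directly by listing migration objects for the VMI (same label selector as used in the PR description above; an empty result here illustrates the broken behaviour):

$ ./cluster-up/kubectl.sh get virtualmachineinstancemigrations -l kubevirt.io/vmi-name=fedora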

lyarwood avatar Mar 07 '24 18:03 lyarwood

/hold

Just running this through CI before cleaning up the description and commit messages.

lyarwood avatar Apr 29 '24 16:04 lyarwood

/hold cancel

lyarwood avatar Apr 30 '24 13:04 lyarwood

/retest-required

lyarwood avatar May 01 '24 08:05 lyarwood

Dumping my progress ahead of PTO. After removing a validation webhook check I can now get as far as the HotVCPUChange condition appearing against the VM when switching instance types, but no migration is invoked. That might be an issue with my kubevirtci-based env. Anyway, I'm back from the 18th to keep hacking on this; comments welcome in the meantime, but I will be rewriting/rebasing this on #11457 when I get back.

This is exactly what I was looking for. Hopefully the PR will get merged in the coming days!!

govindkailas avatar May 01 '24 16:05 govindkailas

/hold

Bad rebase, I'll fix now.

lyarwood avatar May 07 '24 12:05 lyarwood

/approve

@lyarwood there's a commit marked WIP, probably needs fixing

@enp0s3 you might be interested in the couple of commits that move logic from the VMI controller to the webhook. It looks fine to me, since the code in question is all about applying default values, which is generally considered a good use of webhooks.

jean-edouard avatar May 08 '24 13:05 jean-edouard

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jean-edouard

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

kubevirt-bot avatar May 08 '24 13:05 kubevirt-bot

@lyarwood there's a commit marked WIP, probably needs fixing

@jean-edouard thanks and yeah see https://github.com/kubevirt/kubevirt/pull/11873 now that I've reproduced it locally against HEAD, hopefully this isn't caused by something I have cached locally.

lyarwood avatar May 08 '24 13:05 lyarwood

Leaving the /hold in place while we wait on https://github.com/kubevirt/kubevirt/pull/11873 that others are now seeing in CI. I'll rebase one final time when that lands.

lyarwood avatar May 09 '24 09:05 lyarwood

/hold cancel

https://github.com/kubevirt/kubevirt/pull/11873 has landed and I've rebased.

lyarwood avatar May 10 '24 09:05 lyarwood

/hold

While I look into the following failure; at first glance I'm not sure whether it's related to the changes in this PR:

Tests Suite: [sig-compute][Serial]CPU Hotplug A VM with cpu.maxSockets set higher than cpu.sockets [test_id:10822] should successfully plug guaranteed vCPUs (2m3s)
{ Failure tests/hotplug/cpu.go:189
Timed out after 60.002s.
Expected
    <[]interface {} | len:5, cap:5>: [
        <map[string]interface {} | len:4>{
            "lastProbeTime": nil,
            "lastTransitionTime": <string>"2024-05-10T09:35:08Z",
            "type": <string>"Ready",
            "status": <string>"True",
        },
        <map[string]interface {} | len:4>{
            "type": <string>"LiveMigratable",
            "status": <string>"True",
            "lastProbeTime": nil,
            "lastTransitionTime": nil,
        },
        <map[string]interface {} | len:4>{
            "lastProbeTime": nil,
            "lastTransitionTime": nil,
            "type": <string>"HotVCPUChange",
            "status": <string>"True",
        },
        <map[string]interface {} | len:6>{
            "message": <string>"server error. command SyncVMI failed: \"not enough exclusive threads provided, could not fit 2 core(s)\"",
            "type": <string>"Synchronized",
            "status": <string>"False",
            "lastProbeTime": nil,
            "lastTransitionTime": <string>"2024-05-10T09:35:08Z",
            "reason": <string>"Synchronizing with the Domain failed.",
        },
        <map[string]interface {} | len:4>{
            "type": <string>"AgentConnected",
            "status": <string>"True",
            "lastProbeTime": <string>"2024-05-10T09:35:13Z",
            "lastTransitionTime": nil,
        },
    ]
to not find condition of type 'HotVCPUChange' with status 'True'
tests/hotplug/cpu.go:245}

lyarwood avatar May 10 '24 10:05 lyarwood

/hold

While I look into the following failure; at first glance I'm not sure whether it's related to the changes in this PR:

Status update is currently broken; @xpivarc is working on a fix.

jean-edouard avatar May 10 '24 11:05 jean-edouard

/hold cancel
/retest-required

lyarwood avatar May 10 '24 19:05 lyarwood

/hold

will need to rebase on https://github.com/kubevirt/kubevirt/pull/11836

lyarwood avatar May 13 '24 08:05 lyarwood

/hold cancel

lyarwood avatar May 13 '24 09:05 lyarwood

/cc @acardace

Rebased on #11655. Antonio, if you still have the context loaded, feel free to review this series as it's touching the same area as your PR.

lyarwood avatar May 13 '24 15:05 lyarwood

@xpivarc I can't reproduce this locally, smells like another race?

https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/11455/pull-kubevirt-unit-test-arm64/1790046996891963392

• [FAILED] [0.114 seconds]
VirtualMachine One valid VirtualMachine controller given The RestartRequired condition [It] should appear when VM doesn't specify maxSockets and sockets go above cluster-wide maxSockets
pkg/virt-controller/watch/vm_test.go:6275
  Timeline >>
  STEP: Creating a VM with CPU sockets set to the cluster maxiumum @ 05/13/24 15:55:43.583
  STEP: Creating a VMI with cluster max @ 05/13/24 15:55:43.595
  STEP: Bumping the VM sockets above the cluster maximum @ 05/13/24 15:55:43.696
  STEP: Executing the controller expecting the RestartRequired condition to appear @ 05/13/24 15:55:43.696
  {"component":"","level":"info","msg":"Updating cluster config from KubeVirt to resource version '985bxnx4xb'","pos":"configuration.go:362","timestamp":"2024-05-13T15:55:43.696413Z"}
  [FAILED] in [It] - pkg/virt-controller/watch/vm_test.go:6298 @ 05/13/24 15:55:43.696
  << Timeline

[FAILED] restart required
  Expected
      <[]v1.VirtualMachineCondition | len:2, cap:2>: [
          {
              Type: "Ready",
              Status: "False",
              LastProbeTime: {
                  Time: 2024-05-13T15:55:43.696507153Z,
              },
              LastTransitionTime: {
                  Time: 2024-05-13T15:55:43.696507153Z,
              },
              Reason: "VMIConditionMissing",
              Message: "VMI is missing the Ready condition",
          },
          {
              Type: "Initialized",
              Status: "True",
              LastProbeTime: {
                  Time: 0001-01-01T00:00:00Z,
              },
              LastTransitionTime: {
                  Time: 0001-01-01T00:00:00Z,
              },
              Reason: "",
              Message: "",
          },
      ]
  to contain element matching
      <*gstruct.FieldsMatcher | 0x4002c2a480>: {
          Fields: {
              "Type": <*matchers.EqualMatcher | 0x4001926ea0>{
                  Expected: <v1.VirtualMachineConditionType>"RestartRequired",
              },
              "Status": <*matchers.EqualMatcher | 0x4001926ec0>{
                  Expected: <v1.ConditionStatus>"True",
              },
          },
          IgnoreExtras: true,
          IgnoreMissing: false,
          failures: [
              <*errors.NestedError | 0x4002882340>{
                  Path: ".Type",
                  Err: <*errors.errorString | 0x40019272e0>{
                      s: "Expected\n    <v1.VirtualMachineConditionType>: Initialized\nto equal\n    <v1.VirtualMachineConditionType>: RestartRequired",
                  },
              },
          ],
      }
  In [It] at: pkg/virt-controller/watch/vm_test.go:6298 @ 05/13/24 15:55:43.696

lyarwood avatar May 13 '24 16:05 lyarwood

/hold

lyarwood avatar May 13 '24 16:05 lyarwood

Required labels detected, running phase 2 presubmits:
/test pull-kubevirt-e2e-windows2016
/test pull-kubevirt-e2e-kind-sriov
/test pull-kubevirt-e2e-k8s-1.30-ipv6-sig-network
/test pull-kubevirt-e2e-k8s-1.28-sig-network
/test pull-kubevirt-e2e-k8s-1.28-sig-storage
/test pull-kubevirt-e2e-k8s-1.28-sig-compute
/test pull-kubevirt-e2e-k8s-1.28-sig-operator
/test pull-kubevirt-e2e-k8s-1.29-sig-network
/test pull-kubevirt-e2e-k8s-1.29-sig-storage
/test pull-kubevirt-e2e-k8s-1.29-sig-compute
/test pull-kubevirt-e2e-k8s-1.29-sig-operator

kubevirt-commenter-bot avatar May 15 '24 12:05 kubevirt-commenter-bot

New changes are detected. LGTM label has been removed.

kubevirt-bot avatar May 20 '24 12:05 kubevirt-bot

@lyarwood thanks for the rebase! Is the hold still necessary?

jean-edouard avatar May 20 '24 12:05 jean-edouard

@lyarwood thanks for the rebase! Is the hold still necessary?

Yeah I wanted to see if we still needed https://github.com/kubevirt/kubevirt/pull/11904 or https://github.com/kubevirt/kubevirt/pull/11836 and to work with folks on these fixes if we do.

lyarwood avatar May 20 '24 12:05 lyarwood

@lyarwood: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubevirt-e2e-k8s-1.27-sig-performance 991035b7ee4e446be50773e92aeaa5b5bcd9824d link true /test pull-kubevirt-e2e-k8s-1.27-sig-performance
pull-kubevirt-e2e-kind-1.27-vgpu a7c77b3f9695c04e39a645778aca7a33c00651a8 link false /test pull-kubevirt-e2e-kind-1.27-vgpu
pull-kubevirt-e2e-k8s-1.29-sig-storage b0c88438f2f7b633ca3cdae8eee9d68df6ce203e link true /test pull-kubevirt-e2e-k8s-1.29-sig-storage
pull-kubevirt-e2e-k8s-1.29-sig-compute b0c88438f2f7b633ca3cdae8eee9d68df6ce203e link true /test pull-kubevirt-e2e-k8s-1.29-sig-compute
pull-kubevirt-e2e-k8s-1.29-sig-operator b0c88438f2f7b633ca3cdae8eee9d68df6ce203e link true /test pull-kubevirt-e2e-k8s-1.29-sig-operator
pull-kubevirt-check-tests-for-flakes ab14750147e4c8a1cee368d8c7f7c53f46a18ca8 link false /test pull-kubevirt-check-tests-for-flakes
pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations ab14750147e4c8a1cee368d8c7f7c53f46a18ca8 link true /test pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

kubevirt-bot avatar May 20 '24 14:05 kubevirt-bot

@lyarwood TODO: create a docs update PR once this one and the DP get merged

enp0s3 avatar May 28 '24 12:05 enp0s3

The tests fail because of a change introduced in #11962:

  1. The Instancetype matcher is changed here (this PR): https://github.com/kubevirt/kubevirt/blob/736662f1b7734c00006bb7da92e28bcc8e503508/tests/hotplug/instancetype.go#L139
  2. This always adds a RestartRequired condition if a matcher was changed (added in #11962) https://github.com/kubevirt/kubevirt/blob/736662f1b7734c00006bb7da92e28bcc8e503508/pkg/virt-controller/watch/vm.go#L3030

IIUC the block linked below already ensures there are only hotpluggable changes, so we should be fine dropping the check introduced in #11962 again.

https://github.com/kubevirt/kubevirt/blob/736662f1b7734c00006bb7da92e28bcc8e503508/pkg/virt-controller/watch/vm.go#L3019-L3028

0xFelix avatar May 28 '24 13:05 0xFelix

Updated to fix a nil-pointer access in handleMemoryHotplugRequest.

See logs for panic: https://storage.googleapis.com/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/11455/pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations/1795451827265736704/artifacts/k8s-reporter/pods/1_kubevirt_virt-controller-69b7c8c4b8-wpxbx-virt-controller.log

0xFelix avatar May 28 '24 16:05 0xFelix

Missed another nil-pointer access in handleMemoryHotplugRequest, updated again.

See https://storage.googleapis.com/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/11455/pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations/1795490607427227648/artifacts/k8s-reporter/pods/1_kubevirt_virt-controller-5c6f74b5d5-6ht4x-virt-controller_previous.log

0xFelix avatar May 29 '24 07:05 0xFelix

/retest-required

0xFelix avatar May 29 '24 13:05 0xFelix

/test pull-kubevirt-e2e-k8s-1.30-sig-compute

0xFelix avatar May 29 '24 15:05 0xFelix