ETCD-674: E2E to vertically scale up and down when kubelet is not running on a node and when an unhealthy member is present

Open jubittajohn opened this issue 1 year ago • 50 comments

The following tests cover two vertical scaling scenarios: one where a member is unhealthy, and one where kubelet is not working on a node.

The first test validates that scale-down happens before scale-up if the deleted member is unhealthy. CPMS is disabled to observe that scale-down happens first in this case.

  1. If the CPMS is active, first disable it by deleting the CPMS custom resource.
  2. Remove the static pod manifest from a node and stop the kubelet on the node. This makes the member unhealthy.
  3. Delete the machine hosting the node in step 2.
  4. Verify the member removal and the total voting member count of 2 to ensure scale-down happens first when a member is unhealthy.
  5. Restore the initial cluster state by creating a new machine (scale-up) and re-enabling CPMS.
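
The ordering that the first test checks can be illustrated with a small in-memory sketch. This is not the test code from this PR (the real test runs against a live cluster); `ToyControlPlane`, the `master-N` names, and the event labels are all hypothetical and only model the five steps above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyControlPlane:
    """Hypothetical in-memory model: member name -> healthy flag."""
    members: dict = field(
        default_factory=lambda: {"master-0": True, "master-1": True, "master-2": True}
    )
    cpms_active: bool = True
    events: list = field(default_factory=list)

    def voting_members(self) -> int:
        return len(self.members)

    def delete_machine(self, name: str) -> None:
        # Only the unhealthy path is modeled: the member is removed right away,
        # without waiting for a replacement member to join first.
        if self.members[name]:
            raise ValueError("this sketch only models deleting an unhealthy member")
        del self.members[name]
        self.events.append(("scale-down", name))

    def create_machine(self, name: str) -> None:
        self.members[name] = True
        self.events.append(("scale-up", name))


def run_unhealthy_member_scenario() -> ToyControlPlane:
    cp = ToyControlPlane()
    cp.cpms_active = False            # step 1: disable CPMS
    cp.members["master-1"] = False    # step 2: manifest removed, kubelet stopped
    cp.delete_machine("master-1")     # step 3: delete the machine
    # step 4: the member is gone and only two voting members remain
    assert "master-1" not in cp.members
    assert cp.voting_members() == 2
    cp.create_machine("master-3")     # step 5: scale up again ...
    cp.cpms_active = True             # ... and re-enable CPMS
    return cp
```

Calling `run_unhealthy_member_scenario().events` shows the scale-down event recorded before the scale-up event, which is the property the test is meant to observe.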

The second test covers a vertical scaling scenario when kubelet is not working on a node. It validates that deleting the machine hosting the node where the kubelet is stopped doesn't get stuck when CPMS is enabled. This is the case described in https://issues.redhat.com/browse/OCPBUGS-17199

CPMS should be active for this test scenario.

  1. Stop the kubelet on a node.
  2. Delete the machine hosting the node in step 1.
  3. That should prompt the ControlPlaneMachineSetOperator (CPMSO) to create a replacement machine and node for that machine index.
  4. The operator will first scale up the new machine's member.
  5. It will then scale down the machine that is pending deletion by removing its member and deletion hook.
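
As a rough illustration of the expected ordering in the second scenario, the steps can be modeled with plain sets. This is a toy sketch under stated assumptions, not the PR's test code: `replace_machine_with_cpms`, the event labels, and the `master-N` names are hypothetical, and the deletion hook is modeled only as set membership.

```python
def replace_machine_with_cpms(members: set, hooks: set, doomed: str, replacement: str):
    """Toy model of the ordering with CPMS active.

    Returns (events, peak_member_count). Hook handling for the new
    machine is not modeled.
    """
    events = []
    # Steps 1-2: kubelet is stopped and the machine is deleted; it stays
    # pending deletion while it still holds a deletion hook.
    events.append(("machine-deleting", doomed))
    # Step 3: CPMSO creates a replacement machine for the same index.
    events.append(("machine-created", replacement))
    # Step 4: scale up first, so the cluster briefly has one extra member.
    members.add(replacement)
    events.append(("scale-up", replacement))
    peak = len(members)
    # Step 5: scale down by removing the old member and its deletion hook.
    members.discard(doomed)
    hooks.discard(doomed)
    events.append(("scale-down", doomed))
    return events, peak
```

Starting from three members, the peak count in this model is four, reflecting that scale-up precedes scale-down when CPMS is active, in contrast to the unhealthy-member scenario.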

jubittajohn avatar Oct 25 '24 20:10 jubittajohn

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Oct 25 '24 20:10 jubittajohn

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Oct 25 '24 20:10 jubittajohn

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Oct 31 '24 15:10 jubittajohn

Job Failure Risk Analysis for sha: 78840bade248d749462aa7c407b9b3a3df499b4d

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-azure-ovn-etcd-scaling High
[sig-node] Managed cluster should verify that nodes have no unexpected reboots [Late] [Suite:openshift/conformance/parallel]
This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days.
---
[sig-arch][Late][Jira:"kube-apiserver"] collect certificate data [Suite:openshift/conformance/parallel]
This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days.
---
[sig-etcd][Feature:EtcdVerticalScaling][Suite:openshift/etcd/scaling][Serial] etcd is able to vertically scale up and down when CPMS is disabled [apigroup:machine.openshift.io]
This test has passed 100.00% of 1 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days.
---
[Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility collection
This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days.
---
Showing 4 of 14 test results
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling High
[Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility cleanup
This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.
---
[Jira:"Node / Kubelet"] monitor test kubelet-log-collector collection
This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.
---
[sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system
This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.
---
[Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility collection
This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.
---
Showing 4 of 13 test results

openshift-trt-bot avatar Oct 31 '24 20:10 openshift-trt-bot

@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Nov 04 '24 16:11 openshift-ci-robot

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Nov 04 '24 17:11 jubittajohn

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Nov 08 '24 05:11 jubittajohn

Job Failure Risk Analysis for sha: 7c1093bb2e6b0d6d36473460aee439084c3a56a7

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-azure-ovn-etcd-scaling High
[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available
This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days.
---
[sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system
This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling High
[sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully extended [Suite:openshift/conformance/parallel]
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.
---
[sig-arch][Late][Jira:"kube-apiserver"] collect certificate data [Suite:openshift/conformance/parallel]
This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.

openshift-trt-bot avatar Nov 08 '24 10:11 openshift-trt-bot

@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

The following test covers a vertical scaling scenario when kubelet is not working on a node. This test validates that deleting the machine hosting the node where the kubelet is stopped doesn't get stuck when CPMS is enabled.

CPMS should be active for this test scenario

  1. Stop the kubelet on a node
  2. Delete the machine hosting the node in step 1.
  3. That should prompt the ControlPlaneMachineSetOperator(CPMSO) to create a replacement machine and node for that machine index
  4. The operator will first scale-up the new machine's member
  5. Then scale-down the machine that is pending deletion by removing its member and deletion hook

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Nov 12 '24 15:11 openshift-ci-robot

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Nov 12 '24 16:11 jubittajohn

@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

The following test covers a vertical scaling scenario when kubelet is not working on a node. This test validates that deleting the machine hosting the node where the kubelet is stopped doesn't get stuck when CPMS is enabled. The case in this bug: https://issues.redhat.com/browse/OCPBUGS-17199

CPMS should be active for this test scenario

  1. Stop the kubelet on a node
  2. Delete the machine hosting the node in step 1.
  3. That should prompt the ControlPlaneMachineSetOperator(CPMSO) to create a replacement machine and node for that machine index
  4. The operator will first scale-up the new machine's member
  5. Then scale-down the machine that is pending deletion by removing its member and deletion hook

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Nov 12 '24 17:11 openshift-ci-robot

Job Failure Risk Analysis for sha: da81f2aa001b2032b398233685088c00346157f2

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-vsphere-ovn-etcd-scaling High
[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available
This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn IncompleteTests
Tests for this run (101) are below the historical average (2543): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout Low
[Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
This test has passed 69.23% of 13 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.

openshift-trt-bot avatar Nov 12 '24 20:11 openshift-trt-bot

@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Nov 13 '24 03:11 openshift-ci-robot

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Nov 13 '24 03:11 jubittajohn

/lgtm

/hold

tjungblu avatar Nov 13 '24 12:11 tjungblu

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

tjungblu avatar Nov 13 '24 12:11 tjungblu

Job Failure Risk Analysis for sha: 03ccda6aa70fe9dd9fcc3b959a237fcf602103a8

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-gcp-ovn-etcd-scaling High

openshift-trt-bot avatar Nov 13 '24 16:11 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Nov 14 '24 20:11 jubittajohn

/test e2e-gcp-ovn-etcd-scaling /test e2e-aws-ovn-edge-zones

jubittajohn avatar Nov 15 '24 06:11 jubittajohn

/test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Nov 15 '24 14:11 jubittajohn

Job Failure Risk Analysis for sha: df7330d85deaf4180e3cd1fd328d7c0f9371bbd8

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-gcp-ovn-etcd-scaling High
[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-etcd-scaling'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-azure-ovn-etcd-scaling High
[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available
This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days.

openshift-trt-bot avatar Nov 15 '24 19:11 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Nov 18 '24 16:11 jubittajohn

/lgtm

tjungblu avatar Nov 18 '24 16:11 tjungblu

/test e2e-azure-ovn-etcd-scaling

jubittajohn avatar Nov 18 '24 20:11 jubittajohn

/test e2e-azure-ovn-etcd-scaling

jubittajohn avatar Nov 19 '24 19:11 jubittajohn

/test e2e-aws-ovn-etcd-scaling

jubittajohn avatar Nov 20 '24 15:11 jubittajohn

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Nov 25 '24 16:11 jubittajohn

/test e2e-azure-ovn-etcd-scaling

jubittajohn avatar Nov 25 '24 21:11 jubittajohn

/test e2e-azure-ovn-etcd-scaling

jubittajohn avatar Nov 26 '24 16:11 jubittajohn

Job Failure Risk Analysis for sha: 1f87a9e45ef9a396b253dc222b050bd6a1aac22f

Job Name Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn High
[sig-arch] Only known images used by tests
This test has passed 100.00% of 18 runs on jobs ['periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling High
[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available
This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-azure-ovn-etcd-scaling IncompleteTests
Tests for this run (106) are below the historical average (984): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial IncompleteTests
Tests for this run (26) are below the historical average (1350): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout Low
[sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel]
This test has passed 76.19% of 21 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-kube-apiserver-rollout' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days.

Open Bugs
Component Readiness: operators should not create watch channels very often

openshift-trt[bot] avatar Nov 26 '24 21:11 openshift-trt[bot]