origin ETCD-674: E2E to vertically scale up and down when kubelet is not running on a node and when an unhealthy member is present

The following tests cover two vertical scaling scenarios: one where a member is unhealthy, and one where the kubelet is not running on a node.

The first test validates that scale-down happens before scale-up if the deleted member is unhealthy. CPMS is disabled to observe that scale-down happens first in this case.

1. If the CPMS is active, first disable it by deleting the CPMS custom resource.
2. Remove the static pod manifest from a node and stop the kubelet on the node. This makes the member unhealthy.
3. Delete the machine hosting the node in step 2.
4. Verify the member removal and the total voting member count of 2 to ensure scale-down happens first when a member is unhealthy.
5. Restore the initial cluster state by creating a new machine (scale-up) and re-enabling CPMS.
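
The check in the verification step can be modelled with a small sketch. This is not the test code from the PR; it assumes a three-member control plane (implied by the expected voting count of 2), and the `Member` type and the `master-N` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    healthy: bool
    learner: bool = False  # learners do not count toward quorum


def voting_members(members):
    """Return only the members that vote (non-learners)."""
    return [m for m in members if not m.learner]


def quorum(voting_count):
    """Majority needed for a cluster with `voting_count` voting members."""
    return voting_count // 2 + 1


def remove_member(members, name):
    """Model the scale-down: drop the member hosted on the deleted machine."""
    return [m for m in members if m.name != name]


# Hypothetical three-member control plane. master-2 has lost its static pod
# manifest and its kubelet, so its member is unhealthy.
cluster = [
    Member("master-0", healthy=True),
    Member("master-1", healthy=True),
    Member("master-2", healthy=False),
]

after_scale_down = remove_member(cluster, "master-2")
remaining = voting_members(after_scale_down)

print(len(remaining))          # 2
print(quorum(len(remaining)))  # 2
```

With two voting members the quorum is also two, so the cluster has no failure tolerance until the scale-up in the last step restores the third member.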

The second test covers a vertical scaling scenario when the kubelet is not working on a node. It validates that deleting the machine hosting the node where the kubelet is stopped doesn't get stuck when CPMS is enabled. This is the case described in https://issues.redhat.com/browse/OCPBUGS-17199.

1. CPMS should be active for this test scenario.
2. Stop the kubelet on a node.
3. Delete the machine hosting the node in step 2.
4. That should prompt the ControlPlaneMachineSetOperator (CPMSO) to create a replacement machine and node for that machine index.
5. The operator will first scale up the new machine's member.
6. Then it scales down the machine that is pending deletion by removing its member and deletion hook.
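
The difference in ordering between the two tests can be summarised in a short sketch. It is an illustration rather than the operator's logic: the function names are hypothetical, a starting size of three members is assumed, the member is assumed to stay healthy when only the kubelet is stopped (the description does not state this explicitly), and the learner phase of a newly added member is ignored.

```python
from typing import List


def scaling_order(deleted_member_healthy: bool) -> List[str]:
    """Order of etcd membership changes after a control plane machine is deleted."""
    if deleted_member_healthy:
        # Second test: add the replacement member first, then remove the old one.
        return ["scale-up", "scale-down"]
    # First test: remove the unhealthy member before adding a new one.
    return ["scale-down", "scale-up"]


def voting_counts(start: int, order: List[str]) -> List[int]:
    """Voting member count after each step, starting from `start` members."""
    counts = []
    current = start
    for step in order:
        current += 1 if step == "scale-up" else -1
        counts.append(current)
    return counts


print(voting_counts(3, scaling_order(True)))   # [4, 3]
print(voting_counts(3, scaling_order(False)))  # [2, 3]
```

Both paths end at three voting members; only the intermediate count differs, which is what each test observes.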

Oct 25 '24 20:10 jubittajohn

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

Oct 25 '24 20:10 jubittajohn

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

Oct 25 '24 20:10 jubittajohn

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

Oct 31 '24 15:10 jubittajohn

Job Failure Risk Analysis for sha: 78840bade248d749462aa7c407b9b3a3df499b4d

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-azure-ovn-etcd-scaling	High [sig-node] Managed cluster should verify that nodes have no unexpected reboots [Late] [Suite:openshift/conformance/parallel] This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days. --- [sig-arch][Late][Jira:"kube-apiserver"] collect certificate data [Suite:openshift/conformance/parallel] This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days. --- [sig-etcd][Feature:EtcdVerticalScaling][Suite:openshift/etcd/scaling][Serial] etcd is able to vertically scale up and down when CPMS is disabled [apigroup:machine.openshift.io] This test has passed 100.00% of 1 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days. --- [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility collection This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days. --- Showing 4 of 14 test results
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling	High [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility cleanup This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days. --- [Jira:"Node / Kubelet"] monitor test kubelet-log-collector collection This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days. --- [sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days. --- [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility collection This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days. --- Showing 4 of 13 test results

Oct 31 '24 20:10 openshift-trt-bot

@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Nov 04 '24 16:11 openshift-ci-robot

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

Nov 04 '24 17:11 jubittajohn

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

Nov 08 '24 05:11 jubittajohn

Job Failure Risk Analysis for sha: 7c1093bb2e6b0d6d36473460aee439084c3a56a7

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-azure-ovn-etcd-scaling	High [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days. --- [sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling	High [sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully extended [Suite:openshift/conformance/parallel] This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days. --- [sig-arch][Late][Jira:"kube-apiserver"] collect certificate data [Suite:openshift/conformance/parallel] This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.

Nov 08 '24 10:11 openshift-trt-bot

@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Nov 12 '24 15:11 openshift-ci-robot

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

Nov 12 '24 16:11 jubittajohn

@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Nov 12 '24 17:11 openshift-ci-robot

Job Failure Risk Analysis for sha: da81f2aa001b2032b398233685088c00346157f2

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-vsphere-ovn-etcd-scaling	High [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn	IncompleteTests Tests for this run (101) are below the historical average (2543): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout	Low [Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io] This test has passed 69.23% of 13 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.

Nov 12 '24 20:11 openshift-trt-bot

@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Nov 13 '24 03:11 openshift-ci-robot

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

Nov 13 '24 03:11 jubittajohn

/lgtm

/hold

Nov 13 '24 12:11 tjungblu

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

Nov 13 '24 12:11 tjungblu

Job Failure Risk Analysis for sha: 03ccda6aa70fe9dd9fcc3b959a237fcf602103a8

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-gcp-ovn-etcd-scaling	High

Nov 13 '24 16:11 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

Nov 14 '24 20:11 jubittajohn

/test e2e-gcp-ovn-etcd-scaling /test e2e-aws-ovn-edge-zones

Nov 15 '24 06:11 jubittajohn

/test e2e-vsphere-ovn-etcd-scaling

Nov 15 '24 14:11 jubittajohn

Job Failure Risk Analysis for sha: df7330d85deaf4180e3cd1fd328d7c0f9371bbd8

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-gcp-ovn-etcd-scaling	High [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-etcd-scaling'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-azure-ovn-etcd-scaling	High [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days.

Nov 15 '24 19:11 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

Nov 18 '24 16:11 jubittajohn

/lgtm

Nov 18 '24 16:11 tjungblu

/test e2e-azure-ovn-etcd-scaling

Nov 18 '24 20:11 jubittajohn

/test e2e-azure-ovn-etcd-scaling

Nov 19 '24 19:11 jubittajohn

/test e2e-aws-ovn-etcd-scaling

Nov 20 '24 15:11 jubittajohn

/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling

Nov 25 '24 16:11 jubittajohn

/test e2e-azure-ovn-etcd-scaling

Nov 25 '24 21:11 jubittajohn

/test e2e-azure-ovn-etcd-scaling

Nov 26 '24 16:11 jubittajohn

Job Failure Risk Analysis for sha: 1f87a9e45ef9a396b253dc222b050bd6a1aac22f

Job Name	Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn	High [sig-arch] Only known images used by tests This test has passed 100.00% of 18 runs on jobs ['periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling	High [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-azure-ovn-etcd-scaling	IncompleteTests Tests for this run (106) are below the historical average (984): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial	IncompleteTests Tests for this run (26) are below the historical average (1350): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout	Low [sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel] This test has passed 76.19% of 21 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-kube-apiserver-rollout' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days. Open Bugs Component Readiness: operators should not create watch channels very often

Nov 26 '24 21:11 openshift-trt[bot]