ETCD-674: E2E to vertically scale up and down when kubelet is not running on a node and when an unhealthy member is present
The following test covers a vertical scaling scenario when a member is unhealthy and another scenario when kubelet is not working on a node.
The first test validates that scale-down happens before scale-up when the deleted member is unhealthy. CPMS is disabled so that the scale-down-first behavior can be observed in this case.
- If the CPMS is active, first disable it by deleting the CPMS custom resource.
- Remove the static pod manifest from a node and stop the kubelet on the node. This makes the member unhealthy.
- Delete the machine hosting the node in step 2.
- Verify that the member is removed and the voting member count drops to 2, confirming that scale-down happens first when a member is unhealthy.
- Restore the initial cluster state by creating a new machine (scale-up) and re-enabling CPMS.
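The voting-member check in the verification step above can be sketched as a small helper. This is illustrative only, assuming the JSON shape produced by `etcdctl member list -w json`, where members carry an `isLearner` flag only while they are learners; the member names below are made up:

```python
import json

def count_voting_members(member_list_json: str) -> int:
    """Count voting (non-learner) members in `etcdctl member list -w json` output."""
    data = json.loads(member_list_json)
    return sum(1 for m in data.get("members", []) if not m.get("isLearner", False))

# Example output after the unhealthy member has been scaled down
# (IDs and names are hypothetical):
sample = json.dumps({
    "members": [
        {"ID": 1, "name": "master-0"},
        {"ID": 2, "name": "master-1"},
    ]
})
assert count_voting_members(sample) == 2  # scale-down happened first
```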
The second test covers a vertical scaling scenario when the kubelet is not working on a node. It validates that deleting the machine hosting the node where the kubelet is stopped does not get stuck when CPMS is enabled. This is the case reported in https://issues.redhat.com/browse/OCPBUGS-17199. CPMS must be active for this scenario.
- Stop the kubelet on a node.
- Delete the machine hosting the node in step 1.
- That should prompt the ControlPlaneMachineSetOperator (CPMSO) to create a replacement machine and node for that machine index.
- The operator first scales up by adding the new machine's member.
- It then scales down the machine pending deletion by removing its member and deletion hook.
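The ordering the CPMS-enabled path must preserve (add the replacement member before removing the old one) can be expressed as a small check. The event names here are hypothetical labels for illustration, not identifiers from the operator's code:

```python
def scale_up_precedes_scale_down(events: list) -> bool:
    """Return True if the replacement member is added before the old member is removed."""
    try:
        added = events.index("member-added")
        removed = events.index("member-removed")
    except ValueError:
        return False  # one of the transitions never happened, so ordering cannot hold
    return added < removed

# With CPMS enabled, the operator adds the new member first, so a healthy run
# produces an event sequence like this (labels are made up):
assert scale_up_precedes_scale_down(
    ["machine-deleted", "member-added", "member-removed", "hook-removed"]
)
```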
/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling
/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling
/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling
Job Failure Risk Analysis for sha: 78840bade248d749462aa7c407b9b3a3df499b4d
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-azure-ovn-etcd-scaling | High [sig-node] Managed cluster should verify that nodes have no unexpected reboots [Late] [Suite:openshift/conformance/parallel] This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days. --- [sig-arch][Late][Jira:"kube-apiserver"] collect certificate data [Suite:openshift/conformance/parallel] This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days. --- [sig-etcd][Feature:EtcdVerticalScaling][Suite:openshift/etcd/scaling][Serial] etcd is able to vertically scale up and down when CPMS is disabled [apigroup:machine.openshift.io] This test has passed 100.00% of 1 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days. --- [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility collection This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days. --- Showing 4 of 14 test results |
| pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling | High [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility cleanup This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days. --- [Jira:"Node / Kubelet"] monitor test kubelet-log-collector collection This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days. --- [sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days. --- [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility collection This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days. --- Showing 4 of 13 test results |
@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling
/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling
Job Failure Risk Analysis for sha: 7c1093bb2e6b0d6d36473460aee439084c3a56a7
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-azure-ovn-etcd-scaling | High [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days. --- [sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days. |
| pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling | High [sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully extended [Suite:openshift/conformance/parallel] This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days. --- [sig-arch][Late][Jira:"kube-apiserver"] collect certificate data [Suite:openshift/conformance/parallel] This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days. |
@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.
In response to this:
The following test covers a vertical scaling scenario when kubelet is not working on a node. This test validates that deleting the machine hosting the node where the kubelet is stopped doesn't get stuck when CPMS is enabled.
CPMS should be active for this test scenario
- Stop the kubelet on a node
- Delete the machine hosting the node in step 1.
- That should prompt the ControlPlaneMachineSetOperator(CPMSO) to create a replacement machine and node for that machine index
- The operator will first scale-up the new machine's member
- Then scale-down the machine that is pending deletion by removing its member and deletion hook
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling
@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.
In response to this:
The following test covers a vertical scaling scenario when the kubelet is not working on a node. This test validates that deleting the machine hosting the node where the kubelet is stopped does not get stuck when CPMS is enabled. This is the case reported in https://issues.redhat.com/browse/OCPBUGS-17199.
CPMS should be active for this test scenario
- Stop the kubelet on a node
- Delete the machine hosting the node in step 1.
- That should prompt the ControlPlaneMachineSetOperator(CPMSO) to create a replacement machine and node for that machine index
- The operator will first scale-up the new machine's member
- Then scale-down the machine that is pending deletion by removing its member and deletion hook
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Job Failure Risk Analysis for sha: da81f2aa001b2032b398233685088c00346157f2
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-vsphere-ovn-etcd-scaling | High [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days. |
| pull-ci-openshift-origin-master-e2e-metal-ipi-ovn | IncompleteTests Tests for this run (101) are below the historical average (2543): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout | Low [Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io] This test has passed 69.23% of 13 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days. |
@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.
In response to this:
The following test covers a vertical scaling scenario when a member is unhealthy and another scenario when kubelet is not working on a node.
The first test validates that scale-down happens before scale-up when the deleted member is unhealthy. CPMS is disabled so that the scale-down-first behavior can be observed in this case.
- If the CPMS is active, first disable it by deleting the CPMS custom resource.
- Remove the static pod manifest from a node and stop the kubelet on the node. This makes the member unhealthy.
- Delete the machine hosting the node in step 2.
- Verify that the member is removed and the voting member count drops to 2, confirming that scale-down happens first when a member is unhealthy.
- Restore the initial cluster state by creating a new machine (scale-up) and re-enabling CPMS.
The second test covers a vertical scaling scenario when the kubelet is not working on a node. It validates that deleting the machine hosting the node where the kubelet is stopped does not get stuck when CPMS is enabled. This is the case reported in https://issues.redhat.com/browse/OCPBUGS-17199. CPMS must be active for this scenario.
- Stop the kubelet on a node.
- Delete the machine hosting the node in step 1.
- That should prompt the ControlPlaneMachineSetOperator(CPMSO) to create a replacement machine and node for that machine index.
- The operator will first scale-up the new machine's member.
- Then scale-down the machine that is pending deletion by removing its member and deletion hook.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling
/lgtm
/hold
/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling
Job Failure Risk Analysis for sha: 03ccda6aa70fe9dd9fcc3b959a237fcf602103a8
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-gcp-ovn-etcd-scaling | High |
/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling /test e2e-aws-ovn-edge-zones
/test e2e-vsphere-ovn-etcd-scaling
Job Failure Risk Analysis for sha: df7330d85deaf4180e3cd1fd328d7c0f9371bbd8
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-gcp-ovn-etcd-scaling | High [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-etcd-scaling'] in the last 14 days. |
| pull-ci-openshift-origin-master-e2e-azure-ovn-etcd-scaling | High [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling'] in the last 14 days. |
/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling
/lgtm
/test e2e-azure-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-aws-ovn-etcd-scaling
/test e2e-aws-ovn-etcd-scaling /test e2e-gcp-ovn-etcd-scaling /test e2e-azure-ovn-etcd-scaling /test e2e-vsphere-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
Job Failure Risk Analysis for sha: 1f87a9e45ef9a396b253dc222b050bd6a1aac22f
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn | High [sig-arch] Only known images used by tests This test has passed 100.00% of 18 runs on jobs ['periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn'] in the last 14 days. |
| pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling | High [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available This test has passed 100.00% of 3 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling'] in the last 14 days. |
| pull-ci-openshift-origin-master-e2e-azure-ovn-etcd-scaling | IncompleteTests Tests for this run (106) are below the historical average (984): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial | IncompleteTests Tests for this run (26) are below the historical average (1350): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout | Low [sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel] This test has passed 76.19% of 21 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-kube-apiserver-rollout' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days. Open Bugs Component Readiness: operators should not create watch channels very often |