origin ETCD-674: WIP: Add E2E test for scaling when an unhealthy member is present

Oct 17 '24 19:10 jubittajohn

@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Oct 17 '24 19:10 openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jubittajohn

Once this PR has been reviewed and has the lgtm label, please assign hasbro17 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Oct 17 '24 19:10 openshift-ci[bot]

/test e2e-aws-ovn-etcd-scaling

Oct 18 '24 19:10 jubittajohn

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

Oct 21 '24 16:10 jubittajohn

@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

The following test covers a vertical scaling scenario in which a member is unhealthy. It validates that scale-down happens before scale-up when the deleted member is unhealthy. CPMS is disabled so that the scale-down can be observed happening first.

1. If the CPMS is active, first disable it by deleting the CPMS custom resource.

2. Remove the static pod manifest from a node and stop the kubelet on that node. This makes the member unhealthy.

3. Delete the machine hosting the node from step 2.

4. Verify that the member is removed and that the total voting member count is 2, which confirms that scale-down happens first when a member is unhealthy.

5. Restore the initial cluster state by creating a new machine (scale-up) and re-enabling the CPMS.
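
As a rough illustration of the ordering the steps above are meant to verify, here is a minimal Python sketch. It is a toy in-memory model and not the E2E test from this PR: `Member`, `ToyCluster`, `run_scenario`, and the `master-N` names are all hypothetical, and disabling and re-enabling the CPMS is left out. It only shows the expected sequence: the unhealthy member is removed first, the voting member count drops to 2, and the replacement joins afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """Hypothetical stand-in for an etcd member."""
    name: str
    healthy: bool = True
    learner: bool = False  # learners are not counted as voting members


@dataclass
class ToyCluster:
    """Toy model that records the order of scale-down and scale-up events."""
    members: list[Member] = field(default_factory=list)
    events: list[str] = field(default_factory=list)

    def voting_count(self) -> int:
        return sum(1 for m in self.members if not m.learner)

    def mark_unhealthy(self, name: str) -> None:
        for m in self.members:
            if m.name == name:
                m.healthy = False

    def delete_machine(self, name: str) -> None:
        # Only the scenario from the description is modeled: the member on
        # the deleted machine is unhealthy, so it is removed before any
        # replacement member joins.
        victim = next(m for m in self.members if m.name == name)
        if victim.healthy:
            raise ValueError("this sketch only models the unhealthy-member case")
        self.members.remove(victim)
        self.events.append(f"scale-down:{name}")

    def add_machine(self, name: str) -> None:
        self.members.append(Member(name))
        self.events.append(f"scale-up:{name}")


def run_scenario() -> ToyCluster:
    cluster = ToyCluster([Member("master-0"), Member("master-1"), Member("master-2")])
    cluster.mark_unhealthy("master-2")   # step 2: member becomes unhealthy
    cluster.delete_machine("master-2")   # step 3: delete the hosting machine
    assert cluster.voting_count() == 2   # step 4: scale-down happened first
    cluster.add_machine("master-3")      # step 5: restore with a new machine
    return cluster
```

Calling `run_scenario()` returns a cluster whose `events` list has the scale-down entry ahead of the scale-up entry, which is the property step 4 checks through the voting member count.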

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Oct 21 '24 18:10 openshift-ci-robot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

Oct 21 '24 21:10 jubittajohn

Job Failure Risk Analysis for sha: 34733b45a78f25999f732b64199b6aa57b4a58c0

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout	IncompleteTests Tests for this run (101) are below the historical average (1155): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6	IncompleteTests Tests for this run (101) are below the historical average (2074): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn	IncompleteTests Tests for this run (101) are below the historical average (2242): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

Oct 22 '24 02:10 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

Oct 22 '24 14:10 jubittajohn

Job Failure Risk Analysis for sha: 5c2b0d31440f177b19c309c71e83065e45ec921b

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout	IncompleteTests Tests for this run (101) are below the historical average (1064): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6	IncompleteTests Tests for this run (101) are below the historical average (1813): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn	IncompleteTests Tests for this run (101) are below the historical average (2078): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout	Low [Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io] This test has passed 38.46% of 13 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.

Oct 22 '24 19:10 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

Oct 22 '24 20:10 jubittajohn

Job Failure Risk Analysis for sha: b5520eeed3d3f83ad07709d8d4b1277cf0871fa1

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-gcp-ovn-etcd-scaling	High [sig-etcd] etcd leader changes are not excessive [Late] [Suite:openshift/conformance/parallel] This test has passed 100.00% of 7 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-etcd-scaling'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling	High [sig-node] Managed cluster should verify that nodes have no unexpected reboots [Late] [Suite:openshift/conformance/parallel] This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.

Oct 22 '24 23:10 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

Oct 23 '24 19:10 jubittajohn

Job Failure Risk Analysis for sha: 5d2c0c7ab5eec2b9fbc4b10263ebffe9fe6a983b

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-vsphere-ovn-etcd-scaling	High [sig-api-machinery] disruption/cache-openshift-api connection/new should be available throughout the test This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days. --- [sig-api-machinery] disruption/cache-oauth-api connection/new should be available throughout the test This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days. --- [sig-api-machinery] disruption/kube-api connection/new should be available throughout the test This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling	High [sig-architecture] platform pods in ns/openshift-etcd should not exit an excessive amount of times This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days. Open Bugs etcd platform pod exist test failing on etcd-scaling jobs --- [bz-etcd][invariant] alert/etcdMembersDown should not be at or above info This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days. Open Bugs etcd-scaling jobs failing ~60% of the time

Oct 23 '24 23:10 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

Oct 24 '24 19:10 jubittajohn

Job Failure Risk Analysis for sha: da504c15843ba72b884d6f8fcbbc39370b57bf0c

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-vsphere-ovn-etcd-scaling	High [sig-api-machinery] disruption/cache-kube-api connection/new should be available throughout the test This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days. --- [sig-api-machinery] disruption/cache-oauth-api connection/new should be available throughout the test This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days. --- [sig-api-machinery] disruption/kube-api connection/new should be available throughout the test This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days. --- [sig-api-machinery] disruption/oauth-api connection/new should be available throughout the test This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days. --- Showing 4 of 7 test results

Oct 24 '24 23:10 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

Oct 31 '24 15:10 jubittajohn

/test e2e-gcp-ovn-etcd-scaling

Nov 04 '24 16:11 jubittajohn

/test e2e-aws-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

Nov 04 '24 16:11 jubittajohn

Job Failure Risk Analysis for sha: e2407560f825fd1db05c38fbda5b86dca4056f5e

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling	High [bz-etcd][invariant] alert/etcdMembersDown should not be at or above info This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.

Nov 04 '24 20:11 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

Nov 12 '24 05:11 jubittajohn

Job Failure Risk Analysis for sha: c85351e18e0473868dca3489296ad9367b01b65a

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout	Low [Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io] This test has passed 69.23% of 13 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.

Nov 12 '24 10:11 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

Nov 12 '24 15:11 jubittajohn

@jubittajohn: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-gcp-ovn	723212be94976f6923ffb2fed157689ef9b876c6	link	true	`/test e2e-gcp-ovn`
ci/prow/e2e-openstack-ovn	723212be94976f6923ffb2fed157689ef9b876c6	link	false	`/test e2e-openstack-ovn`
ci/prow/e2e-gcp-ovn-etcd-scaling	723212be94976f6923ffb2fed157689ef9b876c6	link	false	`/test e2e-gcp-ovn-etcd-scaling`
ci/prow/e2e-aws-ovn-etcd-scaling	723212be94976f6923ffb2fed157689ef9b876c6	link	false	`/test e2e-aws-ovn-etcd-scaling`
ci/prow/e2e-aws-ovn-single-node-serial	723212be94976f6923ffb2fed157689ef9b876c6	link	false	`/test e2e-aws-ovn-single-node-serial`
ci/prow/e2e-azure-ovn-etcd-scaling	723212be94976f6923ffb2fed157689ef9b876c6	link	false	`/test e2e-azure-ovn-etcd-scaling`
ci/prow/e2e-vsphere-ovn-etcd-scaling	723212be94976f6923ffb2fed157689ef9b876c6	link	false	`/test e2e-vsphere-ovn-etcd-scaling`
ci/prow/e2e-agnostic-ovn-cmd	723212be94976f6923ffb2fed157689ef9b876c6	link	false	`/test e2e-agnostic-ovn-cmd`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Nov 12 '24 19:11 openshift-ci[bot]

Job Failure Risk Analysis for sha: 723212be94976f6923ffb2fed157689ef9b876c6

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling	High

Nov 12 '24 19:11 openshift-trt-bot

Added this test to the PR: https://github.com/openshift/origin/pull/29236

Nov 13 '24 03:11 jubittajohn
