ETCD-674: WIP: Add E2E test for scaling when an unhealthy member is present

Open jubittajohn opened this issue 1 year ago • 4 comments

jubittajohn avatar Oct 17 '24 19:10 jubittajohn

@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Oct 17 '24 19:10 openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jubittajohn
Once this PR has been reviewed and has the lgtm label, please assign hasbro17 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Oct 17 '24 19:10 openshift-ci[bot]

/test e2e-aws-ovn-etcd-scaling

jubittajohn avatar Oct 18 '24 19:10 jubittajohn

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Oct 21 '24 16:10 jubittajohn

@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

The following test covers a vertical scaling scenario when a member is unhealthy. This test validates that scale-down happens before scale-up if the deleted member is unhealthy. CPMS is disabled to observe that scale-down happens first in this case.

  1. If the CPMS is active, first disable it by deleting the CPMS custom resource.
  2. Remove the static pod manifest from a node and stop the kubelet on that node. This makes the member unhealthy.
  3. Delete the machine hosting the node from step 2.
  4. Verify that the member was removed and that the total voting member count is 2, which confirms that scale-down happens first when a member is unhealthy.
  5. Restore the initial cluster state by creating a new machine (scale-up) and re-enabling CPMS.
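
The verification in step 4 can be pictured as a polling check over the etcd membership. The following is only a minimal Python sketch of that idea, not the test added by this PR: `Member`, `FakeCluster`, `wait_for_scale_down`, and the `master-N` names are hypothetical, and the fake cluster stands in for whatever client the real test uses to list members. It relies on one fact about etcd: learner members do not vote.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Member:
    """Hypothetical stand-in for an etcd member record."""
    name: str
    is_learner: bool = False  # etcd learners are non-voting members


def voting_members(members: List[Member]) -> List[Member]:
    """Return only the members that count toward quorum."""
    return [m for m in members if not m.is_learner]


def wait_for_scale_down(
    list_members: Callable[[], List[Member]],
    removed_name: str,
    expected_voting: int = 2,
    timeout_s: float = 5.0,
    interval_s: float = 0.01,
) -> bool:
    """Poll until removed_name is gone and exactly expected_voting voters remain.

    Returns False if the condition is not met before the timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        members = list_members()
        names = {m.name for m in members}
        if removed_name not in names and len(voting_members(members)) == expected_voting:
            return True
        time.sleep(interval_s)
    return False


class FakeCluster:
    """Hypothetical membership source that replays snapshots in order,
    then keeps returning the last one."""

    def __init__(self, snapshots: List[List[Member]]):
        self._snapshots = list(snapshots)
        self._calls = 0

    def list_members(self) -> List[Member]:
        idx = min(self._calls, len(self._snapshots) - 1)
        self._calls += 1
        return self._snapshots[idx]


# Assumed scenario: master-2 is the unhealthy member whose machine was deleted.
cluster = FakeCluster([
    [Member("master-0"), Member("master-1"), Member("master-2")],  # before removal
    [Member("master-0"), Member("master-1")],                      # after scale-down
])
scaled_down = wait_for_scale_down(cluster.list_members, removed_name="master-2")
```

If a replacement member joined before the unhealthy one was removed, the membership would never show the removed member gone with two voters remaining, and the check would time out and return `False`.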

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Oct 21 '24 18:10 openshift-ci-robot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Oct 21 '24 21:10 jubittajohn

Job Failure Risk Analysis for sha: 34733b45a78f25999f732b64199b6aa57b4a58c0

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout IncompleteTests
Tests for this run (101) are below the historical average (1155): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6 IncompleteTests
Tests for this run (101) are below the historical average (2074): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn IncompleteTests
Tests for this run (101) are below the historical average (2242): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

openshift-trt-bot avatar Oct 22 '24 02:10 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Oct 22 '24 14:10 jubittajohn

Job Failure Risk Analysis for sha: 5c2b0d31440f177b19c309c71e83065e45ec921b

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout IncompleteTests
Tests for this run (101) are below the historical average (1064): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6 IncompleteTests
Tests for this run (101) are below the historical average (1813): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn IncompleteTests
Tests for this run (101) are below the historical average (2078): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout Low
[Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
This test has passed 38.46% of 13 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.

openshift-trt-bot avatar Oct 22 '24 19:10 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Oct 22 '24 20:10 jubittajohn

Job Failure Risk Analysis for sha: b5520eeed3d3f83ad07709d8d4b1277cf0871fa1

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-gcp-ovn-etcd-scaling High
[sig-etcd] etcd leader changes are not excessive [Late] [Suite:openshift/conformance/parallel]
This test has passed 100.00% of 7 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-etcd-scaling'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling High
[sig-node] Managed cluster should verify that nodes have no unexpected reboots [Late] [Suite:openshift/conformance/parallel]
This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.

openshift-trt-bot avatar Oct 22 '24 23:10 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Oct 23 '24 19:10 jubittajohn

Job Failure Risk Analysis for sha: 5d2c0c7ab5eec2b9fbc4b10263ebffe9fe6a983b

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-vsphere-ovn-etcd-scaling High
[sig-api-machinery] disruption/cache-openshift-api connection/new should be available throughout the test
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
---
[sig-api-machinery] disruption/cache-oauth-api connection/new should be available throughout the test
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
---
[sig-api-machinery] disruption/kube-api connection/new should be available throughout the test
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling High
[sig-architecture] platform pods in ns/openshift-etcd should not exit an excessive amount of times
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.

Open Bugs
etcd platform pod exist test failing on etcd-scaling jobs
---
[bz-etcd][invariant] alert/etcdMembersDown should not be at or above info
This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.

Open Bugs
etcd-scaling jobs failing ~60% of the time

openshift-trt-bot avatar Oct 23 '24 23:10 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Oct 24 '24 19:10 jubittajohn

Job Failure Risk Analysis for sha: da504c15843ba72b884d6f8fcbbc39370b57bf0c

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-vsphere-ovn-etcd-scaling High
[sig-api-machinery] disruption/cache-kube-api connection/new should be available throughout the test
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
---
[sig-api-machinery] disruption/cache-oauth-api connection/new should be available throughout the test
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
---
[sig-api-machinery] disruption/kube-api connection/new should be available throughout the test
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
---
[sig-api-machinery] disruption/oauth-api connection/new should be available throughout the test
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
---
Showing 4 of 7 test results

openshift-trt-bot avatar Oct 24 '24 23:10 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Oct 31 '24 15:10 jubittajohn

/test e2e-gcp-ovn-etcd-scaling

jubittajohn avatar Nov 04 '24 16:11 jubittajohn

/test e2e-aws-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Nov 04 '24 16:11 jubittajohn

Job Failure Risk Analysis for sha: e2407560f825fd1db05c38fbda5b86dca4056f5e

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling High
[bz-etcd][invariant] alert/etcdMembersDown should not be at or above info
This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.

openshift-trt-bot avatar Nov 04 '24 20:11 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Nov 12 '24 05:11 jubittajohn

Job Failure Risk Analysis for sha: c85351e18e0473868dca3489296ad9367b01b65a

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout Low
[Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
This test has passed 69.23% of 13 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.

openshift-trt-bot avatar Nov 12 '24 10:11 openshift-trt-bot

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

jubittajohn avatar Nov 12 '24 15:11 jubittajohn

@jubittajohn: The following tests failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-ovn | 723212be94976f6923ffb2fed157689ef9b876c6 | link | true | /test e2e-gcp-ovn |
| ci/prow/e2e-openstack-ovn | 723212be94976f6923ffb2fed157689ef9b876c6 | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-gcp-ovn-etcd-scaling | 723212be94976f6923ffb2fed157689ef9b876c6 | link | false | /test e2e-gcp-ovn-etcd-scaling |
| ci/prow/e2e-aws-ovn-etcd-scaling | 723212be94976f6923ffb2fed157689ef9b876c6 | link | false | /test e2e-aws-ovn-etcd-scaling |
| ci/prow/e2e-aws-ovn-single-node-serial | 723212be94976f6923ffb2fed157689ef9b876c6 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/e2e-azure-ovn-etcd-scaling | 723212be94976f6923ffb2fed157689ef9b876c6 | link | false | /test e2e-azure-ovn-etcd-scaling |
| ci/prow/e2e-vsphere-ovn-etcd-scaling | 723212be94976f6923ffb2fed157689ef9b876c6 | link | false | /test e2e-vsphere-ovn-etcd-scaling |
| ci/prow/e2e-agnostic-ovn-cmd | 723212be94976f6923ffb2fed157689ef9b876c6 | link | false | /test e2e-agnostic-ovn-cmd |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Nov 12 '24 19:11 openshift-ci[bot]

Job Failure Risk Analysis for sha: 723212be94976f6923ffb2fed157689ef9b876c6

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling High

openshift-trt-bot avatar Nov 12 '24 19:11 openshift-trt-bot

Added this test to the PR: https://github.com/openshift/origin/pull/29236

jubittajohn avatar Nov 13 '24 03:11 jubittajohn