ETCD-639: Add E2E test to check if etcd is able to block the rollout of a revision when the quorum is not safe
This E2E test checks whether etcd blocks the rollout of a new revision when the quorum is not safe.
To simulate insufficient quorum, the test brings down one etcd instance by debugging into its node and removing the etcd static pod manifest.
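For reviewers, the quorum arithmetic behind "not safe" can be sketched as below. This is only an illustration, assuming a three-member control plane; the function names `quorumSize` and `safeToRollOut` are hypothetical and are not taken from the operator's code.

```go
package main

import "fmt"

// quorumSize returns the minimum number of voting members that must be
// available for an etcd cluster of the given size to make progress.
func quorumSize(members int) int {
	return members/2 + 1
}

// safeToRollOut reports whether one more member could be taken down
// (as a revision rollout does when it restarts a static pod) without
// dropping below quorum. Hypothetical helper for illustration only.
func safeToRollOut(members, healthy int) bool {
	return healthy-1 >= quorumSize(members)
}

func main() {
	// Assumption: a three-member etcd cluster.
	fmt.Println("quorum for 3 members:", quorumSize(3))
	fmt.Println("safe with 3 healthy:", safeToRollOut(3, 3))
	fmt.Println("safe with 2 healthy:", safeToRollOut(3, 2))
}
```

With one of three members down, the cluster still has quorum but has no remaining fault tolerance, which is the state in which the test expects the rollout to be blocked.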
@jubittajohn: This pull request references ETCD-639 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.
In response to this:
This E2E test checks whether etcd blocks the rollout of a new revision when the quorum is not safe.
To simulate insufficient quorum, the test brings down one etcd instance by debugging into its node and removing the etcd static pod manifest.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Job Failure Risk Analysis for sha: 2bc10d4b7c2012c2f75ccdecf761b149eddef977
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade | Medium [sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel] This test has passed 97.66% of 128 runs on release 4.17 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week. |
/test e2e-aws-etcd-recovery
/test e2e-aws-etcd-recovery
Update:
- The test runs successfully when run locally. However, running it on this PR using `/test e2e-aws-etcd-recovery` errors out, possibly due to incorrect usage of `oc debug`. Error: `msg: "Error running /usr/bin/oc --kubeconfig=/tmp/kubeconfig-1754906178 debug node/ip-10-0-103-92.ec2.internal -- chroot /host /bin/bash -c mkdir /var/lib/etcd-backup && mv /etc/kubernetes/manifests/etcd-pod.yaml /var/lib/etcd-backup:\nStdOut>\nerror: unable to get namespace namespaces \"ci-op-w9pisriv\" not found\nStdErr>\nerror: unable to get namespace namespaces \"ci-op-w9pisriv\" not found\nexit status 1\n",`
- Observed when run locally: only 2 of the 3 controllers dependent on the quorum guard go degraded. The `TargetConfigController` does not go degraded. The controller statuses were observed in `etcd/cluster`, but the logs have entries showing all three controllers going degraded.
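One plausible reading of the error above is that `oc debug` tried to create its debug pod in the kubeconfig's current namespace (`ci-op-w9pisriv`), which does not exist on the cluster under test. A minimal sketch of building the same invocation with an explicit target namespace follows. The helper name `debugNodeArgs`, the kubeconfig path, and the choice of the `default` namespace are assumptions for illustration, not what the PR does. `--to-namespace` is an existing `oc debug` flag.

```go
package main

import (
	"fmt"
	"strings"
)

// debugNodeArgs builds the argument list for an `oc debug node/...` call
// that runs a shell script on the host. The whole script is passed as a
// single element so that /bin/bash -c receives it as one argument.
// Hypothetical helper for illustration only.
func debugNodeArgs(kubeconfig, namespace, node, script string) []string {
	return []string{
		"--kubeconfig=" + kubeconfig,
		"debug",
		"--to-namespace=" + namespace, // create the debug pod in a namespace that exists
		"node/" + node,
		"--",
		"chroot", "/host", "/bin/bash", "-c", script,
	}
}

func main() {
	script := "mkdir /var/lib/etcd-backup && " +
		"mv /etc/kubernetes/manifests/etcd-pod.yaml /var/lib/etcd-backup"
	// Assumptions: kubeconfig path and the "default" namespace are placeholders.
	args := debugNodeArgs("/tmp/kubeconfig", "default", "ip-10-0-103-92.ec2.internal", script)
	fmt.Println("oc " + strings.Join(args, " "))
}
```

The resulting slice could be handed to `exec.Command("oc", args...)` or to whatever CLI wrapper the test suite already uses.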
/test e2e-aws-etcd-recovery
Job Failure Risk Analysis for sha: 67810e49f897f88ea383d9d265fce04ea913dd09
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-aws-etcd-recovery | IncompleteTests Tests for this run (103) are below the historical average (480): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
/test e2e-aws-etcd-recovery
Update:
- The check for whether the `TargetConfigController` is degraded is commented out at the moment. It can be uncommented after the changes in openshift/cluster-etcd-operator#1309 merge. (This quorum-guard-related controller initially did not go degraded when quorum was lost. With the fix, the controller goes degraded as expected when quorum is lost.)
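The degraded check described above could look roughly like the sketch below. It assumes the operator status exposes one `<ControllerName>Degraded` condition per controller, as is the usual convention for OpenShift operators. The `condition` type, the `notDegraded` helper, and the `ExampleController` name are hypothetical stand-ins, not the PR's code.

```go
package main

import "fmt"

// condition mirrors the two fields of an operator status condition
// that matter for this check.
type condition struct {
	Type   string
	Status string
}

// notDegraded returns the controllers whose "<name>Degraded" condition
// is either absent or not "True". Hypothetical helper for illustration.
func notDegraded(conds []condition, controllers []string) []string {
	degraded := map[string]bool{}
	for _, c := range conds {
		if c.Status == "True" {
			degraded[c.Type] = true
		}
	}
	var missing []string
	for _, name := range controllers {
		if !degraded[name+"Degraded"] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Assumption: sample conditions resembling the locally observed state,
	// where TargetConfigController did not report Degraded=True.
	conds := []condition{
		{Type: "ExampleControllerDegraded", Status: "True"},
		{Type: "TargetConfigControllerDegraded", Status: "False"},
	}
	fmt.Println(notDegraded(conds, []string{"ExampleController", "TargetConfigController"}))
}
```

Once the fix is consumed, adding `TargetConfigController` to the expected list would make the test fail if that controller again stays healthy while quorum is lost.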
/test e2e-aws-etcd-recovery
Job Failure Risk Analysis for sha: 5760eb87db226fb6be6258dad6f12d44280e559a
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade | High [sig-instrumentation] disruption/metrics-api connection/new should be available throughout the test This test has passed 100.00% of 1010 runs on jobs ['periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade'] in the last 14 days. --- [sig-instrumentation] disruption/metrics-api connection/reused should be available throughout the test This test has passed 100.00% of 1010 runs on jobs ['periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade'] in the last 14 days. |
| pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade | High [sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel] This test has passed 98.07% of 207 runs on release 4.17 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week. --- [sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator This test has passed 98.55% of 207 runs on release 4.17 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week. |
/test e2e-aws-etcd-recovery
/test e2e-aws-etcd-recovery
Job Failure Risk Analysis for sha: 17bf685572b7757e34d93833478e8c02accefafa
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-aws-etcd-recovery | High [sig-node] static pods should start after being created This test has passed 99.51% of 5555 runs on release 4.17 [Overall] in the last week. Open Bugs etcd recovery test has static pod startup failure Static pod controller pods sometimes fail to start |
/test e2e-aws-etcd-recovery
Job Failure Risk Analysis for sha: a64240a1993c5a782021d66b314bb26951a0048c
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-aws-etcd-recovery | High [sig-node] static pods should start after being created This test has passed 99.38% of 5761 runs on release 4.17 [Overall] in the last week. Open Bugs etcd recovery test has static pod startup failure Static pod controller pods sometimes fail to start --- [sig-arch] events should not repeat pathologically This test has passed 99.14% of 116 runs on release 4.17 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week. |
| pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade | IncompleteTests Tests for this run (20) are below the historical average (914): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade | IncompleteTests Tests for this run (20) are below the historical average (946): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-master-e2e-gcp-ovn-builds | IncompleteTests Tests for this run (19) are below the historical average (982): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-master-e2e-gcp-ovn | IncompleteTests Tests for this run (19) are below the historical average (2221): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-master-e2e-gcp-csi | IncompleteTests Tests for this run (19) are below the historical average (947): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
/test e2e-aws-etcd-recovery
Job Failure Risk Analysis for sha: 8ef655ed8c28e31c9857767eae14de7011da5adc
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-aws-etcd-recovery | High [bz-Monitoring] clusteroperator/monitoring should not change condition/Available This test has passed 98.13% of 5516 runs on release 4.17 [Overall] in the last week. Open Bugs monitoring ClusterOperator should not blip Available=Unknown on client rate limiter --- [sig-node] static pods should start after being created This test has passed 99.37% of 5521 runs on release 4.17 [Overall] in the last week. Open Bugs etcd recovery test has static pod startup failure Static pod controller pods sometimes fail to start --- [sig-arch] events should not repeat pathologically This test has passed 99.12% of 113 runs on release 4.17 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week. |
| pull-ci-openshift-origin-master-e2e-openstack-ovn | IncompleteTests Tests for this run (17) are below the historical average (2031): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
/test e2e-aws-etcd-recovery
Job Failure Risk Analysis for sha: 1545c66cd0f4526e187d40ed02a725792dfbfaba
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-aws-etcd-recovery | High [sig-node] static pods should start after being created This test has passed 99.33% of 5793 runs on release 4.17 [Overall] in the last week. Open Bugs etcd recovery test has static pod startup failure Static pod controller pods sometimes fail to start |
/test e2e-aws-etcd-recovery
Job Failure Risk Analysis for sha: 47e90aeb07f9226753321fd418562292330e44e8
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-aws-etcd-recovery | High [sig-node] static pods should start after being created This test has passed 99.49% of 4866 runs on release 4.18 [Overall] in the last week. Open Bugs Static pod controller pods sometimes fail to start |
| pull-ci-openshift-origin-master-e2e-aws-ovn-fips | Medium [sig-node][apigroup:config.openshift.io] CPU Partitioning node validation should have correct cpuset and cpushare set in crio containers [Suite:openshift/conformance/parallel] This test has passed 93.55% of 31 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-fips'] in the last 14 days. Open Bugs CPU partitioning node test perma-failing |
| pull-ci-openshift-origin-master-e2e-aws-ovn-cgroupsv2 | Medium [sig-node][apigroup:config.openshift.io] CPU Partitioning node validation should have correct cpuset and cpushare set in crio containers [Suite:openshift/conformance/parallel] This test has passed 91.43% of 35 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-cgroupsv2'] in the last 14 days. Open Bugs CPU partitioning node test perma-failing |
| pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout | Low [Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io] This test has passed 50.00% of 22 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days. |
/lgtm
@jubittajohn: This pull request references ETCD-639 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.