ETCD-639: Add E2E test to check if etcd is able to block the rollout of a revision when the quorum is not safe

Open jubittajohn opened this issue 1 year ago • 41 comments

This E2E tests whether etcd is able to block the rollout of a new revision when the quorum is not safe.

The etcd static pod manifest is removed by debugging into the node to bring down an etcd instance (to simulate insufficient quorum).
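As a rough illustration of why one lost instance matters, here is a minimal sketch of the quorum arithmetic. The function names and the definition of "safe" (the cluster can lose one more healthy member and still keep quorum) are assumptions made for this example, not necessarily the exact check the operator performs.

```go
package main

import "fmt"

// quorumSize returns the minimum number of healthy voting members an etcd
// cluster of the given size needs in order to accept writes.
func quorumSize(members int) int {
	return members/2 + 1
}

// isQuorumSafe reports whether the cluster could lose one more healthy member
// and still keep quorum. This definition of "safe" is an assumption made for
// illustration only.
func isQuorumSafe(members, healthy int) bool {
	return healthy-quorumSize(members) >= 1
}

func main() {
	// Three-member cluster: first all healthy, then with one instance down.
	fmt.Println(isQuorumSafe(3, 3))
	fmt.Println(isQuorumSafe(3, 2))
}
```

With three members, taking one instance down leaves exactly the quorum of two, so any further loss (such as restarting a member for a new revision) would break quorum.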

jubittajohn avatar Jul 29 '24 15:07 jubittajohn

@jubittajohn: This pull request references ETCD-639 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

This E2E tests whether etcd is able to block the rollout of a new revision when the quorum is not safe.

The etcd static pod manifest is removed by debugging into the node to bring down an etcd instance (to simulate insufficient quorum).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Jul 29 '24 17:07 openshift-ci-robot

Job Failure Risk Analysis for sha: 2bc10d4b7c2012c2f75ccdecf761b149eddef977

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade Medium
[sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel]
This test has passed 97.66% of 128 runs on release 4.17 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

openshift-trt-bot avatar Jul 30 '24 08:07 openshift-trt-bot

/test e2e-aws-etcd-recovery

jubittajohn avatar Jul 30 '24 20:07 jubittajohn

/test e2e-aws-etcd-recovery

jubittajohn avatar Jul 31 '24 04:07 jubittajohn

@jubittajohn: This pull request references ETCD-639 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

This E2E tests whether etcd is able to block the rollout of a new revision when the quorum is not safe.

The etcd static pod manifest is removed by debugging into the node to bring down an etcd instance (to simulate insufficient quorum).

Update:

  • The test passes when run locally. However, running it on this PR with /test e2e-aws-etcd-recovery fails, possibly due to incorrect usage of oc debug. Error:

    msg: "Error running /usr/bin/oc --kubeconfig=/tmp/kubeconfig-1754906178 debug node/ip-10-0-103-92.ec2.internal -- chroot /host /bin/bash -c mkdir /var/lib/etcd-backup && mv /etc/kubernetes/manifests/etcd-pod.yaml /var/lib/etcd-backup:\nStdOut>\nerror: unable to get namespace namespaces \"ci-op-w9pisriv\" not found\nStdErr>\nerror: unable to get namespace namespaces \"ci-op-w9pisriv\" not found\nexit status 1\n",
    
  • Observed when run locally: only 2 of the 3 controllers dependent on the quorum guard go degraded; the TargetConfigController does not. The controller statuses were observed in etcd/cluster. However, the logs contain entries showing all three controllers going degraded.
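One possible reading of the error above is that the debug pod is being created in the kubeconfig's current namespace (the ci-op-* one), which does not exist on the test cluster. Below is a hedged sketch of how the argument list could be built with an explicit target namespace. The helper name debugNodeArgs, the example paths, and the choice of "default" as the namespace are assumptions for illustration; whether --to-namespace resolves this particular failure has not been verified here.

```go
package main

import (
	"fmt"
	"strings"
)

// debugNodeArgs builds the argument list for an `oc debug node/...` call that
// runs a shell script chrooted into the host filesystem. targetNamespace is
// passed via --to-namespace so that the debug pod is not created in the
// kubeconfig's current namespace (assumption: that is what pointed at the
// missing ci-op-* namespace). The script is kept as a single argument so that
// "&&" is interpreted by the bash running on the node.
func debugNodeArgs(kubeconfig, node, targetNamespace, script string) []string {
	return []string{
		"--kubeconfig=" + kubeconfig,
		"debug", "node/" + node,
		"--to-namespace=" + targetNamespace,
		"--", "chroot", "/host", "/bin/bash", "-c", script,
	}
}

func main() {
	// Hypothetical kubeconfig path and node name, for illustration only.
	args := debugNodeArgs(
		"/tmp/kubeconfig-example",
		"example-node",
		"default",
		"mkdir /var/lib/etcd-backup && mv /etc/kubernetes/manifests/etcd-pod.yaml /var/lib/etcd-backup",
	)
	fmt.Println("oc " + strings.Join(args, " "))
}
```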

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Jul 31 '24 05:07 openshift-ci-robot

/test e2e-aws-etcd-recovery

jubittajohn avatar Jul 31 '24 13:07 jubittajohn

Job Failure Risk Analysis for sha: 67810e49f897f88ea383d9d265fce04ea913dd09

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-etcd-recovery IncompleteTests
Tests for this run (103) are below the historical average (480): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

openshift-trt-bot avatar Jul 31 '24 16:07 openshift-trt-bot

/test e2e-aws-etcd-recovery

jubittajohn avatar Aug 01 '24 03:08 jubittajohn

@jubittajohn: This pull request references ETCD-639 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

This E2E tests whether etcd is able to block the rollout of a new revision when the quorum is not safe.

The etcd static pod manifest is removed by debugging into the node to bring down an etcd instance (to simulate insufficient quorum).

Update:

  • The check for whether the TargetConfigController is degraded is commented out at the moment. It can be uncommented once the changes in openshift/cluster-etcd-operator#1309 are merged. (This controller, which is related to the quorum guard, initially did not go degraded when quorum was lost. With the fix, the controller goes degraded as expected when quorum is lost.)
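A minimal sketch of the kind of check described above, assuming each controller reports a status condition named "<Controller>Degraded" with a string status. The condition type, the naming convention, and the controller names other than TargetConfigController are hypothetical stand-ins, not the actual test code or API types.

```go
package main

import "fmt"

// condition is a pared-down, hypothetical stand-in for an operator status
// condition; the real API type carries more fields.
type condition struct {
	Type   string
	Status string
}

// notDegraded returns the controllers whose "<name>Degraded" condition is not
// reported with status "True". The "<name>Degraded" naming is an assumption.
func notDegraded(conds []condition, controllers []string) []string {
	degraded := make(map[string]bool, len(conds))
	for _, c := range conds {
		if c.Status == "True" {
			degraded[c.Type] = true
		}
	}
	var missing []string
	for _, name := range controllers {
		if !degraded[name+"Degraded"] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Made-up sample data: one controller degraded, one not.
	conds := []condition{
		{Type: "ExampleControllerDegraded", Status: "True"},
		{Type: "TargetConfigControllerDegraded", Status: "False"},
	}
	fmt.Println(notDegraded(conds, []string{"ExampleController", "TargetConfigController"}))
}
```

A test could fail when the returned slice is non-empty after quorum is lost, which keeps the TargetConfigController check easy to toggle by adding or removing its name from the list.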

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Aug 01 '24 04:08 openshift-ci-robot

/test e2e-aws-etcd-recovery

jubittajohn avatar Aug 01 '24 05:08 jubittajohn

Job Failure Risk Analysis for sha: 5760eb87db226fb6be6258dad6f12d44280e559a

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade High
[sig-instrumentation] disruption/metrics-api connection/new should be available throughout the test
This test has passed 100.00% of 1010 runs on jobs ['periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade'] in the last 14 days.
---
[sig-instrumentation] disruption/metrics-api connection/reused should be available throughout the test
This test has passed 100.00% of 1010 runs on jobs ['periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade High
[sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel]
This test has passed 98.07% of 207 runs on release 4.17 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.
---
[sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator
This test has passed 98.55% of 207 runs on release 4.17 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

openshift-trt-bot avatar Aug 01 '24 08:08 openshift-trt-bot

/test e2e-aws-etcd-recovery

jubittajohn avatar Aug 01 '24 19:08 jubittajohn

/test e2e-aws-etcd-recovery

jubittajohn avatar Aug 02 '24 14:08 jubittajohn

Job Failure Risk Analysis for sha: 17bf685572b7757e34d93833478e8c02accefafa

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-etcd-recovery High
[sig-node] static pods should start after being created
This test has passed 99.51% of 5555 runs on release 4.17 [Overall] in the last week.

Open Bugs
etcd recovery test has static pod startup failure
Static pod controller pods sometimes fail to start

openshift-trt-bot avatar Aug 02 '24 17:08 openshift-trt-bot

/test e2e-aws-etcd-recovery

jubittajohn avatar Aug 04 '24 22:08 jubittajohn

Job Failure Risk Analysis for sha: a64240a1993c5a782021d66b314bb26951a0048c

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-etcd-recovery High
[sig-node] static pods should start after being created
This test has passed 99.38% of 5761 runs on release 4.17 [Overall] in the last week.

Open Bugs
etcd recovery test has static pod startup failure
Static pod controller pods sometimes fail to start
---
[sig-arch] events should not repeat pathologically
This test has passed 99.14% of 116 runs on release 4.17 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.
pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade IncompleteTests
Tests for this run (20) are below the historical average (914): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade IncompleteTests
Tests for this run (20) are below the historical average (946): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-gcp-ovn-builds IncompleteTests
Tests for this run (19) are below the historical average (982): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-gcp-ovn IncompleteTests
Tests for this run (19) are below the historical average (2221): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-gcp-csi IncompleteTests
Tests for this run (19) are below the historical average (947): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

openshift-trt-bot avatar Aug 05 '24 02:08 openshift-trt-bot

/test e2e-aws-etcd-recovery

jubittajohn avatar Aug 05 '24 04:08 jubittajohn

Job Failure Risk Analysis for sha: 8ef655ed8c28e31c9857767eae14de7011da5adc

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-etcd-recovery High
[bz-Monitoring] clusteroperator/monitoring should not change condition/Available
This test has passed 98.13% of 5516 runs on release 4.17 [Overall] in the last week.

Open Bugs
monitoring ClusterOperator should not blip Available=Unknown on client rate limiter
---
[sig-node] static pods should start after being created
This test has passed 99.37% of 5521 runs on release 4.17 [Overall] in the last week.

Open Bugs
etcd recovery test has static pod startup failure
Static pod controller pods sometimes fail to start
---
[sig-arch] events should not repeat pathologically
This test has passed 99.12% of 113 runs on release 4.17 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.
pull-ci-openshift-origin-master-e2e-openstack-ovn IncompleteTests
Tests for this run (17) are below the historical average (2031): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

openshift-trt-bot avatar Aug 05 '24 08:08 openshift-trt-bot

/test e2e-aws-etcd-recovery

jubittajohn avatar Aug 06 '24 02:08 jubittajohn

Job Failure Risk Analysis for sha: 1545c66cd0f4526e187d40ed02a725792dfbfaba

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-etcd-recovery High
[sig-node] static pods should start after being created
This test has passed 99.33% of 5793 runs on release 4.17 [Overall] in the last week.

Open Bugs
etcd recovery test has static pod startup failure
Static pod controller pods sometimes fail to start

openshift-trt-bot avatar Aug 06 '24 06:08 openshift-trt-bot

/test e2e-aws-etcd-recovery

jubittajohn avatar Aug 21 '24 20:08 jubittajohn

Job Failure Risk Analysis for sha: 47e90aeb07f9226753321fd418562292330e44e8

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-etcd-recovery High
[sig-node] static pods should start after being created
This test has passed 99.49% of 4866 runs on release 4.18 [Overall] in the last week.

Open Bugs
Static pod controller pods sometimes fail to start
pull-ci-openshift-origin-master-e2e-aws-ovn-fips Medium
[sig-node][apigroup:config.openshift.io] CPU Partitioning node validation should have correct cpuset and cpushare set in crio containers [Suite:openshift/conformance/parallel]
This test has passed 93.55% of 31 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-fips'] in the last 14 days.

Open Bugs
CPU partitioning node test perma-failing
pull-ci-openshift-origin-master-e2e-aws-ovn-cgroupsv2 Medium
[sig-node][apigroup:config.openshift.io] CPU Partitioning node validation should have correct cpuset and cpushare set in crio containers [Suite:openshift/conformance/parallel]
This test has passed 91.43% of 35 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-cgroupsv2'] in the last 14 days.

Open Bugs
CPU partitioning node test perma-failing
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout Low
[Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
This test has passed 50.00% of 22 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.

openshift-trt-bot avatar Aug 22 '24 00:08 openshift-trt-bot

/lgtm

tjungblu avatar Aug 22 '24 15:08 tjungblu

@jubittajohn: This pull request references ETCD-639 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

This E2E tests whether etcd is able to block the rollout of a new revision when the quorum is not safe.

The etcd static pod manifest is removed by debugging into the node to bring down an etcd instance (to simulate insufficient quorum).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Aug 22 '24 16:08 openshift-ci-robot