OCPBUGS-38859: add a test (that flakes) to detect faulty load balancer

Open tkashem opened this issue 1 year ago • 10 comments

tkashem avatar Aug 26 '24 01:08 tkashem

Job Failure Risk Analysis for sha: 8afd3c8e370fb321e5f320d78569d51f6e9e3b56

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout Low
operator conditions kube-apiserver
This test has passed 68.75% of 16 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days.
---
[sig-sippy] tests should finish with healthy operators
This test has passed 68.75% of 16 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days.

openshift-trt-bot avatar Aug 26 '24 17:08 openshift-trt-bot

Job Failure Risk Analysis for sha: f5ea35e2f6aaeed00ae535a6f64f2effd468805c

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout Low
[sig-sippy] tests should finish with healthy operators
This test has passed 70.59% of 17 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days.
---
operator conditions kube-apiserver
This test has passed 70.59% of 17 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days.

openshift-trt-bot avatar Aug 26 '24 22:08 openshift-trt-bot

@tkashem: This pull request references Jira Issue OCPBUGS-38859, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Aug 27 '24 15:08 openshift-ci-robot

/label acknowledge-critical-fixes-only

(It does not fail yet; it only flakes, so that we can measure and fix the problem. Once the fixes are made, we can change it to a test that fails.)

tkashem avatar Aug 27 '24 15:08 tkashem

/lgtm

I would just check that you can find passes and fails in the rehearsals once they're in, but it looks good now.

dgoodwin avatar Aug 27 '24 17:08 dgoodwin

/hold (until we see some passes in rehearsals)

tkashem avatar Aug 27 '24 18:08 tkashem

/retest

tkashem avatar Aug 27 '24 19:08 tkashem

Job Failure Risk Analysis for sha: 8e36b1b0f1b685a9ccfd5b1dfeca6089d1725d0c

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout Low
[sig-sippy] tests should finish with healthy operators
This test has passed 70.59% of 17 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days.
---
operator conditions kube-apiserver
This test has passed 70.59% of 17 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days.

openshift-trt-bot avatar Aug 28 '24 00:08 openshift-trt-bot

/test e2e-metal-ipi-ovn-kube-apiserver-rollout

tkashem avatar Aug 28 '24 19:08 tkashem

Passed: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29034/pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout/1828905534703538176

image

There is one client error interval, but it does not overlap with any kube-apiserver shutdown interval.

junit test output under "Tests Passed":

: [sig-apimachinery] new and reused connections to kube-apiserver should be handled gracefully during the graceful termination process

Skipped: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29034/pull-ci-openshift-origin-master-e2e-aws-ovn-cgroupsv2/1828905495277080576

There is no kube-apiserver shutdown interval, and the test log says:

I0828 23:15:25.753028 309 monitortest.go:70] monitor[faulty-load-balancer]: found 0 interesting intervals, kube-apiserver shutdown interval count: 0

junit test output:

: [sig-apimachinery] new and reused connections to kube-apiserver should be handled gracefully during the graceful termination process

Reason: No kube-apiserver shutdown interval found

Flake: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29034/pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout/1828905505343410176

image

monitor tests log:

I0829 00:35:50.831077 296 monitortest.go:70] monitor[faulty-load-balancer]: found 29 interesting intervals, kube-apiserver shutdown interval count: 14

junit output:

: [sig-apimachinery] new and reused connections to kube-apiserver should be handled gracefully during the graceful termination process 
Run #0: Failed 0s
{  
client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 28 23:34:42.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-21-158.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 28 23:35:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 28 23:38:39.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-123-235.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 28 23:39:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 28 23:42:34.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-124-51.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 28 23:43:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 28 23:47:15.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-21-158.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 28 23:47:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 28 23:51:15.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-123-235.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 28 23:51:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 28 23:55:07.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-124-51.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 28 23:55:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 28 23:59:51.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-21-158.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:00:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 29 00:03:42.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-123-235.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:04:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 29 00:07:40.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-124-51.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:08:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 29 00:12:33.000 - 132s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-21-158.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:13:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 29 00:16:32.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-123-235.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:17:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 29 00:20:27.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-124-51.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:21:21.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 29 00:25:19.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-21-158.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:26:21.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 29 00:29:21.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-123-235.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:30:21.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s
}
Run #1: Passed 
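The three results reported in this comment (passed, skipped, flake) suggest a decision along the following lines. This is a rough sketch under assumptions, not the logic from `monitortest.go`: the `outcome` type and the `classify` function are hypothetical, and only the counts for the skipped and flaking runs (0 and 14 shutdown intervals) come from the logs above.

```go
package main

import "fmt"

// outcome is a hypothetical label for how the test result is reported.
type outcome string

const (
	passed  outcome = "passed"
	skipped outcome = "skipped"
	flaked  outcome = "flaked"
)

// classify mirrors the three results described in this comment:
//   - no kube-apiserver shutdown interval: the test is skipped
//   - a client error overlaps a shutdown interval: the test flakes
//     (reported as one failed run followed by one passed run)
//   - otherwise: the test passes
func classify(shutdownIntervals, overlappingClientErrors int) outcome {
	switch {
	case shutdownIntervals == 0:
		return skipped
	case overlappingClientErrors > 0:
		return flaked
	default:
		return passed
	}
}

func main() {
	fmt.Println(classify(0, 0))   // e2e-aws-ovn-cgroupsv2: shutdown interval count 0
	fmt.Println(classify(14, 14)) // e2e-aws-ovn-kube-apiserver-rollout: 14 shutdowns, 14 overlapping errors
	fmt.Println(classify(3, 0))   // made-up counts for a run with no overlap
}
```

Reporting an overlap as a flake rather than a failure keeps the job green while still recording how often the load balancer misbehaves.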

tkashem avatar Aug 29 '24 12:08 tkashem

/hold cancel

tkashem avatar Aug 29 '24 13:08 tkashem

/retest-required

tkashem avatar Aug 29 '24 13:08 tkashem

/lgtm

dgoodwin avatar Aug 29 '24 14:08 dgoodwin

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, sanchezl, tkashem

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Aug 29 '24 14:08 openshift-ci[bot]

/retest-required

Remaining retests: 0 against base HEAD 458e1eae65b16ca8d4f12f387ec2ae34a9ba7591 and 2 for PR HEAD cbc62c55589992d00643455f45eebf2fe798bd8f in total

openshift-ci-robot avatar Aug 29 '24 15:08 openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD a993c78e79f552ce8b6f5ff4c6f66ae9fbf8a0d4 and 2 for PR HEAD cbc62c55589992d00643455f45eebf2fe798bd8f in total

openshift-ci-robot avatar Aug 29 '24 17:08 openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD a993c78e79f552ce8b6f5ff4c6f66ae9fbf8a0d4 and 2 for PR HEAD cbc62c55589992d00643455f45eebf2fe798bd8f in total

openshift-ci-robot avatar Aug 29 '24 22:08 openshift-ci-robot

/retest-required

tkashem avatar Aug 30 '24 01:08 tkashem

@tkashem: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node cbc62c55589992d00643455f45eebf2fe798bd8f link false /test e2e-aws-ovn-single-node
ci/prow/e2e-aws-ovn-ipsec-serial cbc62c55589992d00643455f45eebf2fe798bd8f link false /test e2e-aws-ovn-ipsec-serial
ci/prow/e2e-aws-ovn-single-node-upgrade cbc62c55589992d00643455f45eebf2fe798bd8f link false /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-gcp-ovn-rt-upgrade cbc62c55589992d00643455f45eebf2fe798bd8f link false /test e2e-gcp-ovn-rt-upgrade
ci/prow/e2e-aws-ovn-upgrade cbc62c55589992d00643455f45eebf2fe798bd8f link false /test e2e-aws-ovn-upgrade
ci/prow/e2e-aws-ovn-cgroupsv2 cbc62c55589992d00643455f45eebf2fe798bd8f link false /test e2e-aws-ovn-cgroupsv2

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Aug 30 '24 05:08 openshift-ci[bot]

Job Failure Risk Analysis for sha: cbc62c55589992d00643455f45eebf2fe798bd8f

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-ipsec-serial High
[sig-arch] events should not repeat pathologically for ns/openshift-authentication-operator
This test has passed 100.00% of 34 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
---
[bz-Monitoring] clusteroperator/monitoring should not change condition/Available
This test has passed 100.00% of 34 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.

Open Bugs
monitoring ClusterOperator should not blip Available=Unknown on client rate limiter

openshift-trt-bot avatar Aug 30 '24 06:08 openshift-trt-bot

/retest-required

Remaining retests: 0 against base HEAD 8d619a5336c57e4f51efb731a5406efe95f52c1c and 2 for PR HEAD cbc62c55589992d00643455f45eebf2fe798bd8f in total

openshift-ci-robot avatar Aug 30 '24 06:08 openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD f1ade5751b9d643a9cc1c61e40046cadcd45dd94 and 2 for PR HEAD cbc62c55589992d00643455f45eebf2fe798bd8f in total

openshift-ci-robot avatar Aug 30 '24 12:08 openshift-ci-robot

@tkashem: Jira Issue OCPBUGS-38859: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-38859 has been moved to the MODIFIED state.

openshift-ci-robot avatar Aug 30 '24 14:08 openshift-ci-robot

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-tests

This PR has been included in build openshift-enterprise-tests-container-v4.18.0-202408301641.p0.g1ce76da.assembly.stream.el9. All builds following this will include this PR.

openshift-bot avatar Aug 30 '24 17:08 openshift-bot