OCPBUGS-38388: Fail on FailedToLease events for kubelet log collector

Open kannon92 opened this issue 1 year ago • 27 comments

kannon92 avatar Aug 12 '24 23:08 kannon92

/jira refresh

kannon92 avatar Aug 13 '24 00:08 kannon92

@kannon92: No Jira issue is referenced in the title of this pull request. To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Aug 13 '24 00:08 openshift-ci-robot

@kannon92: This pull request references Jira Issue OCPBUGS-38388, which is invalid:

  • expected the bug to target either version "4.18." or "openshift-4.18.", but it targets "4.17.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Aug 13 '24 00:08 openshift-ci-robot

/jira refresh

kannon92 avatar Aug 13 '24 00:08 kannon92

@kannon92: This pull request references Jira Issue OCPBUGS-38388, which is invalid:

  • expected the bug to target either version "4.18." or "openshift-4.18.", but it targets "4.17.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Aug 13 '24 00:08 openshift-ci-robot

/jira refresh

kannon92 avatar Aug 13 '24 00:08 kannon92

@kannon92: This pull request references Jira Issue OCPBUGS-38388, which is invalid:

  • expected the bug to target only the "4.18.0" version, but multiple target versions were set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Aug 13 '24 00:08 openshift-ci-robot

/jira refresh

kannon92 avatar Aug 13 '24 00:08 kannon92

@kannon92: This pull request references Jira Issue OCPBUGS-38388, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Aug 13 '24 00:08 openshift-ci-robot

Job Failure Risk Analysis for sha: ea11c108b00d599951d0f7376d1937d069bbd0e8

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-ipsec-serial IncompleteTests
Tests for this run (21) are below the historical average (462): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-gcp-csi Medium
[sig-network] can collect pod-to-host poller pod logs
This test has passed 94.74% of 19 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-csi'] in the last 14 days.

Open Bugs
collecting poller pod logs failing in e2e-vsphere-ovn jobs
---
[sig-network] can collect host-to-host poller pod logs
This test has passed 94.74% of 19 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-csi'] in the last 14 days.

openshift-trt-bot avatar Aug 15 '24 00:08 openshift-trt-bot

/retest

kannon92 avatar Aug 15 '24 01:08 kannon92

Job Failure Risk Analysis for sha: 4bb45c498addf50a9ad905ab07b9029758de3aa7

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-edge-zones IncompleteTests
Tests for this run (101) are below the historical average (1558): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

openshift-trt-bot avatar Aug 15 '24 06:08 openshift-trt-bot

/retest

kannon92 avatar Aug 15 '24 12:08 kannon92

Job Failure Risk Analysis for sha: e2ef4d0e67e859525efdf432285caa2c6ef76c80

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-ipsec-serial High
[bz-Management Console] clusteroperator/console should not change condition/Available
This test has passed 100.00% of 36 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.

Open Bugs
[bz-Management Console] clusteroperator/console should not change condition/Available

openshift-trt-bot avatar Aug 15 '24 18:08 openshift-trt-bot

/retest

kannon92 avatar Aug 16 '24 02:08 kannon92

Job Failure Risk Analysis for sha: 304b1ebf85f7f6017de20c184650dc87cb3657e5

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade High
[sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator
This test has passed 99.17% of 121 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

openshift-trt-bot avatar Aug 16 '24 05:08 openshift-trt-bot

Job Failure Risk Analysis for sha: 4bfbeeaa493cb1c8de763b7448434e1d7ccca321

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade High
[sig-arch] events should not repeat pathologically for ns/openshift-machine-api
This test has passed 99.86% of 720 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-upgrade'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-serial High
[sig-network-edge][Feature:Idling] Unidling with Deployments [apigroup:route.openshift.io] should handle many TCP connections by possibly dropping those over a certain bound [Serial] [Suite:openshift/conformance/serial]
This test has passed 100.00% of 1 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade Medium
[sig-network-edge] Verify DNS availability during and after upgrade success
This test has passed 94.77% of 172 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

openshift-trt-bot avatar Aug 19 '24 18:08 openshift-trt-bot

/retest-required

kannon92 avatar Aug 19 '24 18:08 kannon92

Looks like the test isn't de-duping as expected. Have a look at https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/28999/pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial/1825557058091487232 . Duplicate instances of the same lease failure? Logic error in display?

deads2k avatar Aug 20 '24 16:08 deads2k

very neat

Aug 19 17:30:23.156577 ip-10-0-77-156 kubenswrapper[2478]: E0819 17:30:23.156522 2478 controller.go:195] "Failed to update lease" err="Put "https://api-int.ci-op-ht2pcfvh-a6aef.aws-2.ci.openshift.org:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-0-77-156.us-west-2.compute.internal?timeout=10s": dial tcp 10.0.54.68:6443: connect: connection refused"
Aug 19 17:30:23.159659 ip-10-0-77-156 kubenswrapper[2478]: E0819 17:30:23.159621 2478 controller.go:195] "Failed to update lease" err="Put "https://api-int.ci-op-ht2pcfvh-a6aef.aws-2.ci.openshift.org:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-0-77-156.us-west-2.compute.internal?timeout=10s": dial tcp 10.0.87.235:6443: connect: connection refused"
Aug 19 17:30:23.163067 ip-10-0-77-156 kubenswrapper[2478]: E0819 17:30:23.163032 2478 controller.go:195] "Failed to update lease" err="Put "https://api-int.ci-op-ht2pcfvh-a6aef.aws-2.ci.openshift.org:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-0-77-156.us-west-2.compute.internal?timeout=10s": dial tcp 10.0.54.68:6443: connect: connection refused"
Aug 19 17:30:23.167030 ip-10-0-77-156 kubenswrapper[2478]: E0819 17:30:23.166982 2478 controller.go:195] "Failed to update lease" err="Put "https://api-int.ci-op-ht2pcfvh-a6aef.aws-2.ci.openshift.org:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-0-77-156.us-west-2.compute.internal?timeout=10s": dial tcp 10.0.87.235:6443: connect: connection refused"
Aug 19 17:30:23.170620 ip-10-0-77-156 kubenswrapper[2478]: E0819 17:30:23.170584 2478 controller.go:195] "Failed to update lease" err="Put "https://api-int.ci-op-ht2pcfvh-a6aef.aws-2.ci.openshift.org:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-0-77-156.us-west-2.compute.internal?timeout=10s": dial tcp 10.0.54.68:6443: connect: connection refused"
Aug 19 17:30:23.170756 ip-10-0-77-156 kubenswrapper[2478]: I0819 17:30:23.170623 2478 controller.go:115] "failed to update lease using latest lease, fallback to ensure lease" err="failed 5 attempts to update lease"

from the log. The backoff logic is interesting: five rapid attempts, then a sleep. Perhaps we're looking for "Failed to ensure lease exists" instead?
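To make the de-duping question concrete, here is a rough sketch (not the code in this PR) of collapsing each run of rapid "Failed to update lease" lines into a single burst per node. It assumes journal-style lines shaped like the excerpt above; `count_lease_bursts`, `PREFIX`, and `FALLBACK` are hypothetical names used only for illustration.

```python
import re
from collections import Counter

# Journal prefix as seen in the excerpt: "<Mon> <day> <time> <node> <unit>[pid]: <msg>".
# Assumption: all kubelet lines of interest follow this shape.
PREFIX = re.compile(
    r'^\w{3} +\d+ [\d:.]+ (?P<node>\S+) \S+\[\d+\]: (?P<msg>.*)$'
)
FAILURE = '"Failed to update lease"'
FALLBACK = 'fallback to ensure lease'


def count_lease_bursts(lines):
    """Return {node: bursts}, where one burst is a run of lease-update
    failures closed by the kubelet's fallback message (or by end of input)."""
    bursts = Counter()
    pending = Counter()  # failures seen since the last fallback, per node
    for line in lines:
        m = PREFIX.match(line)
        if not m:
            continue  # ignore lines that do not look like journal entries
        node, msg = m.group("node"), m.group("msg")
        if FAILURE in msg:
            pending[node] += 1
        elif FALLBACK in msg and pending[node]:
            bursts[node] += 1  # the whole run counts once
            pending[node] = 0
    # A trailing run with no fallback line yet still counts as one burst.
    for node, n in pending.items():
        if n:
            bursts[node] += 1
    return dict(bursts)
```

With this grouping, the five rapid failures plus the fallback line in the excerpt would be reported as one burst for `ip-10-0-77-156` rather than five separate failures. Whether a burst or the later "ensure lease" failure is the right thing to flag is the open question in this thread.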

deads2k avatar Aug 20 '24 16:08 deads2k

/hold

Going to look at the logs tomorrow to see if this works.

kannon92 avatar Aug 20 '24 21:08 kannon92

/retest

kannon92 avatar Aug 21 '24 11:08 kannon92

Job Failure Risk Analysis for sha: e805d53c4dd605c0c47e7297394c8b43b38907ec

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout IncompleteTests
Tests for this run (20) are below the historical average (717): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6 IncompleteTests
Tests for this run (20) are below the historical average (1830): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn IncompleteTests
Tests for this run (20) are below the historical average (2020): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

openshift-trt-bot avatar Aug 21 '24 14:08 openshift-trt-bot

Job Failure Risk Analysis for sha: 1558beaef3bd8ce46cccd05b0ee32490736cd9dc

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade Medium
[sig-network] pods should successfully create sandboxes by adding pod to network
This test has passed 80.85% of 141 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

Open Bugs
s390x: [sig-network] pods should successfully create sandboxes by adding pod to network fails with error adding pod to CNI network

openshift-trt-bot avatar Aug 24 '24 18:08 openshift-trt-bot

/retest

kannon92 avatar Sep 01 '24 02:09 kannon92

Job Failure Risk Analysis for sha: 4a7181edf225cfd1be85f5322eaaa7f03f87380b

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade Medium
[sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator
This test has passed 93.33% of 120 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

openshift-trt-bot avatar Sep 01 '24 05:09 openshift-trt-bot

/hold cancel

kannon92 avatar Sep 11 '24 16:09 kannon92

/retest

kannon92 avatar Sep 11 '24 19:09 kannon92

/lgtm
/approve
/hold

you may release the hold when you're confident this is ready.

deads2k avatar Sep 11 '24 19:09 deads2k

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, kannon92

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Sep 11 '24 19:09 openshift-ci[bot]