origin OCPBUGS-38388: Fail on FailedToLease events for kubelet log collector

Aug 12 '24 23:08 kannon92

/jira refresh

Aug 13 '24 00:08 kannon92

@kannon92: No Jira issue is referenced in the title of this pull request. To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Aug 13 '24 00:08 openshift-ci-robot

@kannon92: This pull request references Jira Issue OCPBUGS-38388, which is invalid:

expected the bug to target either version "4.18." or "openshift-4.18.", but it targets "4.17.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Aug 13 '24 00:08 openshift-ci-robot

/jira refresh

Aug 13 '24 00:08 kannon92

@kannon92: This pull request references Jira Issue OCPBUGS-38388, which is invalid:

expected the bug to target either version "4.18." or "openshift-4.18.", but it targets "4.17.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Aug 13 '24 00:08 openshift-ci-robot

/jira refresh

Aug 13 '24 00:08 kannon92

@kannon92: This pull request references Jira Issue OCPBUGS-38388, which is invalid:

expected the bug to target only the "4.18.0" version, but multiple target versions were set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Aug 13 '24 00:08 openshift-ci-robot

/jira refresh

Aug 13 '24 00:08 kannon92

@kannon92: This pull request references Jira Issue OCPBUGS-38388, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.18.0) matches configured target version for branch (4.18.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

In response to this:

/jira refresh

Aug 13 '24 00:08 openshift-ci-robot

Job Failure Risk Analysis for sha: ea11c108b00d599951d0f7376d1937d069bbd0e8

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-ipsec-serial	IncompleteTests Tests for this run (21) are below the historical average (462): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-gcp-csi	Medium [sig-network] can collect pod-to-host poller pod logs This test has passed 94.74% of 19 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-csi'] in the last 14 days. Open Bugs collecting poller pod logs failing in e2e-vsphere-ovn jobs --- [sig-network] can collect host-to-host poller pod logs This test has passed 94.74% of 19 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-csi'] in the last 14 days.

Aug 15 '24 00:08 openshift-trt-bot

/retest

Aug 15 '24 01:08 kannon92

Job Failure Risk Analysis for sha: 4bb45c498addf50a9ad905ab07b9029758de3aa7

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-edge-zones	IncompleteTests Tests for this run (101) are below the historical average (1558): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

Aug 15 '24 06:08 openshift-trt-bot

/retest

Aug 15 '24 12:08 kannon92

Job Failure Risk Analysis for sha: e2ef4d0e67e859525efdf432285caa2c6ef76c80

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-ipsec-serial	High [bz-Management Console] clusteroperator/console should not change condition/Available This test has passed 100.00% of 36 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days. Open Bugs [bz-Management Console] clusteroperator/console should not change condition/Available

Aug 15 '24 18:08 openshift-trt-bot

/retest

Aug 16 '24 02:08 kannon92

Job Failure Risk Analysis for sha: 304b1ebf85f7f6017de20c184650dc87cb3657e5

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade	High [sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator This test has passed 99.17% of 121 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

Aug 16 '24 05:08 openshift-trt-bot

Job Failure Risk Analysis for sha: 4bfbeeaa493cb1c8de763b7448434e1d7ccca321

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade	High [sig-arch] events should not repeat pathologically for ns/openshift-machine-api This test has passed 99.86% of 720 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-upgrade'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-serial	High [sig-network-edge][Feature:Idling] Unidling with Deployments [apigroup:route.openshift.io] should handle many TCP connections by possibly dropping those over a certain bound [Serial] [Suite:openshift/conformance/serial] This test has passed 100.00% of 1 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade	Medium [sig-network-edge] Verify DNS availability during and after upgrade success This test has passed 94.77% of 172 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

Aug 19 '24 18:08 openshift-trt-bot

/retest-required

Aug 19 '24 18:08 kannon92

Looks like the test isn't de-duping as expected. Have a look at https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/28999/pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial/1825557058091487232 . Duplicate instances of the same lease failure? Logic error in display?

Aug 20 '24 16:08 deads2k

very neat

Aug 19 17:30:23.156577 ip-10-0-77-156 kubenswrapper[2478]: E0819 17:30:23.156522 2478 controller.go:195] "Failed to update lease" err="Put \"https://api-int.ci-op-ht2pcfvh-a6aef.aws-2.ci.openshift.org:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-0-77-156.us-west-2.compute.internal?timeout=10s\": dial tcp 10.0.54.68:6443: connect: connection refused"
Aug 19 17:30:23.159659 ip-10-0-77-156 kubenswrapper[2478]: E0819 17:30:23.159621 2478 controller.go:195] "Failed to update lease" err="Put \"https://api-int.ci-op-ht2pcfvh-a6aef.aws-2.ci.openshift.org:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-0-77-156.us-west-2.compute.internal?timeout=10s\": dial tcp 10.0.87.235:6443: connect: connection refused"
Aug 19 17:30:23.163067 ip-10-0-77-156 kubenswrapper[2478]: E0819 17:30:23.163032 2478 controller.go:195] "Failed to update lease" err="Put \"https://api-int.ci-op-ht2pcfvh-a6aef.aws-2.ci.openshift.org:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-0-77-156.us-west-2.compute.internal?timeout=10s\": dial tcp 10.0.54.68:6443: connect: connection refused"
Aug 19 17:30:23.167030 ip-10-0-77-156 kubenswrapper[2478]: E0819 17:30:23.166982 2478 controller.go:195] "Failed to update lease" err="Put \"https://api-int.ci-op-ht2pcfvh-a6aef.aws-2.ci.openshift.org:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-0-77-156.us-west-2.compute.internal?timeout=10s\": dial tcp 10.0.87.235:6443: connect: connection refused"
Aug 19 17:30:23.170620 ip-10-0-77-156 kubenswrapper[2478]: E0819 17:30:23.170584 2478 controller.go:195] "Failed to update lease" err="Put \"https://api-int.ci-op-ht2pcfvh-a6aef.aws-2.ci.openshift.org:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-0-77-156.us-west-2.compute.internal?timeout=10s\": dial tcp 10.0.54.68:6443: connect: connection refused"
Aug 19 17:30:23.170756 ip-10-0-77-156 kubenswrapper[2478]: I0819 17:30:23.170623 2478 controller.go:115] "failed to update lease using latest lease, fallback to ensure lease" err="failed 5 attempts to update lease"

from the log. The backoff logic is interesting: five rapid attempts, then a sleep. Perhaps we should be looking for "Failed to ensure lease exists" instead?

Aug 20 '24 16:08 deads2k

/hold

Going to look at the logs tomorrow to see if this works.

Aug 20 '24 21:08 kannon92

/retest

Aug 21 '24 11:08 kannon92

Job Failure Risk Analysis for sha: e805d53c4dd605c0c47e7297394c8b43b38907ec

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout	IncompleteTests Tests for this run (20) are below the historical average (717): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6	IncompleteTests Tests for this run (20) are below the historical average (1830): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn	IncompleteTests Tests for this run (20) are below the historical average (2020): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

Aug 21 '24 14:08 openshift-trt-bot

Job Failure Risk Analysis for sha: 1558beaef3bd8ce46cccd05b0ee32490736cd9dc

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade	Medium [sig-network] pods should successfully create sandboxes by adding pod to network This test has passed 80.85% of 141 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week. Open Bugs s390x: [sig-network] pods should successfully create sandboxes by adding pod to network fails with error adding pod to CNI network

Aug 24 '24 18:08 openshift-trt-bot

/retest

Sep 01 '24 02:09 kannon92

Job Failure Risk Analysis for sha: 4a7181edf225cfd1be85f5322eaaa7f03f87380b

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade	Medium [sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator This test has passed 93.33% of 120 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

Sep 01 '24 05:09 openshift-trt-bot

/hold cancel

Sep 11 '24 16:09 kannon92

/retest

Sep 11 '24 19:09 kannon92

/lgtm
/approve
/hold

You may release the hold when you're confident this is ready.

Sep 11 '24 19:09 deads2k

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, kannon92

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [deads2k]

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

Sep 11 '24 19:09 openshift-ci[bot]
