NO-JIRA: add watch machines monitor test for phase changes of machines

Open kannon92 opened this issue 1 year ago • 6 comments

kannon92 avatar Aug 22 '24 21:08 kannon92

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92

Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Aug 22 '24 21:08 openshift-ci[bot]

/cc @elmiko @deads2k

/hold

Going to make sure intervals are created before I merge.

I will also use these intervals to fix the unexpected node-not-ready failures, but I wanted to debug these monitor tests first.

kannon92 avatar Aug 22 '24 21:08 kannon92

@kannon92: This pull request explicitly references no jira issue.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Aug 22 '24 21:08 openshift-ci-robot

The intervals exist, so I pushed a fix that filters out these unexpected events if they fall within a machine phase change.
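The filtering idea in this comment can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Interval` struct, `filterUnexpected`, `within`, and `mustParse` are hypothetical names, and the sketch assumes each machine phase change is represented as a window with distinct start and end times. It also ignores which node or machine an event belongs to, which a real monitor test would likely need to match.

```go
package main

import "time"

// Interval is a simplified, hypothetical stand-in for a monitor interval.
// The real types in origin carry more fields (locators, annotations, etc.).
type Interval struct {
	Reason string
	From   time.Time
	To     time.Time
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// within reports whether instant t falls inside [iv.From, iv.To], inclusive.
func within(t time.Time, iv Interval) bool {
	return !t.Before(iv.From) && !t.After(iv.To)
}

// filterUnexpected returns the events whose start time does NOT fall inside
// any machine phase change window. Events that do fall inside one are
// treated as expected and dropped.
func filterUnexpected(events, phaseChanges []Interval) []Interval {
	var kept []Interval
	for _, ev := range events {
		covered := false
		for _, pc := range phaseChanges {
			if within(ev.From, pc) {
				covered = true
				break
			}
		}
		if !covered {
			kept = append(kept, ev)
		}
	}
	return kept
}
```

Comparing only the event's start time against the window is a design shortcut; a stricter variant could require the whole event to lie inside the window.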

kannon92 avatar Aug 26 '24 21:08 kannon92

Job Failure Risk Analysis for sha: 16567582d901c5f5074476064f3fe2e5e3554834

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade Low
[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers for ns/openshift-marketplace
This test has passed 37.96% of 108 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

openshift-trt-bot avatar Aug 26 '24 23:08 openshift-trt-bot

Job Failure Risk Analysis for sha: cac5fac5fd081462bb0a9d36157467b975d4e74e

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade IncompleteTests

openshift-trt-bot avatar Aug 29 '24 01:08 openshift-trt-bot

/retest-required

kannon92 avatar Aug 29 '24 02:08 kannon92

/hold cancel

kannon92 avatar Aug 29 '24 16:08 kannon92

Job Failure Risk Analysis for sha: 68812179e3bf6add5fd721c895ba7cd955786418

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade IncompleteTests
Tests for this run (104) are below the historical average (820): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout Low
operator conditions kube-apiserver
This test has passed 60.00% of 15 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days.

Open Bugs
operator conditions kube-apiserver failures
---
[sig-sippy] tests should finish with healthy operators
This test has passed 60.00% of 15 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days.

openshift-trt-bot avatar Aug 29 '24 19:08 openshift-trt-bot

/retest-required

kannon92 avatar Aug 29 '24 19:08 kannon92

/retest

kannon92 avatar Sep 01 '24 02:09 kannon92

Job Failure Risk Analysis for sha: e0912f9dd628ea6d108d2e14baa778f47dec4ef9

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout High
[sig-network] there should be nearly zero single second disruptions for openshift-api-http2-internal-lb-reused-connections
This test has passed 100.00% of 1 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade Medium
[sig-network] pods should successfully create sandboxes by adding pod to network
This test has passed 91.67% of 120 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

Open Bugs
s390x: [sig-network] pods should successfully create sandboxes by adding pod to network fails with error adding pod to CNI network

openshift-trt-bot avatar Sep 01 '24 06:09 openshift-trt-bot

Job Failure Risk Analysis for sha: 9fa46c5d27f4589c034499f05c1852a446e1ffca

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-ipsec-serial High
[bz-Management Console] clusteroperator/console should not change condition/Available
This test has passed 100.00% of 39 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
---
[sig-arch] events should not repeat pathologically for ns/openshift-ovn-kubernetes
This test has passed 100.00% of 39 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
---
[sig-arch] events should not repeat pathologically
This test has passed 100.00% of 39 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
---
[sig-arch] events should not repeat pathologically for ns/openshift-authentication-operator
This test has passed 100.00% of 39 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
---
Showing 4 of 5 test results

openshift-trt-bot avatar Sep 04 '24 01:09 openshift-trt-bot

capacity issues

/retest

deads2k avatar Sep 04 '24 18:09 deads2k

@kannon92: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 37d75072920c45286cb91a1e89cd283465cfeba2 | link | false | /test e2e-aws-ovn-single-node-upgrade |
| ci/prow/e2e-metal-ipi-ovn | 37d75072920c45286cb91a1e89cd283465cfeba2 | link | false | /test e2e-metal-ipi-ovn |
| ci/prow/e2e-aws-ovn-ipsec-serial | 37d75072920c45286cb91a1e89cd283465cfeba2 | link | false | /test e2e-aws-ovn-ipsec-serial |
| ci/prow/e2e-aws-ovn-kube-apiserver-rollout | 37d75072920c45286cb91a1e89cd283465cfeba2 | link | false | /test e2e-aws-ovn-kube-apiserver-rollout |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Sep 04 '24 21:09 openshift-ci[bot]

@deads2k This is working. Can we go ahead and merge?

kannon92 avatar Sep 05 '24 14:09 kannon92

> @deads2k This is working. Can we go ahead and merge?

the timelines need to not end up at the beginning of the epoch

deads2k avatar Sep 05 '24 14:09 deads2k

> > @deads2k This is working. Can we go ahead and merge?
>
> the timelines need to not end up at the beginning of the epoch

Yes. I created https://issues.redhat.com/browse/TRT-1798 to have TRT look into it.

I thought that this was a render issue. The intervals seem fine in the files.

            "message": {
                "reason": "MachinePhaseChange",
                "cause": "",
                "humanMessage": "Machine phase changed from \u003cmissing\u003e to Provisioning",
                "annotations": {
                    "node": "\u003cunknown\u003e",
                    "phase": "Provisioning",
                    "previousPhase": "\u003cmissing\u003e",
                    "reason": "MachinePhaseChange"
                }
            },
            "from": "2024-09-04T19:59:05Z",
            "to": "2024-09-04T19:59:05Z"
        },

This From/To is not at the end.
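One way to check the raw files for the epoch problem discussed here is to decode the intervals and flag any whose timestamps are unset. The sketch below is an assumption-laden illustration: the `interval` struct mirrors only the fields visible in the excerpt above, `suspectIntervals` and `atEpochOrUnset` are hypothetical helpers, and the input is assumed to be a bare JSON array of intervals, which may not match the actual file layout.

```go
package main

import (
	"encoding/json"
	"time"
)

// interval mirrors only the fields visible in the JSON excerpt.
// It is a hypothetical helper, not a type taken from origin.
type interval struct {
	Message struct {
		Reason       string            `json:"reason"`
		HumanMessage string            `json:"humanMessage"`
		Annotations  map[string]string `json:"annotations"`
	} `json:"message"`
	From time.Time `json:"from"`
	To   time.Time `json:"to"`
}

// atEpochOrUnset reports whether t is Go's zero time or the Unix epoch,
// either of which would place an interval at the very start of a timeline.
func atEpochOrUnset(t time.Time) bool {
	return t.IsZero() || t.Unix() == 0
}

// suspectIntervals decodes a JSON array of intervals (assumed layout) and
// returns the indexes of entries whose from or to timestamp looks unset.
func suspectIntervals(data []byte) ([]int, error) {
	var items []interval
	if err := json.Unmarshal(data, &items); err != nil {
		return nil, err
	}
	var bad []int
	for i, it := range items {
		if atEpochOrUnset(it.From) || atEpochOrUnset(it.To) {
			bad = append(bad, i)
		}
	}
	return bad, nil
}
```

An empty result for a given file would be consistent with the view that the stored intervals are fine and the problem lies in rendering, though it would not prove it.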

kannon92 avatar Sep 05 '24 14:09 kannon92

/close

@deads2k proposed an improvement that hopefully fixes the bound machine problem.

kannon92 avatar Sep 05 '24 18:09 kannon92

/close

kannon92 avatar Sep 05 '24 18:09 kannon92

@kannon92: Closed this PR.

In response to this:

/close

@deads2k proposed an improvement that hopefully fixes the bound machine problem.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci[bot] avatar Sep 05 '24 18:09 openshift-ci[bot]