pkg/synthetictests/operators: Fatal unless Available=False in allow-list

Open wking opened this issue 3 years ago • 5 comments

Filling in known bugs from here (~WIP: not actually full yet~). This allows us to fail new regressions, and gradually ratchet tighter as we close out existing issues.

wking avatar Jun 08 '22 00:06 wking
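
The ratchet described in the PR description can be pictured as an allow-list of known-bug patterns: an `Available=False` transition that matches an entry is tolerated, and anything else fails the test-case. The following is a minimal sketch of that idea, not the code in `pkg/synthetictests/operators`. The `KnownException` type, the `classify` helper, and the placeholder bug references are all hypothetical; the two reasons are taken from later comments in this thread purely as sample data.

```python
import re
from dataclasses import dataclass
from typing import Optional, Pattern


@dataclass(frozen=True)
class KnownException:
    """Hypothetical allow-list entry: which blip is tolerated, and why."""
    operator: str     # ClusterOperator name
    reason: Pattern   # regexp matched against the condition's reason
    why: str          # bug reference or explanation (placeholders here)


# Illustrative entries only; the real list lives in the pull request.
ALLOW_LIST = [
    KnownException(
        "authentication",
        re.compile(r"APIServices_PreconditionNotReady"),
        "placeholder: known bug",
    ),
    KnownException(
        "operator-lifecycle-manager-packageserver",
        re.compile(r"ClusterServiceVersionNotSucceeded"),
        "placeholder: known bug",
    ),
]


def classify(operator: str, status: str, reason: str) -> Optional[str]:
    """Return the exception text if an Available transition is tolerated.

    Returning None means the transition is unexpected, so the
    test-case should fail.
    """
    if status == "True":
        return "Available=True is the happy case"
    for exc in ALLOW_LIST:
        if exc.operator == operator and exc.reason.fullmatch(reason):
            return exc.why
    return None
```

Tightening the ratchet then amounts to deleting an entry once its bug is closed, so that any recurrence is treated as a new regression.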

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking
To complete the pull request process, please assign bparees after the PR has been reviewed. You can assign the PR to them by writing /assign @bparees in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Jun 08 '22 00:06 openshift-ci[bot]

Success:

image

I'll expand the exception list...

wking avatar Jun 08 '22 16:06 wking

> WIP: not actually full yet

I've filled in the list of regexps in 78c5e91d56, so dropping the WIP.

wking avatar Jun 08 '22 17:06 wking

New phrasing:

image

wking avatar Jun 09 '22 04:06 wking

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 054f30d2224d39fa81946f8aa9602283c588c61e | link | false | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-gcp-ovn-rt-upgrade | 054f30d2224d39fa81946f8aa9602283c588c61e | link | false | /test e2e-gcp-ovn-rt-upgrade |
| ci/prow/e2e-agnostic-cmd | 054f30d2224d39fa81946f8aa9602283c588c61e | link | false | /test e2e-agnostic-cmd |
| ci/prow/e2e-aws-single-node-upgrade | 054f30d2224d39fa81946f8aa9602283c588c61e | link | false | /test e2e-aws-single-node-upgrade |
| ci/prow/e2e-aws-ovn-fips | 054f30d2224d39fa81946f8aa9602283c588c61e | link | true | /test e2e-aws-ovn-fips |
| ci/prow/e2e-aws-ovn-serial | 054f30d2224d39fa81946f8aa9602283c588c61e | link | true | /test e2e-aws-ovn-serial |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Aug 31 '22 18:08 openshift-ci[bot]

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 054f30d2224d39fa81946f8aa9602283c588c61e | link | false | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-gcp-ovn-rt-upgrade | 054f30d2224d39fa81946f8aa9602283c588c61e | link | false | /test e2e-gcp-ovn-rt-upgrade |
| ci/prow/e2e-agnostic-cmd | 054f30d2224d39fa81946f8aa9602283c588c61e | link | false | /test e2e-agnostic-cmd |
| ci/prow/e2e-aws-single-node-upgrade | 054f30d2224d39fa81946f8aa9602283c588c61e | link | false | /test e2e-aws-single-node-upgrade |
| ci/prow/e2e-aws-ovn-fips | 054f30d2224d39fa81946f8aa9602283c588c61e | link | true | /test e2e-aws-ovn-fips |
| ci/prow/e2e-aws-ovn-serial | 054f30d2224d39fa81946f8aa9602283c588c61e | link | true | /test e2e-aws-ovn-serial |
| ci/prow/e2e-gcp-ovn-image-ecosystem | 054f30d2224d39fa81946f8aa9602283c588c61e | link | true | /test e2e-gcp-ovn-image-ecosystem |

openshift-ci[bot] avatar Nov 05 '22 00:11 openshift-ci[bot]

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Feb 04 '23 01:02 openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot avatar Mar 06 '23 08:03 openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-bot avatar Apr 06 '23 00:04 openshift-bot

@openshift-bot: Closed this PR.

openshift-ci[bot] avatar Apr 06 '23 00:04 openshift-ci[bot]

/remove-lifecycle stale
/reopen

wking avatar Oct 03 '23 17:10 wking

@wking: Reopened this PR.

openshift-ci[bot] avatar Oct 03 '23 17:10 openshift-ci[bot]

/remove-lifecycle rotten

wking avatar Oct 03 '23 17:10 wking

@wking: This pull request references OTA-362 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

openshift-ci-robot avatar Oct 03 '23 17:10 openshift-ci-robot

/cc

petr-muller avatar Oct 04 '23 10:10 petr-muller

Job Failure Risk Analysis for sha: 23301f18ea880511c05c6e94112d9756eabb953c

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-master-e2e-openstack-ovn | IncompleteTests<br>Tests for this run (15) are below the historical average (1897): IncompleteTests |
| pull-ci-openshift-origin-master-e2e-metal-ipi-sdn | IncompleteTests<br>Tests for this run (14) are below the historical average (1558): IncompleteTests |
| pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6 | IncompleteTests<br>Tests for this run (14) are below the historical average (1514): IncompleteTests |
| pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade | IncompleteTests<br>Tests for this run (17) are below the historical average (733): IncompleteTests |
| pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade | IncompleteTests<br>Tests for this run (17) are below the historical average (740): IncompleteTests |
| pull-ci-openshift-origin-master-e2e-gcp-ovn | IncompleteTests<br>Tests for this run (16) are below the historical average (1863): IncompleteTests |
| pull-ci-openshift-origin-master-e2e-gcp-csi | IncompleteTests<br>Tests for this run (16) are below the historical average (701): IncompleteTests |
| pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade | IncompleteTests<br>Tests for this run (18) are below the historical average (706): IncompleteTests |
| pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade | IncompleteTests<br>Tests for this run (17) are below the historical average (2057): IncompleteTests |
| pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial | IncompleteTests<br>Tests for this run (15) are below the historical average (708): IncompleteTests |
| pull-ci-openshift-origin-master-e2e-aws-ovn-single-node | IncompleteTests<br>Tests for this run (15) are below the historical average (1793): IncompleteTests |
| pull-ci-openshift-origin-master-e2e-aws-ovn-serial | IncompleteTests<br>Tests for this run (16) are below the historical average (778): IncompleteTests |
| pull-ci-openshift-origin-master-e2e-aws-ovn-fips | IncompleteTests<br>Tests for this run (16) are below the historical average (2006): IncompleteTests |
| pull-ci-openshift-origin-master-e2e-aws-ovn-cgroupsv2 | IncompleteTests<br>Tests for this run (15) are below the historical average (1970): IncompleteTests |
| pull-ci-openshift-origin-master-e2e-aws-csi | IncompleteTests<br>Tests for this run (16) are below the historical average (744): IncompleteTests |
| pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd | IncompleteTests<br>Tests for this run (16) are below the historical average (669): IncompleteTests |

openshift-trt-bot avatar Nov 21 '23 23:11 openshift-trt-bot

Job Failure Risk Analysis for sha: c07c721963505f6b1f9fd532ef0e938e16c74f0d

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-master-e2e-openstack-ovn | IncompleteTests<br>Tests for this run (18) are below the historical average (1843): IncompleteTests |

openshift-trt-bot avatar Nov 22 '23 03:11 openshift-trt-bot

Job Failure Risk Analysis for sha: 8e7818ace7e2fd009fbe9796ecebd19f71420a82

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-master-e2e-openstack-ovn | IncompleteTests<br>Tests for this run (22) are below the historical average (1630): IncompleteTests |

openshift-trt-bot avatar Nov 23 '23 04:11 openshift-trt-bot

Job Failure Risk Analysis for sha: b9240a4b627b4338fc4a27558e809d3f376b6d96

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-master-e2e-openstack-ovn | IncompleteTests<br>Tests for this run (15) are below the historical average (1528): IncompleteTests |

openshift-trt-bot avatar Nov 24 '23 17:11 openshift-trt-bot

Job Failure Risk Analysis for sha: 79212ee2f7c2351b4a9bcb34b9bf75178632c60d

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd | IncompleteTests<br>Tests for this run (16) are below the historical average (551): IncompleteTests |

openshift-trt-bot avatar Nov 27 '23 19:11 openshift-trt-bot

AWS OVN update failed an Available test-case:

: [bz-apiserver-auth] clusteroperator/authentication should not change condition/Available	2h25m38s
{  2 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:

Nov 27 17:45:23.986 E clusteroperator/authentication condition/Available reason/APIServerDeployment_NoDeployment status/False APIServerDeploymentAvailable: deployment/openshift-oauth-apiserver: could not be retrieved
Nov 27 17:45:23.986 - 3s    E clusteroperator/authentication condition/Available reason/APIServerDeployment_NoDeployment status/False APIServerDeploymentAvailable: deployment/openshift-oauth-apiserver: could not be retrieved

1 unwelcome but acceptable clusteroperator state transitions during e2e test run.  These should not happen, but because they are tied to exceptions, the fact that they did happen is not sufficient to cause this test-case to fail:

Nov 27 17:45:27.420 W clusteroperator/authentication condition/Available reason/AsExpected status/True All is well (exception: Available=True is the happy case)
}

And we don't have an exception for APIServerDeployment_NoDeployment, so that failure looks like the check is working. The run also flaked a Degraded test-case with "We are not worried about Degraded=True blips for update tests yet", so that's also working:

: [bz-DNS] clusteroperator/dns should not change condition/Degraded
Run #0: Failed	2h25m38s
{  0 unexpected clusteroperator state transitions during e2e test run, as desired.
6 unwelcome but acceptable clusteroperator state transitions during e2e test run.  These should not happen, but because they are tied to exceptions, the fact that they did happen is not sufficient to cause this test-case to fail:

Nov 27 17:48:34.907 E clusteroperator/dns condition/Degraded reason/DNSDegraded status/True DNS default is degraded (exception: We are not worried about Degraded=True blips for update tests yet.)
Nov 27 17:48:34.907 - 42s   E clusteroperator/dns condition/Degraded reason/DNSDegraded status/True DNS default is degraded (exception: We are not worried about Degraded=True blips for update tests yet.)
Nov 27 17:49:17.518 W clusteroperator/dns condition/Degraded reason/DNSNotDegraded status/False (exception: Degraded=False is the happy case)
Nov 27 18:43:16.191 E clusteroperator/dns condition/Degraded reason/DNSDegraded status/True DNS default is degraded (exception: We are not worried about Degraded=True blips for update tests yet.)
Nov 27 18:43:16.191 - 9s    E clusteroperator/dns condition/Degraded reason/DNSDegraded status/True DNS default is degraded (exception: We are not worried about Degraded=True blips for update tests yet.)
Nov 27 18:43:26.190 W clusteroperator/dns condition/Degraded reason/DNSNotDegraded status/False (exception: Degraded=False is the happy case)
}

AWS OVN serial flaked a test-case with "We are not worried about Available=False or Degraded=True blips for stable-system tests yet", so that looks good too:

: [bz-apiserver-auth] clusteroperator/authentication should not change condition/Available
Run #0: Failed	1h27m18s
{  0 unexpected clusteroperator state transitions during e2e test run, as desired.
6 unwelcome but acceptable clusteroperator state transitions during e2e test run.  These should not happen, but because they are tied to exceptions, the fact that they did happen is not sufficient to cause this test-case to fail:

Nov 27 17:02:30.828 E clusteroperator/authentication condition/Available reason/APIServices_PreconditionNotReady status/False APIServicesAvailable: PreconditionNotReady (exception: We are not worried about Available=False or Degraded=True blips for stable-system tests yet.)
Nov 27 17:02:30.828 - 105s  E clusteroperator/authentication condition/Available reason/APIServices_PreconditionNotReady status/False APIServicesAvailable: PreconditionNotReady (exception: We are not worried about Available=False or Degraded=True blips for stable-system tests yet.)
...

wking avatar Nov 27 '23 23:11 wking
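
The transition lines quoted in the comment above share a regular layout, which makes them straightforward to pick apart for later analysis. The sketch below parses one such line into named fields. The layout is inferred only from the lines quoted in this thread, and `parse_transition` is a hypothetical helper, not something taken from the origin repository.

```python
import re
from typing import Optional

# Assumed layout, inferred from the quoted output:
#   <timestamp> [- <duration>] <level> clusteroperator/<name>
#   condition/<type> reason/<reason> status/<status> [<message>]
LINE = re.compile(
    r"^(?P<time>\w{3} \d{1,2} \d{2}:\d{2}:\d{2}\.\d{3})"
    r"(?: - (?P<duration>\d+s))?\s+"
    r"(?P<level>[A-Z]) "
    r"clusteroperator/(?P<operator>\S+) "
    r"condition/(?P<condition>\S+) "
    r"reason/(?P<reason>\S+) "
    r"status/(?P<status>\S+)"
    r"(?: (?P<message>.*))?$"
)

# A tolerated transition carries a trailing "(exception: ...)" note.
EXCEPTION = re.compile(r"\(exception: (?P<exception>.*)\)$")


def parse_transition(line: str) -> Optional[dict]:
    """Split one transition line into fields, or return None if it does not match."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    note = EXCEPTION.search(fields["message"] or "")
    fields["exception"] = note.group("exception") if note else None
    return fields
```

A line with no `exception` field corresponds to an "unexpected" transition in the output above; one with the field set corresponds to an "unwelcome but acceptable" transition.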

/payload 4.15 nightly blocking

This all looks good to me so far. Per Slack, I'd propose we just make sure we're not likely to take out the nightly payload, then merge and monitor for what's failing it; you can keep an eye on your tool of choice for where it's failing.

If it helps, there is a small framework in sippy capable of extracting metadata out of test output. If you wanted to parse your output from this test and wind up with a JSON blob in the sippy db, you could then query/group/aggregate with SQL. We've used it in the past to find top offenders programmatically. https://github.com/openshift/sippy/blob/master/pkg/dataloader/prowloader/testoutputmetadata.go

dgoodwin avatar Nov 28 '23 12:11 dgoodwin
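
To make the "top offenders" idea concrete, here is a rough sketch of turning already-parsed transitions into a JSON blob ordered by frequency. It does not use sippy's actual interfaces, and the field names and the `summarize` helper are hypothetical; it only illustrates the kind of grouping the comment above describes.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def summarize(transitions: Iterable[Dict[str, str]]) -> str:
    """Count transitions per (operator, condition, reason), most frequent first."""
    counts = Counter(
        (t["operator"], t["condition"], t["reason"]) for t in transitions
    )
    rows = [
        {"operator": op, "condition": cond, "reason": reason, "count": n}
        for (op, cond, reason), n in counts.most_common()
    ]
    return json.dumps(rows, sort_keys=True)
```

Stored per job run, rows of this shape could then be grouped and summed with SQL to see which operator and reason combinations blip most often.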

@dgoodwin: trigger 8 job(s) of type blocking for the nightly release of OCP 4.15

  • periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.15-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-sdn-bm

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2493d350-8dec-11ee-95db-8de7120989e4-0

openshift-ci[bot] avatar Nov 28 '23 12:11 openshift-ci[bot]

I've pushed 79212ee2f7 -> 1821445c79 with the following pivots:

  • Including a JUnit for each operator/condition pair in the happy-case "no surprising blips", to avoid "we require at least 6 attempts to have a chance at success" failures in aggregate runs.
  • Expanded reason matching for authentication and monitoring, to cover the reasons seen recently in 4.15 update CI.
  • Fixed reason for operator-lifecycle-manager-packageserver (ClusterServiceVersion to ClusterServiceVersionNotSucceeded).
  • New OCPBUGS-24041 exception matching some console blips.

Once presubmits give positive signs, I'll launch a new round of blocker payload jobs.

wking avatar Nov 28 '23 18:11 wking
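
The first pivot in the comment above can be illustrated as follows: every operator/condition pair gets a test-case entry, whether or not anything went wrong, so that a run with no blips still contributes a passing result for each pair. This is a minimal sketch, not the PR's implementation. The `junit_for_pairs` helper and the suite name are hypothetical; the test-case naming follows the names visible in the quoted CI output.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (operator, condition)


def junit_for_pairs(pairs: Iterable[Pair], unexpected: Dict[Pair, List[str]]) -> ET.Element:
    """Build a testsuite with one testcase per pair, passing unless it has unexpected events."""
    pairs = list(pairs)
    suite = ET.Element("testsuite", name="clusteroperator conditions (sketch)")
    failures = 0
    for operator, condition in pairs:
        case = ET.SubElement(
            suite,
            "testcase",
            name=f"clusteroperator/{operator} should not change condition/{condition}",
        )
        events = unexpected.get((operator, condition), [])
        if events:
            # Only pairs with unexpected transitions get a <failure> child.
            failures += 1
            failure = ET.SubElement(case, "failure")
            failure.text = "\n".join(events)
    suite.set("tests", str(len(pairs)))
    suite.set("failures", str(failures))
    return suite
```

If a test-case were emitted only on failure, a job that aggregates several runs would see too few results for it, which appears to be what the quoted "at least 6 attempts" message refers to.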

/payload 4.15 nightly blocking

wking avatar Nov 29 '23 00:11 wking

@wking: trigger 8 job(s) of type blocking for the nightly release of OCP 4.15

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/40875bb0-8e4e-11ee-959f-9c9eb36a842c-0

openshift-ci[bot] avatar Nov 29 '23 00:11 openshift-ci[bot]

/payload 4.15 nightly blocking

wking avatar Nov 29 '23 06:11 wking

@wking: trigger 8 job(s) of type blocking for the nightly release of OCP 4.15

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/afc00530-8e7c-11ee-9ec5-48e1a7780da5-0

openshift-ci[bot] avatar Nov 29 '23 06:11 openshift-ci[bot]

Job Failure Risk Analysis for sha: ba523b5b481f02ee0bad1f60842838f0831cb9fb

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial | High<br>[sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel]<br>This test has passed 100.00% of 49 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial'] in the last 14 days. |

openshift-trt-bot avatar Nov 29 '23 08:11 openshift-trt-bot

/payload 4.15 nightly blocking

wking avatar Nov 30 '23 05:11 wking

@wking: trigger 8 job(s) of type blocking for the nightly release of OCP 4.15

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6833e8f0-8f42-11ee-9337-ff867cada0ed-0

openshift-ci[bot] avatar Nov 30 '23 05:11 openshift-ci[bot]

/lgtm
/hold
Cancel whenever you're ready!

dgoodwin avatar Nov 30 '23 11:11 dgoodwin

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, wking

The pull request process is described here

openshift-ci[bot] avatar Nov 30 '23 11:11 openshift-ci[bot]