
OCPBUGS-55238: spyglass: hide disruption events for localhost

Open vrutkovs opened this issue 8 months ago • 6 comments

Don't display localhost-related disruptions on spyglass. They are still displayed on non-spyglass reports in case an unexpected localhost disruption happens.

vrutkovs avatar Apr 24 '25 09:04 vrutkovs

@vrutkovs: This pull request references Jira Issue OCPBUGS-55238, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Don't display localhost-related disruptions on spyglass. These are still displayed on non-spyglass reports in case unexpected localhost disruption happens

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Apr 24 '25 09:04 openshift-ci-robot

/jira refresh

vrutkovs avatar Apr 24 '25 09:04 vrutkovs

@vrutkovs: This pull request references Jira Issue OCPBUGS-55238, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @wangke19

In response to this:

/jira refresh

openshift-ci-robot avatar Apr 24 '25 09:04 openshift-ci-robot

Risk analysis has seen new tests most likely introduced by this PR. Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: c900caa870cc47f362d593748254b99121307085

| Job Name | New Test Risk |
| --- | --- |
| pull-ci-openshift-origin-main-e2e-aws-ovn-serial | Medium - "Find all of the input images from ocp/4.20 and tag them into the stable stream" is a new test, and was only seen in one job. |
| pull-ci-openshift-origin-main-e2e-aws-ovn-serial | Medium - "Find all of the input images from ocp/4.20 and tag them into the stable-initial stream" is a new test, and was only seen in one job. |

New tests seen in this PR at sha: c900caa870cc47f362d593748254b99121307085

  • "Find all of the input images from ocp/4.20 and tag them into the stable stream" [Total: 1, Pass: 1, Fail: 0, Flake: 0]
  • "Find all of the input images from ocp/4.20 and tag them into the stable-initial stream" [Total: 1, Pass: 1, Fail: 0, Flake: 0]

openshift-trt[bot] avatar May 02 '25 12:05 openshift-trt[bot]

The problem with leaving expected disruption in the data and only hiding it in the UI is that a larger system monitors disruption data, and every part of it needs the same accommodation. Otherwise it flags localhost disruption as real disruption and starts monitoring it for changes. That system includes the Grafana dashboard, the alerts in the dpcr cluster, the metrics published by sippy for those alerts, and the scheduled queries in BigQuery used for reporting.

Do you intend to have this monitored for changes in disruption and pursue fixes for those issues?

If so, then maybe we leave it in (but we wouldn't want to hide it on interval charts).

If not, these intervals really should be classified with a different source. That would immediately remove them from the analysis framework, and they would not appear in this chart.

Also remember the new intervals UI under debug tools is at https://github.com/openshift/sippy/blob/main/sippy-ng/src/prow_job_runs/IntervalsChart.js and it is largely based on categorizing by Source.
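
A minimal sketch of the reclassification idea described above, not the actual origin code. It assumes an interval carries a `source` field and that downstream analysis selects intervals by that field. The `Interval` type, the `classify` and `for_analysis` helpers, the backend names, and the "name contains localhost" rule are all hypothetical. The source names `Disruption` and `DisruptionLocalhost` are taken from the Sippy URL that appears later in this thread.

```python
from dataclasses import dataclass

# Source names as they appear in the Sippy URL later in this thread.
SOURCE_DISRUPTION = "Disruption"
SOURCE_DISRUPTION_LOCALHOST = "DisruptionLocalhost"


@dataclass(frozen=True)
class Interval:
    """Hypothetical, simplified stand-in for a monitoring interval."""
    source: str
    backend: str  # hypothetical field: name of the probed backend
    message: str


def classify(backend: str, message: str) -> Interval:
    """Give localhost-backed disruption its own source (assumed naming rule)."""
    is_localhost = "localhost" in backend
    source = SOURCE_DISRUPTION_LOCALHOST if is_localhost else SOURCE_DISRUPTION
    return Interval(source=source, backend=backend, message=message)


def for_analysis(intervals):
    """Keep only what a (hypothetical) disruption analysis would consume."""
    return [i for i in intervals if i.source == SOURCE_DISRUPTION]


# Backend names below are made up for illustration.
intervals = [
    classify("kube-api-new-connections", "stopped responding"),
    classify("kube-api-localhost-new-connections", "stopped responding"),
]
analyzed = for_analysis(intervals)
print([i.backend for i in analyzed])
```

Because selection is by source, anything that already filters on `Disruption` would drop the localhost intervals without further changes, while a chart that lists sources could still offer them as a separate toggle.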

dgoodwin avatar Jul 21 '25 11:07 dgoodwin

Do you intend to have this monitored for changes in disruption and pursue fixes for those issues?

Localhost disruptions are expected when a pod restarts (on rollout). Displaying them may be misleading, since in most cases they are expected to happen.

If so, then maybe we leave it in (but we wouldn't want to hide it on interval charts).

We're hiding them on the main chart, but leaving them on non-spyglass charts for completeness.

If not, these intervals really should be classified with a different source. That would immediately remove them from the analysis framework, and they would not appear in this chart.

I don't think these are being sent for analysis anyway.

vrutkovs avatar Jul 21 '25 16:07 vrutkovs

They have been spamming #trt-alerts for weeks now, up to and including today, so they are definitely going into the analysis system.

Can you skip generating the intervals when it's expected?

dgoodwin avatar Jul 21 '25 17:07 dgoodwin

I think it's easier to move them to a different source

vrutkovs avatar Jul 22 '25 08:07 vrutkovs

This looks great, thank you, just waiting to see the resulting files.

dgoodwin avatar Jul 22 '25 11:07 dgoodwin

https://sippy.dptools.openshift.org/sippy-ng/job_runs/1947614946313375744/pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade/openshift_origin/29710/intervals?end=2025-07-22T13%3A43%3A07Z&filterText=&intervalFile=e2e-timelines_spyglass_20250722-123441.json&overrideDisplayFlag=0&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState&selectedSources=DisruptionLocalhost&start=2025-07-22T11%3A56%3A40Z
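
As a hedged illustration of what the link above selects, the query string can be decoded with Python's standard `urllib.parse`. This only reads the URL as posted; it makes no assumption about how Sippy handles these parameters internally.

```python
from urllib.parse import urlsplit, parse_qs

# The URL exactly as posted in the comment, split across lines for readability.
url = (
    "https://sippy.dptools.openshift.org/sippy-ng/job_runs/1947614946313375744/"
    "pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade/openshift_origin/29710/intervals"
    "?end=2025-07-22T13%3A43%3A07Z&filterText="
    "&intervalFile=e2e-timelines_spyglass_20250722-123441.json"
    "&overrideDisplayFlag=0"
    "&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing"
    "&selectedSources=OperatorDegraded&selectedSources=KubeletLog"
    "&selectedSources=EtcdLog&selectedSources=EtcdLeadership"
    "&selectedSources=Alert&selectedSources=Disruption"
    "&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown"
    "&selectedSources=KubeEvent&selectedSources=NodeState"
    "&selectedSources=DisruptionLocalhost"
    "&start=2025-07-22T11%3A56%3A40Z"
)

# parse_qs percent-decodes values and groups repeated keys into lists.
params = parse_qs(urlsplit(url).query)
sources = params["selectedSources"]

print(params["intervalFile"][0])
print(params["start"][0], "to", params["end"][0])
for name in sources:
    print(name)
```

The decoded list contains both `Disruption` and `DisruptionLocalhost`, which shows the localhost intervals in this run are exposed as their own selectable source.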

Looks good to me.

/lgtm
/hold

Release when you're happy with the results.

dgoodwin avatar Jul 22 '25 14:07 dgoodwin

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Jul 22 '25 14:07 openshift-ci[bot]

/hold cancel

Yup, looks good

vrutkovs avatar Jul 22 '25 16:07 vrutkovs

/retest-required

Remaining retests: 0 against base HEAD af0e85d21e1c8f02c7c0272b4a7e6f0d6f9db314 and 2 for PR HEAD da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 in total

openshift-ci-robot avatar Jul 22 '25 16:07 openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 47eed7a6649663d0685122c61e44f7a0a63049b0 and 1 for PR HEAD da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 in total

openshift-ci-robot avatar Jul 23 '25 00:07 openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD b392d63f16d05e3a6d8e4673a67d362ccc0f6de3 and 2 for PR HEAD da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 in total

openshift-ci-robot avatar Jul 23 '25 06:07 openshift-ci-robot

@vrutkovs: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback | c900caa870cc47f362d593748254b99121307085 | link | false | /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback |
| ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview | c900caa870cc47f362d593748254b99121307085 | link | false | /test e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview |
| ci/prow/okd-e2e-gcp | c900caa870cc47f362d593748254b99121307085 | link | false | /test okd-e2e-gcp |
| ci/prow/e2e-gcp-fips-serial | c900caa870cc47f362d593748254b99121307085 | link | false | /test e2e-gcp-fips-serial |
| ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-techpreview | c900caa870cc47f362d593748254b99121307085 | link | false | /test e2e-metal-ipi-ovn-dualstack-bgp-techpreview |
| ci/prow/e2e-metal-ipi-serial | c900caa870cc47f362d593748254b99121307085 | link | false | /test e2e-metal-ipi-serial |
| ci/prow/e2e-metal-ipi-serial-ovn-ipv6 | c900caa870cc47f362d593748254b99121307085 | link | false | /test e2e-metal-ipi-serial-ovn-ipv6 |
| ci/prow/e2e-aws-ovn-serial | c900caa870cc47f362d593748254b99121307085 | link | true | /test e2e-aws-ovn-serial |
| ci/prow/e2e-aws-ovn-serial-publicnet | c900caa870cc47f362d593748254b99121307085 | link | true | /test e2e-aws-ovn-serial-publicnet |
| ci/prow/e2e-aws-ovn-kube-apiserver-rollout | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-aws-ovn-kube-apiserver-rollout |
| ci/prow/e2e-gcp-ovn-rt-upgrade | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-gcp-ovn-rt-upgrade |
| ci/prow/e2e-aws-ovn-etcd-scaling | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-aws-ovn-etcd-scaling |
| ci/prow/okd-scos-e2e-aws-ovn | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-gcp-disruptive | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-gcp-disruptive |
| ci/prow/e2e-gcp-fips-serial-2of2 | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-gcp-fips-serial-2of2 |
| ci/prow/e2e-azure-ovn-etcd-scaling | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-azure-ovn-etcd-scaling |
| ci/prow/e2e-openstack-serial | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-openstack-serial |
| ci/prow/e2e-azure-ovn-upgrade | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-azure-ovn-upgrade |
| ci/prow/e2e-gcp-ovn-techpreview | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-gcp-ovn-techpreview |
| ci/prow/e2e-openstack-ovn | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-aws-disruptive | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-aws-ovn-microshift-serial | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-aws-ovn-microshift-serial |
| ci/prow/e2e-gcp-ovn-etcd-scaling | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-gcp-ovn-etcd-scaling |
| ci/prow/e2e-aws-ovn-microshift | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-aws-ovn-microshift |
| ci/prow/e2e-gcp-fips-serial-1of2 | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-gcp-fips-serial-1of2 |
| ci/prow/e2e-gcp-ovn-techpreview-serial-2of2 | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-gcp-ovn-techpreview-serial-2of2 |
| ci/prow/e2e-aws-ovn-single-node-upgrade | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-aws-ovn-single-node-upgrade |
| ci/prow/e2e-vsphere-ovn-dualstack-primaryv6 | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-vsphere-ovn-dualstack-primaryv6 |
| ci/prow/e2e-vsphere-ovn-etcd-scaling | da1a05cc62aef50f4ac6e3cd1c6a632c589628e5 | link | false | /test e2e-vsphere-ovn-etcd-scaling |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Jul 23 '25 10:07 openshift-ci[bot]

Job Failure Risk Analysis for sha: da1a05cc62aef50f4ac6e3cd1c6a632c589628e5

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-disruptive IncompleteTests
Tests for this run (106) are below the historical average (341): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-gcp-ovn-etcd-scaling Low
[bz-etcd][invariant] alert/etcdMembersDown should not be at or above info
This test has passed 0.00% of 1 runs on release 4.20 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:gcp SecurityMode:default Topology:ha Upgrade:none] in the last week.

Open Bugs
etcd-scaling jobs failing ~60% of the time
---
[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Degraded
This test has passed 0.00% of 1 runs on release 4.20 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:gcp SecurityMode:default Topology:ha Upgrade:none] in the last week.

Open Bugs
etcd-scaling jobs failing ~60% of the time

openshift-trt[bot] avatar Jul 23 '25 10:07 openshift-trt[bot]

@vrutkovs: Jira Issue OCPBUGS-55238: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-55238 has been moved to the MODIFIED state.

In response to this:

Don't display localhost-related disruptions on spyglass. These are still displayed on non-spyglass reports in case unexpected localhost disruption happens

openshift-ci-robot avatar Jul 23 '25 13:07 openshift-ci-robot

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-tests This PR has been included in build openshift-enterprise-tests-container-v4.20.0-202507231546.p0.g848143e.assembly.stream.el9. All builds following this will include this PR.

openshift-bot avatar Jul 23 '25 19:07 openshift-bot

/cherry-pick release-4.19

wangke19 avatar Jul 28 '25 12:07 wangke19

@wangke19: new pull request created: #30023

In response to this:

/cherry-pick release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.