Add tests verifying that alert expressions are valid
Add a test verifying that the underlying expressions of an alert are still relevant. The goal of such tests is to detect alerts that have drifted out of sync with the platform, whether because a metric was removed or because one of its labels changed. To do so, we check that none of the metrics and selectors used in an alert's expression are absent, that is, each of them still matches at least one series. The trade-off is that we can't tell whether an error-related alert is still relevant, since error metrics are usually absent when no error has occurred.
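The absence check described above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the code from this PR: a real test would presumably parse each alerting rule's expression with a PromQL parser and query the cluster's Prometheus, whereas here `Selector`, `absent_selectors`, the metric names, and the in-memory series list are all made-up stand-ins, and only equality matchers are supported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Selector:
    """Hypothetical, simplified vector selector: a metric name plus equality matchers only."""
    name: str
    matchers: tuple = ()  # tuple of (label, value) pairs

    def matches(self, series):
        # A series is modelled as a dict of labels, with the metric name under "__name__".
        if series.get("__name__") != self.name:
            return False
        return all(series.get(label) == value for label, value in self.matchers)


def absent_selectors(selectors, series_list):
    """Return the selectors that match no series, i.e. the ones absent() would flag."""
    return [s for s in selectors if not any(s.matches(series) for series in series_list)]


# Stand-in for the series currently exposed by the platform (hypothetical names).
series = [
    {"__name__": "up", "job": "my-component"},
    {"__name__": "my_requests_total", "code": "200"},
]

# Stand-in for the selectors extracted from some alert expressions.
selectors = [
    Selector("up", (("job", "my-component"),)),
    Selector("my_requests_total", (("code", "500"),)),  # error series: absent while nothing fails
    Selector("my_removed_metric"),                     # metric dropped from the platform
]

for s in absent_selectors(selectors, series):
    print("absent:", s)
```

The `code="500"` selector shows the trade-off mentioned above: an error series that has simply never been emitted is reported as absent, even though the alert using it may still be perfectly valid.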
One case we recently ran into: the buckets of a histogram metric changed, but an alert using them was not updated. In that case, we want to detect that the alert no longer does anything, so that it can be fixed.
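For the bucket case specifically, a narrower sketch compares the `le` values an expression references with the buckets the metric currently exposes. Assumptions: the metric name and bucket boundaries are hypothetical, and the regular expression is a naive stand-in for a real PromQL parser that only recognizes `le="..."` equality matchers. Once `le="0.5"` disappears, the `_bucket` selector matches nothing, the division returns an empty result, and the alert silently never fires.

```python
import re

# Naive matcher for `le="<value>"` equality matchers (hypothetical helper);
# `\b` keeps labels such as `handle="..."` from being picked up.
LE_MATCHER = re.compile(r'\ble\s*=\s*"([^"]*)"')


def stale_le_values(expr, exposed_buckets):
    """Return the `le` values referenced by `expr` that the metric no longer exposes."""
    referenced = set(LE_MATCHER.findall(expr))
    return sorted(referenced - set(exposed_buckets))


# Hypothetical alert expression relying on a specific bucket boundary.
expr = (
    'sum(rate(my_request_duration_seconds_bucket{le="0.5"}[5m]))'
    ' / sum(rate(my_request_duration_seconds_count[5m])) < 0.9'
)
old_buckets = ["0.1", "0.5", "1", "+Inf"]
new_buckets = ["0.1", "0.25", "1", "+Inf"]  # 0.5 was replaced by 0.25

print(stale_le_values(expr, old_buckets))  # []
print(stale_le_values(expr, new_buckets))  # ['0.5']
```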
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: dgrisonnet
To complete the pull request process, please assign stbenjam after the PR has been reviewed.
You can assign the PR to them by writing /assign @stbenjam in a comment when ready.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/retest
@openshift/openshift-team-monitoring could you please have a look at this PR?
@dgrisonnet: PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
@dgrisonnet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/images | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test images |
| ci/prow/e2e-aws-serial | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-aws-serial |
| ci/prow/e2e-gcp | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-gcp |
| ci/prow/e2e-aws-cgroupsv2 | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | false | /test e2e-aws-cgroupsv2 |
| ci/prow/e2e-aws-fips | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-aws-fips |
| ci/prow/e2e-gcp-builds | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-gcp-builds |
| ci/prow/e2e-gcp-upgrade | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-gcp-upgrade |
| ci/prow/e2e-agnostic-cmd | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | false | /test e2e-agnostic-cmd |
| ci/prow/e2e-aws-single-node | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | false | /test e2e-aws-single-node |
| ci/prow/e2e-aws-csi | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | false | /test e2e-aws-csi |
| ci/prow/verify | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test verify |
| ci/prow/e2e-gcp-csi | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | false | /test e2e-gcp-csi |
| ci/prow/lint | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test lint |
| ci/prow/e2e-aws-ovn-serial | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-aws-ovn-serial |
| ci/prow/e2e-aws-ovn-fips | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-aws-ovn-fips |
| ci/prow/e2e-gcp-ovn | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-gcp-ovn |
| ci/prow/e2e-gcp-ovn-upgrade | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-gcp-ovn-upgrade |
Full PR test history. Your PR dashboard.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
@dgrisonnet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/images | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test images |
| ci/prow/e2e-aws-serial | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-aws-serial |
| ci/prow/e2e-gcp | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-gcp |
| ci/prow/e2e-aws-cgroupsv2 | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | false | /test e2e-aws-cgroupsv2 |
| ci/prow/e2e-aws-fips | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-aws-fips |
| ci/prow/e2e-gcp-builds | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-gcp-builds |
| ci/prow/e2e-gcp-upgrade | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-gcp-upgrade |
| ci/prow/e2e-agnostic-cmd | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | false | /test e2e-agnostic-cmd |
| ci/prow/e2e-aws-single-node | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | false | /test e2e-aws-single-node |
| ci/prow/e2e-aws-csi | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | false | /test e2e-aws-csi |
| ci/prow/verify | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test verify |
| ci/prow/e2e-gcp-csi | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | false | /test e2e-gcp-csi |
| ci/prow/lint | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test lint |
| ci/prow/e2e-aws-ovn-serial | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-aws-ovn-serial |
| ci/prow/e2e-aws-ovn-fips | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-aws-ovn-fips |
| ci/prow/e2e-gcp-ovn | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-gcp-ovn |
| ci/prow/e2e-gcp-ovn-upgrade | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-gcp-ovn-upgrade |
| ci/prow/unit | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test unit |
| ci/prow/e2e-gcp-ovn-image-ecosystem | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-gcp-ovn-image-ecosystem |
| ci/prow/e2e-aws-ovn-image-registry | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-aws-ovn-image-registry |
| ci/prow/e2e-gcp-ovn-builds | 275b8ba1e1a3483c02e548fcc64ca4b0d4801015 | link | true | /test e2e-gcp-ovn-builds |
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closed this PR.