hyperconverged-cluster-operator
hyperconverged-cluster-operator copied to clipboard
Add OperatorConditionsUnhealthy alerts
What this PR does / why we need it:
Add alerts based on kubevirt_hco_system_health_status
The alert includes the reason of the condition that sets the health to a given status. For that, kubevirt_hco_system_health_status was updated to add a reason label, that has the value of the reason field in the underlying condition
Reviewer Checklist
Reviewers are supposed to review the PR for every aspect below one by one. To check an item means the PR is either "OK" or "Not Applicable" in terms of that item. All items are supposed to be checked before merging a PR.
- [ ] PR Message
- [ ] Commit Messages
- [ ] How to test
- [ ] Unit Tests
- [ ] Functional Tests
- [ ] User Documentation
- [ ] Developer Documentation
- [ ] Upgrade Scenario
- [ ] Uninstallation Scenario
- [ ] Backward Compatibility
- [ ] Troubleshooting Friendly
Jira Ticket:
https://issues.redhat.com/browse/CNV-48799
Release note:
Add OperatorConditionsUnhealthy alerts
Pull Request Test Coverage Report for Build 11723222014
Warning: This coverage report may be inaccurate.
This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
- For more information on this, see Tracking coverage changes with pull request builds.
- To avoid this issue with future PRs, see these Recommended CI Configurations.
- For a quick fix, rebase this PR at GitHub. Your next report should be accurate.
Details
- 59 of 59 (100.0%) changed or added relevant lines in 4 files are covered.
- No unchanged relevant lines lost coverage.
- Overall coverage increased (+0.1%) to 72.595%
| Totals | |
|---|---|
| Change from base Build 11611012763: | 0.1% |
| Covered Lines: | 6082 |
| Relevant Lines: | 8378 |
💛 - Coveralls
hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-kv-smoke-azure
In response to this:
hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@machadovilaca In this PR you are adding a new alert but also updating the metrics with the reason. Please explain in the description how the reason is calculated.
@machadovilaca In this PR you are adding a new alert but also updating the metrics with the reason. Please explain in the description how the reason is calculated.
@sradco added
hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-kv-smoke-azure
In response to this:
hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@machadovilaca can you please add an example for the metrics with the new reason label? And explain what happens if there are a few conditions that are not healthy.
@avlitman please review
@sradco - it currently does not pass CI - including sanity test.
I think it needs at least #3153
@machadovilaca, I find the name of the alert confusing. The term OperatorCondition is used for something else in OLM, and it may misguide users.
@machadovilaca, I find the name of the alert confusing. The term
OperatorConditionis used for something else in OLM, and it may misguide users.
@nunnatsa how would you call this alert? It is firing based on the HCO CR conditions.
@machadovilaca, I find the name of the alert confusing. The term
OperatorConditionis used for something else in OLM, and it may misguide users.@nunnatsa how would you call this alert? It is firing based on the HCO CR conditions.
Maybe HCOOperatorConditionUnhealthy or HyperConvergedConditionsUnhealthy?
@machadovilaca, I find the name of the alert confusing. The term
OperatorConditionis used for something else in OLM, and it may misguide users.@nunnatsa how would you call this alert? It is firing based on the HCO CR conditions.
Maybe
HCOOperatorConditionUnhealthyorHyperConvergedConditionsUnhealthy?
I support HCOOperatorConditionUnhealthy name.
@machadovilaca need to rebase (operatorrules should be operatorRegistry) and change alert name. other than that looks good to me
@nunnatsa @machadovilaca I would like to use a generic alert name that can be used by other operators. That is why HCO is not in the alert name. The operator name is in a label.
hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-kv-smoke-azure
In response to this:
hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/approve
@machadovilaca - the "should fired when KubeDescheduler is installed and not properly configured for KubeVirt" e2e test is failing.
The reason for that is the addition of the reason label, while in HCOPrometheusClient::GetHCOMetric(), we only remove the query text from the result metric line, and then trying to convert text into int, but we still have the label in the text so the conversion fails.
I suggest to change tests/func-tests/hco_prometheus_route.go:
var metricValueRegex = regexp.MustCompile(` (\d+(?:\.\d+)?)$`) // <== add this
func (hcoCli HCOPrometheusClient) GetHCOMetric(ctx context.Context, query string) (float64, error) {
req, err := http.NewRequest(http.MethodGet, hcoCli.url, nil)
if err != nil {
return 0, fmt.Errorf("failed to create request: %w", err)
}
resp, err := hcoCli.cli.Do(req.WithContext(ctx))
if err != nil {
return 0, err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
return 0, fmt.Errorf("failed to read the temp route status: %s", resp.Status)
}
scanner := bufio.NewScanner(resp.Body)
for scanner.Scan() {
line := scanner.Text()
if strings.HasPrefix(line, query) {
groups := metricValueRegex.FindStringSubmatch(line) // <== add this
if len(groups) != 2 { // <== add this
return 0, fmt.Errorf("no numeric value in: %q", line) // <== add this
}
res, err := strconv.ParseFloat(groups[1], 64) // <== change this
if err != nil {
return 0, fmt.Errorf("error converting %s to int: %v\n", line, err)
}
return res, nil
}
}
return 0, nil
}
@nunnatsa since the exposition format is always in the format
$METRIC{$LABELS} $VALUE $TIMESTAMP
what do you think about:
parts := strings.Fields(line)
if len(parts) < 2 {
return 0, fmt.Errorf("metric line does not contain a value")
}
res, err := strconv.ParseFloat(strings.TrimSpace(parts[1]), 64)
?
would work for every possible allowed format
ref: https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md
your change makes sense. What I don't understand is how it's worked until now, and still working for other metrics, because we never had issues with the timestamp with the old code of
res, err := strconv.ParseFloat(strings.TrimSpace(strings.TrimPrefix(line, query)), 64)
Quality Gate passed
Issues
0 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code
hco-e2e-upgrade-prev-operator-sdk-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws hco-e2e-upgrade-operator-sdk-sno-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws hco-e2e-upgrade-operator-sdk-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-operator-sdk-aws hco-e2e-upgrade-prev-operator-sdk-sno-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-aws
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-operator-sdk-aws, ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws, ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws, ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-aws
In response to this:
hco-e2e-upgrade-prev-operator-sdk-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws hco-e2e-upgrade-operator-sdk-sno-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws hco-e2e-upgrade-operator-sdk-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-operator-sdk-aws hco-e2e-upgrade-prev-operator-sdk-sno-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-aws
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/lgtm /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: avlitman, nunnatsa
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [avlitman,nunnatsa]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-kv-smoke-azure
In response to this:
hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
hco-e2e-upgrade-operator-sdk-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-operator-sdk-aws hco-e2e-operator-sdk-azure, hco-e2e-operator-sdk-gcp lanes succeeded. /override ci/prow/hco-e2e-operator-sdk-aws