hyperconverged-cluster-operator icon indicating copy to clipboard operation
hyperconverged-cluster-operator copied to clipboard

Add OperatorConditionsUnhealthy alerts

Open machadovilaca opened this issue 1 year ago • 10 comments

What this PR does / why we need it:

Add alerts based on kubevirt_hco_system_health_status

The alert includes the reason of the condition that sets the health to a given status. For that, kubevirt_hco_system_health_status was updated to add a reason label, that has the value of the reason field in the underlying condition

Reviewer Checklist

Reviewers are supposed to review the PR for every aspect below one by one. To check an item means the PR is either "OK" or "Not Applicable" in terms of that item. All items are supposed to be checked before merging a PR.

  • [ ] PR Message
  • [ ] Commit Messages
  • [ ] How to test
  • [ ] Unit Tests
  • [ ] Functional Tests
  • [ ] User Documentation
  • [ ] Developer Documentation
  • [ ] Upgrade Scenario
  • [ ] Uninstallation Scenario
  • [ ] Backward Compatibility
  • [ ] Troubleshooting Friendly

Jira Ticket:

https://issues.redhat.com/browse/CNV-48799

Release note:

Add OperatorConditionsUnhealthy alerts

machadovilaca avatar Oct 21 '24 10:10 machadovilaca

Pull Request Test Coverage Report for Build 11723222014

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 59 of 59 (100.0%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.1%) to 72.595%

Totals Coverage Status
Change from base Build 11611012763: 0.1%
Covered Lines: 6082
Relevant Lines: 8378

💛 - Coveralls

coveralls avatar Oct 21 '24 11:10 coveralls

hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure

hco-bot avatar Oct 21 '24 15:10 hco-bot

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-kv-smoke-azure

In response to this:

hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

kubevirt-bot avatar Oct 21 '24 15:10 kubevirt-bot

@machadovilaca In this PR you are adding a new alert but also updating the metrics with the reason. Please explain in the description how the reason is calculated.

sradco avatar Oct 28 '24 12:10 sradco

@machadovilaca In this PR you are adding a new alert but also updating the metrics with the reason. Please explain in the description how the reason is calculated.

@sradco added

machadovilaca avatar Oct 28 '24 14:10 machadovilaca

hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure

hco-bot avatar Oct 29 '24 17:10 hco-bot

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-kv-smoke-azure

In response to this:

hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

kubevirt-bot avatar Oct 29 '24 17:10 kubevirt-bot

@machadovilaca can you please add an example for the metrics with the new reason label? And explain what happens if there are a few conditions that are not healthy.

sradco avatar Oct 31 '24 10:10 sradco

@avlitman please review

sradco avatar Oct 31 '24 10:10 sradco

@sradco - it currently does not pass CI - including sanity test.

I think it needs at least #3153

nunnatsa avatar Oct 31 '24 11:10 nunnatsa

@machadovilaca, I find the name of the alert confusing. The term OperatorCondition is used for something else in OLM, and it may misguide users.

nunnatsa avatar Oct 31 '24 11:10 nunnatsa

@machadovilaca, I find the name of the alert confusing. The term OperatorCondition is used for something else in OLM, and it may misguide users.

@nunnatsa how would you call this alert? It is firing based on the HCO CR conditions.

sradco avatar Nov 04 '24 11:11 sradco

@machadovilaca, I find the name of the alert confusing. The term OperatorCondition is used for something else in OLM, and it may misguide users.

@nunnatsa how would you call this alert? It is firing based on the HCO CR conditions.

Maybe HCOOperatorConditionUnhealthy or HyperConvergedConditionsUnhealthy?

nunnatsa avatar Nov 04 '24 12:11 nunnatsa

@machadovilaca, I find the name of the alert confusing. The term OperatorCondition is used for something else in OLM, and it may misguide users.

@nunnatsa how would you call this alert? It is firing based on the HCO CR conditions.

Maybe HCOOperatorConditionUnhealthy or HyperConvergedConditionsUnhealthy?

I support HCOOperatorConditionUnhealthy name.

avlitman avatar Nov 04 '24 13:11 avlitman

@machadovilaca need to rebase (operatorrules should be operatorRegistry) and change alert name. other than that looks good to me

avlitman avatar Nov 04 '24 13:11 avlitman

@nunnatsa @machadovilaca I would like to use a generic alert name that can be used by other operators. That is why HCO is not in the alert name. The operator name is in a label.

sradco avatar Nov 04 '24 15:11 sradco

hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure

hco-bot avatar Nov 04 '24 19:11 hco-bot

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-kv-smoke-azure

In response to this:

hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

kubevirt-bot avatar Nov 04 '24 19:11 kubevirt-bot

/approve

avlitman avatar Nov 05 '24 11:11 avlitman

@machadovilaca - the "should fired when KubeDescheduler is installed and not properly configured for KubeVirt" e2e test is failing.

The reason for that is the addition of the reason label, while in HCOPrometheusClient::GetHCOMetric(), we only remove the query text from the result metric line, and then trying to convert text into int, but we still have the label in the text so the conversion fails.

I suggest to change tests/func-tests/hco_prometheus_route.go:

var metricValueRegex = regexp.MustCompile(` (\d+(?:\.\d+)?)$`) // <== add this

func (hcoCli HCOPrometheusClient) GetHCOMetric(ctx context.Context, query string) (float64, error) {
	req, err := http.NewRequest(http.MethodGet, hcoCli.url, nil)
	if err != nil {
		return 0, fmt.Errorf("failed to create request: %w", err)
	}

	resp, err := hcoCli.cli.Do(req.WithContext(ctx))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("failed to read the temp route status: %s", resp.Status)
	}

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, query) {
			groups := metricValueRegex.FindStringSubmatch(line)  // <== add this
			if len(groups) != 2 { // <== add this
				return 0, fmt.Errorf("no numeric value in: %q", line) // <== add this
			}
			
			res, err := strconv.ParseFloat(groups[1], 64) // <== change this
			if err != nil {
				return 0, fmt.Errorf("error converting %s to int: %v\n", line, err)
			}
			return res, nil
		}
	}
	return 0, nil
}

nunnatsa avatar Nov 07 '24 06:11 nunnatsa

@nunnatsa since the exposition format is always in the format

$METRIC{$LABELS} $VALUE $TIMESTAMP

what do you think about:

			parts := strings.Fields(line)
			if len(parts) < 2 {
				return 0, fmt.Errorf("metric line does not contain a value")
			}
			res, err := strconv.ParseFloat(strings.TrimSpace(parts[1]), 64)

?

would work for every possible allowed format

ref: https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md

machadovilaca avatar Nov 07 '24 11:11 machadovilaca

your change makes sense. What I don't understand is how it's worked until now, and still working for other metrics, because we never had issues with the timestamp with the old code of

res, err := strconv.ParseFloat(strings.TrimSpace(strings.TrimPrefix(line, query)), 64)

nunnatsa avatar Nov 07 '24 11:11 nunnatsa

hco-e2e-upgrade-prev-operator-sdk-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws hco-e2e-upgrade-operator-sdk-sno-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws hco-e2e-upgrade-operator-sdk-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-operator-sdk-aws hco-e2e-upgrade-prev-operator-sdk-sno-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-aws

hco-bot avatar Nov 07 '24 15:11 hco-bot

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-operator-sdk-aws, ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws, ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws, ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-aws

In response to this:

hco-e2e-upgrade-prev-operator-sdk-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws hco-e2e-upgrade-operator-sdk-sno-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws hco-e2e-upgrade-operator-sdk-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-operator-sdk-aws hco-e2e-upgrade-prev-operator-sdk-sno-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

kubevirt-bot avatar Nov 07 '24 15:11 kubevirt-bot

/lgtm /approve

nunnatsa avatar Nov 07 '24 15:11 nunnatsa

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: avlitman, nunnatsa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • ~~OWNERS~~ [avlitman,nunnatsa]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

kubevirt-bot avatar Nov 07 '24 15:11 kubevirt-bot

hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure

hco-bot avatar Nov 07 '24 16:11 hco-bot

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-kv-smoke-azure

In response to this:

hco-e2e-kv-smoke-gcp lane succeeded. /override ci/prow/hco-e2e-kv-smoke-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

kubevirt-bot avatar Nov 07 '24 16:11 kubevirt-bot

hco-e2e-upgrade-operator-sdk-azure lane succeeded. /override ci/prow/hco-e2e-upgrade-operator-sdk-aws hco-e2e-operator-sdk-azure, hco-e2e-operator-sdk-gcp lanes succeeded. /override ci/prow/hco-e2e-operator-sdk-aws

hco-bot avatar Nov 07 '24 16:11 hco-bot