
fix: FailureThreshold and SuccessThreshold do not take effect

FengXingYuXin opened this pull request 2 years ago · 11 comments

What this PR does / why we need it: the cluster health check's threshold config does not take effect.

Which issue(s) this PR fixes: Fixes #1496

Special notes for your reviewer:

FengXingYuXin · Mar 09 '22

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: FengXingYuXin / name: FengXingYuXin (b85d3904ea6e657533182ec71d09fcd5982695c8)

Welcome @FengXingYuXin!

It looks like this is your first PR to kubernetes-sigs/kubefed 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kubefed has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😄

k8s-ci-robot · Mar 09 '22

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by: FengXingYuXin. To complete the pull request process, please assign hectorj2f after the PR has been reviewed. You can assign the PR to them by writing /assign @hectorj2f in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot · Mar 09 '22

cc @irfanurrehman @hectorj2f Could you please take a look?

RainbowMango · Mar 21 '22

@FengXingYuXin Thanks for doing this. Will it be possible for you to add some kind of a test for this change?

irfanurrehman · Mar 22 '22

> @FengXingYuXin Thanks for doing this. Will it be possible for you to add some kind of a test for this change?

@FengXingYuXin nevermind, there are unit tests for this and they seem to fail with your change. Please take a look.

irfanurrehman · Mar 22 '22

> @FengXingYuXin Thanks for doing this. Will it be possible for you to add some kind of a test for this change?
>
> @FengXingYuXin nevermind, there are unit tests for this and they seem to fail with your change. Please take a look.

@irfanurrehman Thanks for your reply and the reminder. I have fixed the unit test cases; please take a look when you get a chance.

FengXingYuXin · Mar 23 '22

@irfanurrehman About the timing of the status transition: for example, if the failure threshold is 3, I understand it is better to transition the status from Ready to NotReady when the probe fails for the third consecutive time, but the existing code makes the transition on the 4th probe. If you agree, I can adjust this later.

FengXingYuXin · Mar 25 '22

> @irfanurrehman About the timing of the status transition: for example, if the failure threshold is 3, I understand it is better to transition the status from Ready to NotReady when the probe fails for the third consecutive time, but the existing code makes the transition on the 4th probe. If you agree, I can adjust this later.

@FengXingYuXin Apologies for the late reply; I had a chance to look at your changes this weekend, thanks for them. I find the change a little unclean, though, and I agree that this portion of code could do with an overhaul: the logic in thresholdAdjustedClusterStatus() can be rewritten to make it simpler. To fix only the issue you raised, I implemented a quick fix that works with the existing test cases without changing them, along with a new test case I added to cover your issue. If that seems fine to you, please pull the changes from here into your PR.

If you are, however, interested in rewriting the logic to make it easier to understand and more maintainable, I recommend the following: keep the current ClusterData.clusterStatus field as is and use it for the last sampling data, updated on each probe.

Introduce a new field in cluster data:

	// clusterStatus of the last observed transition.
	transitionStatus *fedv1b1.KubeFedClusterStatus

and use it to store the observed transition when it is first observed. A recommended starting point for the code is below (you will need to complete the logic and update the tests accordingly):


    if storedData.clusterStatus == nil {
        // First sample for this cluster.
        storedData.resultRun = 1
        return clusterStatus
    }

    threshold := clusterHealthCheckConfig.FailureThreshold
    if util.IsClusterReady(clusterStatus) {
        threshold = clusterHealthCheckConfig.SuccessThreshold
    }

    if !clusterStatusEqual(clusterStatus, storedData.clusterStatus) {
        // We observe a transition.
        if storedData.transitionStatus == nil {
            // This is the first time we observe the transition.
            storedData.transitionStatus = clusterStatus
            storedData.resultRun = 1
        } else {
            storedData.resultRun++
        }
        if storedData.resultRun < threshold {
            // The run of identical results is still below the threshold -
            // leave the reported status unchanged, but refresh the probe time.
            probeTime := clusterStatus.Conditions[0].LastProbeTime
            clusterStatus = storedData.clusterStatus
            setProbeTime(clusterStatus, probeTime)
        }
        // TODO: once the threshold is reached, accept the transition and
        // reset transitionStatus and resultRun (left to complete, as noted).
    } else {
        storedData.resultRun++
    }

irfanurrehman · Apr 11 '22

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jul 10 '22

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Aug 09 '22

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot · Sep 08 '22

@k8s-triage-robot: Closed this PR.

In response to this:

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
>
>   • After 90d of inactivity, lifecycle/stale is applied
>   • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
>   • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
>
> You can:
>
>   • Reopen this issue or PR with /reopen
>   • Mark this issue or PR as fresh with /remove-lifecycle rotten
>   • Offer to help out with Issue Triage
>
> Please send feedback to sig-contributor-experience at kubernetes/community.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Sep 08 '22