
Allow users/community to define healthy status conditions per Kubernetes resource, per version

Cajga opened this issue 9 months ago

Problem

Currently, Weave GitOps reports a "red" status instead of "green" in the graph view for several resource types even though the resource is in fact healthy.

Example: HorizontalPodAutoscaler (apiVersion: autoscaling/v2). It has 3 status conditions:

  • AbleToScale: should be True,
  • ScalingActive: should be True,
  • ScalingLimited: should be False

Solution

Define a way/procedure for how and where users/projects can define the healthy status of a resource at a specific version. This could be a configuration option, a PR to the project, etc.
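To make this concrete, here is a minimal sketch in Go of what such a declarative, per-apiVersion/kind rule set could look like, using the HPA conditions listed above. The ConditionRule type and healthRules map are hypothetical names used purely for illustration; they are not an existing weave-gitops API.

package main

import "fmt"

// ConditionRule states which status a condition must have for the
// resource to count as healthy. (Hypothetical type for illustration.)
type ConditionRule struct {
	Type           string // condition type, e.g. "AbleToScale"
	ExpectedStatus string // "True" or "False"
}

// healthRules maps "apiVersion/Kind" to the conditions that indicate a
// healthy resource. Users or the community could contribute entries like
// this one via configuration or PRs.
var healthRules = map[string][]ConditionRule{
	"autoscaling/v2/HorizontalPodAutoscaler": {
		{Type: "AbleToScale", ExpectedStatus: "True"},
		{Type: "ScalingActive", ExpectedStatus: "True"},
		{Type: "ScalingLimited", ExpectedStatus: "False"},
	},
}

func main() {
	for _, r := range healthRules["autoscaling/v2/HorizontalPodAutoscaler"] {
		fmt.Printf("%s should be %s\n", r.Type, r.ExpectedStatus)
	}
}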

Additional context

I would be willing to contribute definitions for several resources if the process for doing so were well defined.

Cajga avatar Nov 20 '23 13:11 Cajga

To determine the health of HPAs, Weave GitOps compares .status.currentReplicas with .status.desiredReplicas and checks each existing condition for a "Failed" or "Invalid" reason.
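For illustration, here is a minimal Go sketch of that logic as described above; it is not the actual weave-gitops code, and whether reasons are matched exactly or by substring is an assumption of this sketch.

package main

import (
	"fmt"
	"strings"
)

// condition mirrors the HPA status condition fields relevant here.
type condition struct {
	Type, Status, Reason string
}

// hpaStatus mirrors the relevant parts of an HPA's .status.
type hpaStatus struct {
	CurrentReplicas, DesiredReplicas int32
	Conditions                       []condition
}

// hpaHealthy applies the described rules: current replicas must match
// desired replicas, and no condition reason may indicate "Failed" or
// "Invalid". (Substring matching is an assumption of this sketch.)
func hpaHealthy(s hpaStatus) bool {
	if s.CurrentReplicas != s.DesiredReplicas {
		return false
	}
	for _, c := range s.Conditions {
		if strings.Contains(c.Reason, "Failed") || strings.Contains(c.Reason, "Invalid") {
			return false
		}
	}
	return true
}

func main() {
	// Example status with 1/1 replicas and no Failed/Invalid reasons.
	s := hpaStatus{
		CurrentReplicas: 1,
		DesiredReplicas: 1,
		Conditions: []condition{
			{Type: "AbleToScale", Status: "True", Reason: "ScaleDownStabilized"},
			{Type: "ScalingActive", Status: "True", Reason: "ValidMetricFound"},
			{Type: "ScalingLimited", Status: "False", Reason: "DesiredWithinRange"},
		},
	}
	fmt.Println("healthy:", hpaHealthy(s)) // prints: healthy: true
}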

@Cajga would you mind posting the complete .status object of your HPA here?

kubectl get hpa -n NAMESPACE HPA -o jsonpath={.status}

makkes avatar Nov 20 '23 13:11 makkes

@makkes thanks for looking into this. Sure, here it is:

# kubectl get hpa -n istio-system istiod -o jsonpath={.status}|jq
{
  "conditions": [
    {
      "lastTransitionTime": "2023-11-17T15:15:50Z",
      "message": "recent recommendations were higher than current one, applying the highest recent recommendation",
      "reason": "ScaleDownStabilized",
      "status": "True",
      "type": "AbleToScale"
    },
    {
      "lastTransitionTime": "2023-11-17T15:16:20Z",
      "message": "the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)",
      "reason": "ValidMetricFound",
      "status": "True",
      "type": "ScalingActive"
    },
    {
      "lastTransitionTime": "2023-11-18T19:10:45Z",
      "message": "the desired count is within the acceptable range",
      "reason": "DesiredWithinRange",
      "status": "False",
      "type": "ScalingLimited"
    }
  ],
  "currentMetrics": [
    {
      "resource": {
        "current": {
          "averageUtilization": 0,
          "averageValue": "3m"
        },
        "name": "cpu"
      },
      "type": "Resource"
    }
  ],
  "currentReplicas": 1,
  "desiredReplicas": 1
}

NOTE: this is a default installation of Istio (with metrics-server) in production, which would scale automatically if needed.

Cajga avatar Nov 20 '23 14:11 Cajga

Let me drop here another example also from Istio's default installation:

# kubectl get poddisruptionbudgets.policy -n istio-system istiod -o jsonpath={.status}|jq
{
  "conditions": [
    {
      "lastTransitionTime": "2023-11-17T15:15:45Z",
      "message": "",
      "observedGeneration": 1,
      "reason": "InsufficientPods",
      "status": "False",
      "type": "DisruptionAllowed"
    }
  ],
  "currentHealthy": 1,
  "desiredHealthy": 1,
  "disruptionsAllowed": 0,
  "expectedPods": 1,
  "observedGeneration": 1
}

While one could argue that this shows "disruption would not be allowed in this case", this is still a healthy installation of Istio, and the "red" status of the PodDisruptionBudget does not look very nice on the graph.
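As an illustration of what a user-supplied rule for this case could express (a hypothetical rule, not an existing weave-gitops feature): treat DisruptionAllowed=False with reason InsufficientPods as acceptable as long as the budget's healthy-pod targets are met.

package main

import "fmt"

// pdbStatus mirrors the PodDisruptionBudget status fields used below.
type pdbStatus struct {
	CurrentHealthy, DesiredHealthy int32
	DisruptionAllowedStatus        string
	DisruptionAllowedReason        string
}

// pdbHealthy is a hypothetical rule: only a shortfall of healthy pods marks
// the PDB red; "DisruptionAllowed=False" on its own does not.
func pdbHealthy(s pdbStatus) bool {
	return s.CurrentHealthy >= s.DesiredHealthy
}

func main() {
	// The istiod PDB status shown above: 1/1 healthy, disruption not allowed.
	s := pdbStatus{
		CurrentHealthy: 1, DesiredHealthy: 1,
		DisruptionAllowedStatus: "False", DisruptionAllowedReason: "InsufficientPods",
	}
	fmt.Println("healthy:", pdbHealthy(s)) // prints: healthy: true
}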

Cajga avatar Nov 20 '23 14:11 Cajga

@makkes hmm... looking into the code, it seems your test data for HPA in fact does not look good:

    message: the desired replica count is less than the minimum replica count
    reason: TooFewReplicas
    status: "True"
    type: ScalingLimited

I believe this basically means that the HPA would like to scale down but has reached minReplicas. You should take action and reduce minReplicas to allow it to scale down... Red Hat has good documentation about this.

Cajga avatar Nov 20 '23 15:11 Cajga

Thanks for raising the issue!

Sounds like HPA health checking could be improved.

  • Doing good health checking for built-in k8s resources, and at least Flux resources too, would be great to maintain, and we might not need an extensible system for this.
  • Having a more extensible system that allows declaring red/green mappings for less common CustomResources would be neat but needs some thought.

weave gitops reports "red" status instead of "green" at graph view for several resource types

Are the other resource types CustomResources or builtin k8s resources?

foot avatar Nov 24 '23 17:11 foot

Hi @foot,

Sorry for not coming back. We stopped using/evaluating weave-gitops as it does not support Flux multi-tenant config (more details in this ticket).

As far as I remember there were a few more resources reported red in our env, but unfortunately I cannot recall which ones.

Cajga avatar Mar 13 '24 12:03 Cajga