watermarkpodautoscaler icon indicating copy to clipboard operation
watermarkpodautoscaler copied to clipboard

Add prometheus wpa controller reconcile and wpa valid metrics

Open khewonc opened this issue 2 years ago • 0 comments

What does this PR do?

Adds 2 metrics:

  • wpa_controller_reconcile_error: 1 with tag reason:<short_error_message> if the last reconcile results in an error. If there is no error, no metric is reported.
  • wpa_controller_reconcile_success: Reports 1 if the last reconcile is successful, 0 if there was an error.

Example:

# HELP wpa_controller_reconcile_error Gauge indicating whether the last recorded reconcile gave an error
# TYPE wpa_controller_reconcile_error gauge
wpa_controller_reconcile_error{reason="failed_compute_replicas",resource_kind="Deployment",resource_name="redis",resource_namespace="default",wpa_name="four",wpa_namespace="default"} 1
# HELP wpa_controller_reconcile_success Gauge indicating whether the last recorded reconcile is successful
# TYPE wpa_controller_reconcile_success gauge
wpa_controller_reconcile_success{resource_kind="Deployment",resource_name="redis",resource_namespace="default",wpa_name="four",wpa_namespace="default"} 0```

Motivation

More ways to track WPA and WPA controller errors. There's a controller_runtime_reconcile_total metric, but that only has the labels controller and result, which don't give much detail about the reconcile error.

Describe your test plan

Set up a WPA(s) (example). The WPA should be valid, the target resource should be present, and the Datadog metric should be present and reporting consistently. The wpa_controller_reconcile_success metric should be present with value 1 and the following labels:

  • resource_kind
  • resource_name
  • resource_namespace
  • wpa_name
  • wpa_namespace

To visualize metrics, either collect them via the node agent with the prometheus or openmetrics check (example), or check the /metrics endpoint:

  • Example request to curl /metrics endpoint: kubectl exec -it <wpa_controller> -- curl localhost:8383/metrics
  • Example using Datadog graphs: error metric, success metric

Update the WPA (and/or target resource) to force an error (examples below) and ensure that the wpa_controller_reconcile_success metric reports 0 and that the wpa_controller_reconcile_error metric reports 1 with the appropriate reason tag. There shouldn't be any stale metrics; the metrics should update accordingly when going from an error to an ok state, error to another error state, and when the WPA is deleted.

This isn't inclusive of all possible errors (and reason values), but here's a list of a few ways to force some errors:

  • Use a Datadog metric name that doesn't exist in the account, e.g. system.load.1.invalid. This should give metrics with reason:failed_compute_replicas and Failed to compute desired number of replicas based on listed metrics. logs in the controller pod.
  • Trigger a parsing error with spec.scaleTargetRef.apiVersion to get reason:invalid_api_version:
spec:
(...)
  scaleTargetRef:
    kind: "Deployment"
    name: "redis"
    apiVersion: "apps/v1///"
  • Use a mismatched spec.scaleTargetRef.apiVersion. For example, extensions/v1beta1 for Deployments looks to have been deprecated in v1.16 so on a newer Kubernetes cluster with the target Deployment using the apps/v1 apiVersion, using the old apiVersion results in reason:unknown_resource and the log line unable to determine resource for scale target reference:
spec:
(...)
  scaleTargetRef:
    kind: "Deployment"
    name: "redis"
    apiVersion: "extensions/v1beta1"
  • Make the spec.minReplicas larger than the spec.maxReplicas to show the log message Invalid WPA specification: watermark pod autoscaler requires the minimum number of replicas to be configured and inferior to the maximum and the tag reason:invalid_wpa_spec:
spec:
(...)
  minReplicas: 3
  maxReplicas: 1

khewonc avatar Mar 10 '22 15:03 khewonc