watermarkpodautoscaler
Add prometheus wpa controller reconcile and wpa valid metrics
What does this PR do?
Adds 2 metrics:

- `wpa_controller_reconcile_error`: Reports `1` with tag `reason:<short_error_message>` if the last reconcile results in an error. If there is no error, no metric is reported.
- `wpa_controller_reconcile_success`: Reports `1` if the last reconcile is successful, `0` if there was an error.
Example:

```
# HELP wpa_controller_reconcile_error Gauge indicating whether the last recorded reconcile gave an error
# TYPE wpa_controller_reconcile_error gauge
wpa_controller_reconcile_error{reason="failed_compute_replicas",resource_kind="Deployment",resource_name="redis",resource_namespace="default",wpa_name="four",wpa_namespace="default"} 1
# HELP wpa_controller_reconcile_success Gauge indicating whether the last recorded reconcile is successful
# TYPE wpa_controller_reconcile_success gauge
wpa_controller_reconcile_success{resource_kind="Deployment",resource_name="redis",resource_namespace="default",wpa_name="four",wpa_namespace="default"} 0
```
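To make checking these sample lines scriptable in a test plan, they can be parsed with a small helper. This is a hedged sketch in plain Python (the function name and regex are mine, not part of this PR), and it assumes label values contain no commas, which holds for these metrics:

```python
import re

# Matches one Prometheus text-format sample line, e.g.
#   wpa_controller_reconcile_error{reason="x",wpa_name="four"} 1
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)

def parse_sample(line):
    """Return (metric_name, labels_dict, value) for a non-comment sample line."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError("not a sample line: %r" % line)
    labels = {}
    if m.group("labels"):
        # Simplification: splitting on "," is safe only because these label
        # values are plain identifiers (no embedded commas or escapes).
        for pair in m.group("labels").split(","):
            key, _, val = pair.partition("=")
            labels[key] = val.strip('"')
    return m.group("name"), labels, float(m.group("value"))

name, labels, value = parse_sample(
    'wpa_controller_reconcile_error{reason="failed_compute_replicas",wpa_name="four"} 1'
)
print(name, labels["reason"], value)
# → wpa_controller_reconcile_error failed_compute_replicas 1.0
```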
Motivation
More ways to track WPA and WPA controller errors. There's a `controller_runtime_reconcile_total` metric, but it only has the labels `controller` and `result`, which don't give much detail about the reconcile error.
Describe your test plan
Set up a WPA(s) (example). The WPA should be valid, the target resource should be present, and the Datadog metric should be present and reporting consistently. The `wpa_controller_reconcile_success` metric should be present with value `1` and the following labels:

- `resource_kind`
- `resource_name`
- `resource_namespace`
- `wpa_name`
- `wpa_namespace`
To visualize metrics, either collect them via the node agent with the `prometheus` or `openmetrics` check (example), or check the `/metrics` endpoint:

- Example request to curl the `/metrics` endpoint: `kubectl exec -it <wpa_controller> -- curl localhost:8383/metrics`
- Example using Datadog graphs: error metric, success metric
Update the WPA (and/or target resource) to force an error (examples below) and ensure that the `wpa_controller_reconcile_success` metric reports `0` and that the `wpa_controller_reconcile_error` metric reports `1` with the appropriate `reason` tag. There shouldn't be any stale metrics; the metrics should update accordingly when going from an error to an ok state, from one error to a different error state, and when the WPA is deleted.
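The no-stale-metrics requirement above can be modeled as a tiny state machine. This is a pure-Python sketch under my own naming (the actual controller manipulates client_golang gauges in Go), showing only the transitions the paragraph above demands:

```python
class ReconcileMetrics:
    """Tracks the two gauges for a single WPA, mimicking the expected lifecycle."""

    def __init__(self):
        self.series = {}  # metric name -> (labels tuple, value)

    def record_success(self):
        # Success removes any previous error series and reports success=1.
        self.series.pop("wpa_controller_reconcile_error", None)
        self.series["wpa_controller_reconcile_success"] = ((), 1)

    def record_error(self, reason):
        # A new error replaces the old one (no stale `reason` series) and success=0.
        self.series["wpa_controller_reconcile_error"] = ((("reason", reason),), 1)
        self.series["wpa_controller_reconcile_success"] = ((), 0)

    def delete_wpa(self):
        # Deleting the WPA must drop every series reported for it.
        self.series.clear()

m = ReconcileMetrics()
m.record_error("failed_compute_replicas")
m.record_error("invalid_wpa_spec")  # error -> another error: reason replaced
m.record_success()                  # error -> ok: error series removed
m.delete_wpa()                      # WPA deleted: nothing left
print(m.series)  # → {}
```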
This isn't inclusive of all possible errors (and `reason` values), but here's a list of a few ways to force some errors:

- Use a Datadog metric name that doesn't exist in the account, e.g. `system.load.1.invalid`. This should give metrics with `reason:failed_compute_replicas` and `Failed to compute desired number of replicas based on listed metrics.` logs in the controller pod.
- Trigger a parsing error with `spec.scaleTargetRef.apiVersion` to get `reason:invalid_api_version`:
```yaml
spec:
  (...)
  scaleTargetRef:
    kind: "Deployment"
    name: "redis"
    apiVersion: "apps/v1///"
```
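Why `apps/v1///` fails: Kubernetes parses an apiVersion string as `group/version` and rejects anything with more than one slash. A rough Python rendering of that rule (my own sketch, mirroring the behavior of apimachinery's `schema.ParseGroupVersion`; not code from this PR):

```python
def parse_group_version(gv):
    """Split an apiVersion string into (group, version); raise on extra slashes."""
    if gv in ("", "/"):
        return "", ""
    parts = gv.split("/")
    if len(parts) == 1:   # e.g. "v1" (core group)
        return "", parts[0]
    if len(parts) == 2:   # e.g. "apps/v1"
        return parts[0], parts[1]
    raise ValueError("unexpected GroupVersion string: %s" % gv)

print(parse_group_version("apps/v1"))  # → ('apps', 'v1')
try:
    parse_group_version("apps/v1///")  # extra slashes -> parse error
except ValueError as e:
    print(e)  # → unexpected GroupVersion string: apps/v1///
```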
- Use a mismatched `spec.scaleTargetRef.apiVersion`. For example, `extensions/v1beta1` for Deployments looks to have been deprecated in v1.16, so on a newer Kubernetes cluster with the target Deployment using the `apps/v1` apiVersion, using the old apiVersion results in `reason:unknown_resource` and the log line `unable to determine resource for scale target reference`:
```yaml
spec:
  (...)
  scaleTargetRef:
    kind: "Deployment"
    name: "redis"
    apiVersion: "extensions/v1beta1"
```
- Make the `spec.minReplicas` larger than the `spec.maxReplicas` to show the log message `Invalid WPA specification: watermark pod autoscaler requires the minimum number of replicas to be configured and inferior to the maximum` and the tag `reason:invalid_wpa_spec`:
```yaml
spec:
  (...)
  minReplicas: 3
  maxReplicas: 1
```
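The spec check that the log message describes can be sketched as follows (hypothetical function name; the real validation lives in the controller's Go code, this only restates the rule "minReplicas must be configured and inferior to maxReplicas"):

```python
def check_replica_bounds(min_replicas, max_replicas):
    """Return the reason tag for an invalid min/max pair, or None if valid."""
    # The WPA requires minReplicas to be set and strictly below maxReplicas.
    if min_replicas is None or min_replicas >= max_replicas:
        return "invalid_wpa_spec"
    return None

print(check_replica_bounds(3, 1))  # → invalid_wpa_spec
print(check_replica_bounds(1, 3))  # → None
```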