kube-state-metrics
kube_state_metrics_watch_total reports success even when the watch fails
What happened:
kube_state_metrics_watch_total is incremented with result "success" while the kube-state-metrics
log shows the error: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "...": dial tcp ...: i/o timeout
What you expected to happen: If it fails to watch, the result is expected to be "error" so we can understand kube-state-metrics behaviour.
How to reproduce it (as minimally and precisely as possible): Deploy kube-state-metrics in a cluster, then make the API server unavailable.
Anything else we need to know?:
A deep dive into the implementation
The metric is incremented according to the error received from the cache reflector (mainly connection errors), but the reflector ignores the error and passes nil to the kube-state-metrics watch call. This may be intentional in the reflector. By comparison, the list call always returns the error, whatever it is.
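To illustrate the symptom, here is a minimal, self-contained sketch of how a swallowed error can end up counted as a success. It is not the kube-state-metrics or client-go code: `watchCounter`, `instrument`, `swallow`, and `failingWatch` are hypothetical names, and the layering is an assumption based on the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// watchCounter is a hypothetical stand-in for a counter labelled by result.
type watchCounter map[string]int

// watchFunc models any call that opens a watch and may fail.
type watchFunc func() error

// instrument wraps w and counts the outcome based only on the error it sees.
func instrument(c watchCounter, w watchFunc) watchFunc {
	return func() error {
		err := w()
		if err != nil {
			c["error"]++
		} else {
			c["success"]++
		}
		return err
	}
}

// swallow models a layer that logs a failure and then reports nil upward.
func swallow(w watchFunc) watchFunc {
	return func() error {
		if err := w(); err != nil {
			fmt.Println("Failed to watch:", err)
		}
		return nil
	}
}

// failingWatch simulates an unreachable API server.
func failingWatch() error { return errors.New("dial tcp: i/o timeout") }

func main() {
	// The error is swallowed before the instrumentation sees it.
	masked := watchCounter{}
	_ = instrument(masked, swallow(failingWatch))()

	// The error reaches the instrumentation unchanged.
	direct := watchCounter{}
	_ = instrument(direct, failingWatch)()

	fmt.Println("masked: success =", masked["success"], "error =", masked["error"])
	fmt.Println("direct: success =", direct["success"], "error =", direct["error"])
}
```

In the masked case the failure is logged but the counter records a success, which matches what we observe; in the direct case the same failure is counted as an error.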
Our use case
We use this metric to define an SLI for the component, so we can tell whether kube-state-metrics is exporting metrics as expected. Since there is no observability into how many metrics kube-state-metrics is expected to generate, we fall back on list_total and watch_total to roughly evaluate its behaviour.
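As a rough sketch of the kind of SLI we mean, the ratio below is computed from the increase of the "success" and "error" results over an evaluation window. The function name `successRatio` and the sample numbers are hypothetical, and treating an empty window as healthy is an assumption of this sketch.

```go
package main

import "fmt"

// successRatio returns success/(success+errs).
// It returns 1 when there were no calls in the window (an assumption of this sketch).
func successRatio(success, errs float64) float64 {
	total := success + errs
	if total == 0 {
		return 1
	}
	return success / total
}

func main() {
	// Hypothetical window: 10 of 100 watch calls failed and were counted as errors.
	fmt.Printf("%.2f\n", successRatio(90, 10)) // prints 0.90

	// Same window, but the 10 failures were counted as successes.
	fmt.Printf("%.2f\n", successRatio(100, 0)) // prints 1.00
}
```

When failed watches are counted as successes, the ratio stays at 1.00 and the SLI cannot detect the outage.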
Other use cases
The metrics are also used in https://monitoring.mixins.dev/kube-state-metrics/
Environment:
- kube-state-metrics version: v2.7.0
- Kubernetes version (use kubectl version): NA
- Cloud provider or hardware configuration: NA
- Other info: NA
/assign @rexagod
/triage accepted