kube-state-metrics: kube_state_metrics_watch_total reports success even if it fails to watch

Open courageJ opened this issue 1 year ago • 2 comments

What happened: kube_state_metrics_watch_total is incremented with result "success" while the kube-state-metrics log shows the error: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "...": dial tcp ...: i/o timeout

What you expected to happen: If the watch fails, the result should be "error" so that we can understand kube-state-metrics' behaviour.

How to reproduce it (as minimally and precisely as possible): Deploy kube-state-metrics in a cluster and make the apiserver unavailable.

Anything else we need to know?:

Some deep dive into the implementation

The metric is incremented according to the error the watch call receives from the cache reflector (mainly connection errors), but the reflector ignores the error and passes nil to the kube-state-metrics watch call, so the result is recorded as "success". This may be by design in the reflector. For comparison, the list call always returns the error, whatever it is.
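To make the described behaviour concrete, here is a minimal, self-contained sketch. It is not the actual kube-state-metrics or client-go code: `counter`, `watchFunc`, `instrumentedWatch`, `swallowing` and `failingWatch` are hypothetical names, and the "reflector drops the error" step is modelled only as the description above states it.

```go
package main

import (
	"errors"
	"fmt"
)

// watchFunc stands in for a watch call (hypothetical type).
type watchFunc func() error

// counter mimics a counter with a "result" label,
// similar in spirit to kube_state_metrics_watch_total.
type counter map[string]int

// instrumentedWatch picks the label purely from the error it is handed.
func instrumentedWatch(c counter, w watchFunc) error {
	if err := w(); err != nil {
		c["error"]++
		return err
	}
	c["success"]++
	return nil
}

// swallowing logs the error and returns nil instead.
// Assumption: this models the error being ignored before it reaches the counter.
func swallowing(w watchFunc) watchFunc {
	return func() error {
		if err := w(); err != nil {
			fmt.Println("Failed to watch:", err) // logged, not returned
		}
		return nil
	}
}

// failingWatch simulates an unreachable apiserver.
func failingWatch() error { return errors.New("dial tcp: i/o timeout") }

func main() {
	c := counter{}
	_ = instrumentedWatch(c, swallowing(failingWatch))
	fmt.Printf("success=%d error=%d\n", c["success"], c["error"])
}
```

In this sketch the failure is only visible in the log line; the counter records one "success" and no "error", which matches the symptom reported here.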

Our use case

We use this metric to define an SLI for the component, to understand whether kube-state-metrics is exporting metrics as expected. Since there is no observability into how many metrics kube-state-metrics is expected to generate, we fall back on list_total and watch_total to roughly evaluate its behaviour.
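As a rough illustration of why the mislabelled result matters for such an SLI, the sketch below computes a success ratio from two counter values. `successRatio` is a hypothetical helper, and treating "no calls" as a ratio of 1 is an assumption made for this example only.

```go
package main

import "fmt"

// successRatio returns success / (success + failed).
// Assumption: with no calls at all, the ratio is reported as 1.
func successRatio(success, failed float64) float64 {
	total := success + failed
	if total == 0 {
		return 1
	}
	return success / total
}

func main() {
	// One of four watch attempts failed and was labelled "error".
	fmt.Println(successRatio(3, 1))
	// The same failure labelled "success": the ratio cannot drop.
	fmt.Println(successRatio(4, 0))
}
```

If failed watches are counted under "success", a ratio like this stays at 1 while the apiserver is unreachable, so the SLI cannot detect the outage.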

Other use cases

The metrics are also used in https://monitoring.mixins.dev/kube-state-metrics/

Environment:

kube-state-metrics version: v2.7.0
Kubernetes version (use kubectl version): NA
Cloud provider or hardware configuration: NA
Other info: NA

Oct 18 '23 21:10 courageJ

/assign @rexagod
/triage accepted

Oct 19 '23 16:10 dashpole
