
GET to Cozystack API takes upward of 600ms

Open · lllamnyp opened this issue 2 months ago · 1 comment

I was debugging high API latency on a cluster. Audit logs showed high latency (around 1 s) to the lineage webhook, so I deployed the webhook as a DaemonSet on control-plane nodes and set its internal traffic policy to Local. Latency was still high, so I also pointed the webhook's KUBERNETES_SERVICE_HOST at the node's hostIP. Further inspection showed high latency to the Cozystack API itself: about 600 ms for a GET of a single Cozystack app (e.g. kubectl get redis foo). I then deployed the Cozystack API on control-plane nodes with the same env var fix and set its internal traffic policy to Local as well. Still no cigar.
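
For reference, the mitigations above look roughly like the manifests below. This is only a sketch: the names, namespace, image, ports, and labels are placeholders, not the actual Cozystack resources.

apiVersion: v1
kind: Service
metadata:
  name: cozystack-api          # placeholder name
  namespace: cozy-system       # placeholder namespace
spec:
  internalTrafficPolicy: Local # only route to endpoints on the calling node
  selector:
    app: cozystack-api
  ports:
    - port: 443
      targetPort: 9443
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cozystack-api
  namespace: cozy-system
spec:
  selector:
    matchLabels:
      app: cozystack-api
  template:
    metadata:
      labels:
        app: cozystack-api
    spec:
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""   # pin to control-plane nodes
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: apiserver
          image: example.org/cozystack-api:placeholder   # placeholder image
          env:
            # Talk to the kube-apiserver on the local node instead of the
            # kubernetes.default Service VIP.
            - name: KUBERNETES_SERVICE_HOST
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP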

Cozy API log snippet:

W1017 14:25:27.282475       1 rest.go:681] HelmRelease watcher closed, attempting to re-establish
W1017 14:25:56.396161       1 rest.go:681] HelmRelease watcher closed, attempting to re-establish
E1017 14:26:05.226406       1 wrap.go:53] "Timeout or abort while handling" method="GET" URI="/apis/apps.cozystack.io/v1alpha1" auditID="44149211-4e78-4c28-901c-f0642f5b98d8"
E1017 14:26:05.226445       1 timeout.go:140] "Post-timeout activity" timeElapsed="4.371µs" method="GET" path="/apis/apps.cozystack.io/v1alpha1" result=null
W1017 14:26:08.883489       1 rest.go:681] HelmRelease watcher closed, attempting to re-establish
W1017 14:26:18.669714       1 rest.go:681] HelmRelease watcher closed, attempting to re-establish
W1017 14:27:14.858481       1 rest.go:681] HelmRelease watcher closed, attempting to re-establish
W1017 14:27:19.876521       1 rest.go:681] HelmRelease watcher closed, attempting to re-establish
E1017 14:27:35.228518       1 wrap.go:53] "Timeout or abort while handling" method="GET" URI="/apis/apps.cozystack.io/v1alpha1" auditID="b1ce8b10-f083-47a4-b035-c1537d333e07"
E1017 14:27:35.228567       1 timeout.go:140] "Post-timeout activity" timeElapsed="2.366µs" method="GET" path="/apis/apps.cozystack.io/v1alpha1" result=null
W1017 14:27:56.518420       1 rest.go:681] HelmRelease watcher closed, attempting to re-establish
E1017 14:28:05.225490       1 writers.go:122] "Unhandled Error" err="apiserver was unable to write a JSON response: http2: stream closed"
E1017 14:28:05.225500       1 wrap.go:53] "Timeout or abort while handling" method="GET" URI="/apis/apps.cozystack.io/v1alpha1" auditID="65d0c9d6-414d-4671-a92a-828c7afc7896"
E1017 14:28:05.227531       1 status.go:71] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: &errors.errorString{s:\"http2: stream closed\"}: http2: stream closed"
E1017 14:28:05.228679       1 writers.go:135] "Unhandled Error" err="apiserver was unable to write a fallback JSON response: http: Handler timeout"
E1017 14:28:05.229848       1 timeout.go:140] "Post-timeout activity" timeElapsed="4.311555ms" method="GET" path="/apis/apps.cozystack.io/v1alpha1" result=null
W1017 14:28:16.428004       1 rest.go:681] HelmRelease watcher closed, attempting to re-establish
W1017 14:28:22.460343       1 rest.go:681] HelmRelease watcher closed, attempting to re-establish
W1017 14:29:10.620541       1 rest.go:681] HelmRelease watcher closed, attempting to re-establish
W1017 14:29:20.103433       1 rest.go:681] HelmRelease watcher closed, attempting to re-establish
W1017 14:29:41.521332       1 rest.go:681] HelmRelease watcher closed, attempting to re-establish
W1017 14:32:37.156311       1 rest.go:681] HelmRelease watcher closed, attempting to re-establish
W1017 14:32:40.917998       1 rest.go:681] HelmRelease watcher closed, attempting to re-establish
W1017 14:32:43.048652       1 rest.go:681] HelmRelease watcher closed, attempting to re-establish

Most likely, there's a problem with the HelmRelease watch.

lllamnyp · Oct 17 '25 15:10

The repeated "HelmRelease watcher closed, attempting to re-establish" log entries and the high latency of GET requests point to instability in the HelmRelease watcher implementation. When the underlying watch channel closes, the watcher logs the closure and exits without retrying, which can block or delay API responses until a new watch is created. This pattern can cause timeouts and "http2: stream closed" errors, especially under resource contention or with slow downstream consumers. Improving watcher resilience (for example, adding retry logic or buffering) and monitoring resource usage on the control-plane nodes may help mitigate these issues. For clusters with stretched networking, tuning etcd timeouts can also improve API responsiveness and stability.
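
As an illustration of the retry idea, here is a minimal sketch of a watch loop that re-establishes the HelmRelease watch with exponential backoff instead of giving up when the result channel closes. It uses client-go's dynamic client; the GroupVersionResource, names, and structure are assumptions for the example, not the actual cozystack-api watcher code in rest.go.

// Sketch: re-establish a HelmRelease watch with backoff instead of exiting
// when the channel closes. Illustrative only, not the real implementation.
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// Assumed GVR for Flux HelmReleases; the version may differ per cluster.
var helmReleaseGVR = schema.GroupVersionResource{
	Group:    "helm.toolkit.fluxcd.io",
	Version:  "v2",
	Resource: "helmreleases",
}

func watchHelmReleases(ctx context.Context, client dynamic.Interface) {
	backoff := time.Second
	for {
		w, err := client.Resource(helmReleaseGVR).
			Namespace(metav1.NamespaceAll).
			Watch(ctx, metav1.ListOptions{})
		if err != nil {
			log.Printf("watch failed: %v; retrying in %s", err, backoff)
			select {
			case <-ctx.Done():
				return
			case <-time.After(backoff):
			}
			if backoff < 30*time.Second {
				backoff *= 2 // exponential backoff, capped at 30s
			}
			continue
		}
		backoff = time.Second // reset after a successful (re)connect
		for ev := range w.ResultChan() {
			if ev.Type == watch.Error {
				break // server-side error object; re-establish the watch
			}
			// ... hand the event to whatever caches HelmRelease state ...
		}
		w.Stop()
		log.Print("HelmRelease watcher closed, re-establishing")
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	watchHelmReleases(context.Background(), client)
}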




dosubot[bot] · Oct 17 '25 15:10