Dashboard fails to load when metrics-scraper is overwhelmed
What happened?
When metrics-scraper is overwhelmed and running at 100% of its CPU limit, the dashboard fails to load resources. We started seeing 5xx errors, and users were not able to see pods in the dashboard. The blue spinner would keep spinning. In the network logs, the JSON responses for pods were not coming back before a retry was issued.
What did you expect to happen?
The dashboard would still load but metrics would not appear.
How can we reproduce it (as minimally and precisely as possible)?
Give metrics-scraper a very small CPU limit (10m?) and make a lot of requests to the Kubernetes Dashboard.
Anything else we need to know?
A possible solution would be to forgo metrics but still load other resources when a query to metrics-server times out.
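To make the suggestion concrete, here is a minimal sketch of that fallback. It is not the Dashboard's actual code: `load_pods`, `fake_pods`, `slow_metrics`, `fast_metrics`, and the timeout value are all hypothetical names and numbers chosen for illustration. The idea is only that the metrics query gets its own timeout, and a timeout yields "no metrics" rather than a failed page load.

```python
import asyncio


async def load_pods(fetch_pods, fetch_metrics, metrics_timeout_s):
    """Always return pods; attach metrics only if they arrive in time."""
    # Fetch the pod list concurrently so a slow metrics query cannot delay it.
    pods_task = asyncio.create_task(fetch_pods())
    try:
        metrics = await asyncio.wait_for(fetch_metrics(), timeout=metrics_timeout_s)
    except (asyncio.TimeoutError, OSError):
        # Metrics backend timed out or was unreachable: degrade, do not fail.
        metrics = None
    pods = await pods_task
    return {"pods": pods, "metrics": metrics}


# Hypothetical stand-ins for the real backends (assumptions, not real APIs).
async def fake_pods():
    return ["pod-a", "pod-b"]


async def slow_metrics():
    await asyncio.sleep(1.0)  # simulates an overwhelmed metrics backend
    return {"pod-a": {"cpu": "5m"}}


async def fast_metrics():
    return {"pod-a": {"cpu": "5m"}}


if __name__ == "__main__":
    # With a slow metrics backend the pods still load; metrics is None.
    print(asyncio.run(load_pods(fake_pods, slow_metrics, metrics_timeout_s=0.05)))
```

With this shape, the caller can render the pod list as usual and simply hide the metrics columns whenever `metrics` is `None`.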
What browsers are you seeing the problem on?
Chrome
Kubernetes Dashboard version
dashboard -> v2.6.1, metrics-scraper -> v1.0.8
Kubernetes version
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.13", GitCommit:"80ec6572b15ee0ed2e6efa97a4dcd30f57e68224", GitTreeState:"clean", BuildDate:"2022-05-24T12:34:37Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
Dev environment
No response
Actually, I think the issue here is that the API response time exceeds the frontend refresh timer. Since we use a polling mechanism that refreshes every X seconds, it simply cancels the previous request if there is no response yet and makes a new one. One way to solve that on your side is to go to the Dashboard settings page and increase the auto-refresh time. This is highly recommended for bigger clusters with lots of resources, since the default 5s is often not enough to even load a list of resources.
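The effect described above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the Dashboard frontend code: `poll_with_cancel` and `slow_list_api` are hypothetical names, and the timings are scaled-down, made-up values chosen only so that one refresh interval is shorter than the response time and the other is longer.

```python
import asyncio


async def poll_with_cancel(fetch, refresh_interval_s, cycles):
    """Start one request per refresh tick; cancel it if the next tick arrives first."""
    completed = cancelled = 0
    for _ in range(cycles):
        request = asyncio.create_task(fetch())
        await asyncio.sleep(refresh_interval_s)  # wait for the next refresh tick
        if request.done():
            completed += 1
        else:
            request.cancel()  # response was too slow; the next tick replaces it
            cancelled += 1
    return completed, cancelled


async def slow_list_api():
    # Hypothetical backend whose response takes 0.1 seconds.
    await asyncio.sleep(0.1)
    return ["pod-a", "pod-b"]


if __name__ == "__main__":
    # Refresh interval shorter than the response time: every request is cancelled.
    print(asyncio.run(poll_with_cancel(slow_list_api, 0.02, cycles=3)))
    # Refresh interval longer than the response time: every request completes.
    print(asyncio.run(poll_with_cancel(slow_list_api, 0.3, cycles=3)))
```

When the interval is shorter than the response time, no request ever finishes, which matches the endless spinner in the report. Raising the interval above the response time lets each request complete.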
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
/lifecycle rotten