Dashboard fails to load when metrics-scraper is overwhelmed

What happened?

When metrics-scraper is overwhelmed and running at 100% of its CPU limit, the dashboard fails to load resources. We started seeing 5xx errors, and users were not able to see pods in the dashboard. The blue spinner would just keep spinning. In the network logs, the JSON responses for pods were not coming back in time before a retry was issued.

What did you expect to happen?

The dashboard would still load but metrics would not appear.

How can we reproduce it (as minimally and precisely as possible)?

Give metrics-scraper a very small CPU limit (10m?) and make a lot of requests to the Kubernetes Dashboard.

Anything else we need to know?

A possible solution would be to forgo metrics but still load other resources when a query to metrics-server times out.
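To illustrate the idea, here is a minimal sketch of that fallback, not the Dashboard's actual code. The names `fetch_pods`, `fetch_metrics`, `load_pod_list` and the `METRICS_TIMEOUT_S` budget are all hypothetical stand-ins. The metrics call gets its own timeout, and the pod list is returned without metrics if that timeout is exceeded.

```python
import asyncio

# Hypothetical time budget for the metrics query (an assumption for this sketch).
METRICS_TIMEOUT_S = 0.05


async def fetch_pods():
    # Stand-in for the Kubernetes API list call.
    return ["pod-a", "pod-b"]


async def fetch_metrics(delay_s):
    # Stand-in for a metrics-scraper query; delay_s simulates an overloaded scraper.
    await asyncio.sleep(delay_s)
    return {"pod-a": 12, "pod-b": 7}


async def load_pod_list(metrics_delay_s):
    pods = await fetch_pods()
    try:
        # Bound only the metrics call, so a slow scraper cannot block the list.
        metrics = await asyncio.wait_for(
            fetch_metrics(metrics_delay_s), timeout=METRICS_TIMEOUT_S
        )
    except asyncio.TimeoutError:
        # Forgo metrics but still return the resources.
        metrics = {}
    return [{"name": p, "cpu_millicores": metrics.get(p)} for p in pods]


if __name__ == "__main__":
    print(asyncio.run(load_pod_list(0.0)))  # metrics arrive in time
    print(asyncio.run(load_pod_list(1.0)))  # metrics time out, pods still listed
```

With this shape, an overloaded scraper would only cost the metrics columns, not the whole page.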

What browsers are you seeing the problem on?

Chrome

Kubernetes Dashboard version

dashboard -> v2.6.1, metrics-scraper -> v1.0.8

Kubernetes version

Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.13", GitCommit:"80ec6572b15ee0ed2e6efa97a4dcd30f57e68224", GitTreeState:"clean", BuildDate:"2022-05-24T12:34:37Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

Dev environment

No response

Oct 07 '22 20:10 jrcichra

Actually, I think the issue here is that the API response time exceeds the frontend refresh timer. Since we use a polling mechanism that refreshes every X seconds, it simply cancels the previous request if there is no response and makes a new one. One way to solve that on your side would be to go to the Dashboard settings page and increase the auto-refresh time. This is highly recommended for bigger clusters with lots of resources, since the default 5s is often not enough to even load a list of resources.
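The effect described here can be reproduced with a small simulation. This is only a sketch of a cancel-and-retry poller, not the Dashboard frontend itself; `fetch_list` and `poll` are hypothetical names, and the times are scaled down from the 5s default.

```python
import asyncio


async def fetch_list(response_time_s):
    # Stand-in for the API request; response_time_s simulates backend latency.
    await asyncio.sleep(response_time_s)
    return "pod list"


async def poll(refresh_interval_s, response_time_s, cycles):
    """Return how many requests completed before the next refresh cancelled them."""
    completed = 0
    for _ in range(cycles):
        task = asyncio.create_task(fetch_list(response_time_s))
        await asyncio.sleep(refresh_interval_s)  # wait for the refresh timer
        if task.done():
            completed += 1
        else:
            # No response yet: cancel the request and start over on the next cycle.
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass
    return completed


if __name__ == "__main__":
    # Response slower than the refresh interval: nothing ever completes.
    print(asyncio.run(poll(0.02, 0.2, 3)))
    # Response faster than the refresh interval: every request completes.
    print(asyncio.run(poll(0.02, 0.0, 3)))
```

When the response time exceeds the interval, every request is cancelled before it finishes, which would match the endless spinner. Raising the interval above the response time lets requests complete.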

Oct 08 '22 11:10 floreks

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jan 06 '23 11:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Feb 05 '23 12:02 k8s-triage-robot
