grafana-dashboards-kubernetes
View Pods Dashboard Feature Requests / Issues
RAM Usage Request Gauge: My understanding of requests is that actual usage should closely match them. Being at 90% of the request is not a bad condition, it is a good condition. I think GREEN should be +/- 20% of the request value, the next 20% on either side yellow, and the rest red, since being significantly under or over the request is not ideal. As it is now, if you estimate the request perfectly it shows RED like an error condition, and that is not the case. Only the LIMIT gauge should behave like this (as you get OOM killed).
I think that is wrong; being stable at 90% of request should get me a gold star :)
I'm not sure if the CPU Request gauge needs that as well. If so, maybe its GREEN range should be wider?
Resource by container
Could you add the Actual Usage for CPU and Memory between Request/Limits for each? That would be helpful to show where actual usage falls between the two values.
I think CPU Usage by container and Memory Usage by Container should be renamed to by pod, because if you select a Pod with multiple containers, you do not get a graph with multiple plot lines, which you would expect if it were by container.
NOTE: I played with adding resource requests and limits as plot lines for CPU Usage by Container and Memory Usage by Container, and it looks good for pods with a single container. But once I selected a pod with multiple containers, and thus multiple requests/limits, it became a confusing mess. I don't have the Grafana skills to isolate them properly, but maybe you have some ideas to make that work right. A rough sketch of what I tried is below.
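For reference, here is roughly the kind of queries I was playing with (assuming kube-state-metrics is available for the `kube_pod_container_resource_*` metrics; the `$namespace`/`$pod` variable names are just illustrative):

```promql
# Actual CPU usage, one series per container in the selected pod
sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod="$pod", container!=""}[$__rate_interval])) by (container)

# Per-container CPU requests from kube-state-metrics, added as extra series
sum(kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", resource="cpu"}) by (container)

# Per-container CPU limits
sum(kube_pod_container_resource_limits{namespace="$namespace", pod="$pod", resource="cpu"}) by (container)
```

With a single container this gives three clean lines; with multiple containers every container contributes three series, which is the confusing mess I mentioned.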
Really interesting point!
On my setups, I try to have most of my pods in the 50-80% range, I then consider them to be correctly sized. In my experience, you can start having reliability issues and weird behaviors above 80% resource usage. I also consider pods running under 50% usage to be over-sized.
I decided to go for a "standard" color scheme for these because I think it's what makes sense for most users.
We need to keep in mind that requests could also go above 100% if the limit is higher, so you could have something like red > yellow > green > red,
and I think that can be really confusing for users. We could also argue about the thresholds themselves; this depends on everyone's use cases and policies.
Other ideas would be to use a single color, or another color scheme (not green, yellow and red), but I think it's just a little bit weird... Users like you who know what's best for their use cases will just ignore the color anyway, so it's not a big deal in my opinion.
Keeping it this way is maybe safer for most users, what do you think?
If anyone wants to comment with thoughts or ideas, I think it's a good topic! :blush:
For the second part:
- Yes, it's a good idea to add the real usage in the table; I will make a PR this week to add this.
- On Kubernetes, resources are set per container, not per pod, so I think it can only be "by container". If you have a pod with more than one container, you should get one plot line per container, like this:
- For the last point, I know it can be confusing depending on the pod/container configuration, but I didn't find a way to make it more readable than this.
Good to know:
- I mostly use this dashboard to size my pods based on average or peak usage
- The table can really help you understand what's wrong with your setup (see screenshot above)
- Gauges can be hard to read if requests and limits are not set the same way on all containers
- The requests gauges can disappear if no requests are set (see the sketch below for why)
A nice (but old) thread by @thockin on limits : https://www.reddit.com/r/kubernetes/comments/all1vg/comment/efgyygu/
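To illustrate the last two points, the request gauge boils down to a ratio like the one below (a simplified sketch, not the exact dashboard query). If no container in the pod defines a request, the denominator returns no series, so the whole expression, and therefore the gauge, is empty:

```promql
# Pod CPU usage as a fraction of the pod's total CPU requests.
# When no CPU request is set, the denominator returns nothing and the gauge disappears.
sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod="$pod", container!=""}[$__rate_interval]))
/
sum(kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", resource="cpu"})
```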
> On my setups, I try to have most of my pods in the 50-80% range, I then consider them to be correctly sized. In my experience, you can start having reliability issues and weird behaviors above 80% resource usage. I also consider pods running under 50% usage to be over-sized.
Not clear if you target your pods to be 50-80% of the LIMIT or the REQUEST. I try to target within 20% of the REQUEST as ideal. If usage is constantly over the request (20%+), then I would bump the request up when tuning, as clearly the request I asked for was too low. For the LIMIT, I want to stay within 50-70% as a starting point to avoid OOM kills and leave wiggle room. The ratios I watch are sketched below.
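In query terms, the two ratios I watch look roughly like this (a sketch assuming kube-state-metrics; memory shown, CPU is analogous):

```promql
# Usage as a fraction of the REQUEST: I want this close to 1.0 (within ~20%)
sum(container_memory_working_set_bytes{namespace="$namespace", pod="$pod", container!=""})
/
sum(kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", resource="memory"})

# Usage as a fraction of the LIMIT: I want this around 0.5-0.7 to avoid OOM kills
sum(container_memory_working_set_bytes{namespace="$namespace", pod="$pod", container!=""})
/
sum(kube_pod_container_resource_limits{namespace="$namespace", pod="$pod", resource="memory"})
```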
> I decided to go for a "standard" color scheme for these because I think it's what makes sense for most users. We need to keep in mind that requests could also go above 100% if the limit is higher, so you could have something like red > yellow > green > red, and I think that can be really confusing for users. We could also argue about the thresholds themselves; this depends on everyone's use cases and policies.
I don't think that is confusing. The request number should be the center point of GREEN; left and right of center is an arbitrary number we pick that feels right, say +/- 25% from center. This defines the green area. Then 20% on either side of that would be yellow, and the last 5% on either side is red. If you are significantly under or over the request, that is a problem.
I think it's more confusing now, as new users will see a very good request value shown as RED, be confused, and alter their values to get it GREEN, which really is not what they should be doing.
> Other ideas would be to use a single color, or another color scheme (not green, yellow and red), but I think it's just a little bit weird... Users like you who know what's best for their use cases will just ignore the color anyway, so it's not a big deal in my opinion.
I've been trying to use Goldilocks to get an idea of requests and limits, and its values are all over the map. Pretty much every time you hit refresh you get a different recommendation. I found using your dashboard to be WAY easier to tune with. It's just that the request colors are off; you need to know that and not base your tuning on the colors. But if we can correct the colors, I think it would be an excellent tool for this.
> Keeping it this way is maybe safer for most users, what do you think?
I think no color vs. the current color pattern is safer. The way it is now, I think, encourages the wrong action to make it green. But I don't want no color :(
This is how I think it should look:
You're a bit over, still ok, should not be red:
Significantly under should indicate you can improve:
For the above, the changes I made to the gauge:
- Standard Options
  - Min: auto (but zero looks good too, not sure of the difference)
  - Max: 2
  - Decimals: 1

And the thresholds:
I'd also like to see a timeline graph of CPU and RAM usage, each plotted with its respective request/limit lines. This would allow an overall view over time (last 1 hour, 6 hours, 2 days, etc.), something like the queries sketched below.
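For memory, such a panel could be driven by something like this (same sketch-level assumptions as above), with the request and limit drawn as flat reference lines over the usage curve:

```promql
# Pod memory working set over time
sum(container_memory_working_set_bytes{namespace="$namespace", pod="$pod", container!=""})

# Pod total memory request (flat reference line)
sum(kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", resource="memory"})

# Pod total memory limit (flat reference line)
sum(kube_pod_container_resource_limits{namespace="$namespace", pod="$pod", resource="memory"})
```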
Thank you for this @reefland, you just shared many good points and ideas! I'm still unsure about requests, to be honest, because it highly depends on how you manage your Kubernetes resources (requests = limits, requests < limits...), so I would keep them neutral for now, but we can still iterate on this.
I just created a new version (didn't commit yet):
- Switched to blue color for requests (pod total) and left limits with green, yellow & red
- Added "Used" CPU & Memory in the table
- Added 2 new panels with % usage on requests & limits with thresholds as colored areas
The rest of the dashboard is left unchanged.
Used 20%, 30%, 70% & 80% as thresholds, as I think it's pretty conservative.
What do you think?
Screenshots:
Yeah! These look neat! Look forward to trying them.
Just pushed the new version, try it and let me know what you think. Maybe we can do a pros/cons list for the requests colors?
ok, I'll check it out this weekend!
Do you have any way to determine if request = limit then make it blue, otherwise use color scale like something I suggested?
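Detecting the request = limit case is easy enough with a query like the sketch below (using the kube-state-metrics resource metrics), though I'm not sure Grafana can switch a gauge's color scheme based on it; it might need a separate panel or a field override:

```promql
# 1 if the pod's total CPU request equals its total CPU limit, 0 otherwise
sum(kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", resource="cpu"})
== bool
sum(kube_pod_container_resource_limits{namespace="$namespace", pod="$pod", resource="cpu"})
```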
I need to figure out this missing `image=` key. As-is, I get nothing. I'll have to re-work each gauge to remove that reference.

*sigh*, another issue: besides not having `image=`, my metrics also do not have `container=`.

The query `container_cpu_usage_seconds_total{namespace="mosquitto", pod="mosquitto-mqtt-0"}` yields:

```
container_cpu_usage_seconds_total{cpu="total", endpoint="https-metrics", id="/kubepods/burstable/podcc153a2a-d87e-4b18-b37b-159fa6907cd4", instance="k3s02", job="kubelet", metrics_path="/metrics/cadvisor", namespace="mosquitto", node="k3s02", pod="mosquitto-mqtt-0", service="prometheus-kubelet"}
```

which means the following returns an empty set when grouping `by (container)`:

```
sum(rate(container_cpu_usage_seconds_total{namespace="mosquitto", pod="mosquitto-mqtt-0"}[1m])) by (container)
```
Ok, I think it's time to run a copy of your k3s setup to solve both of these. I'll do my best to do it this week or during the weekend. Will keep you updated, hopefully with a fix.
We'll keep this issue on topic. Investigation on missing labels will be in https://github.com/dotdc/grafana-dashboards-kubernetes/issues/18
Did you manage to test the latest version? I think it now includes most of what we discussed in this issue. Let me know.
Nah... without the container level metrics I can't really test it properly.
Hope you will find a solution to get this working on your setup :crossed_fingers: Thanks again for your time and ideas on this! Closing this issue.