
Is workingSetBytes of 0 really an indication of a terminated process?

Open SergeyKanzhelev opened this issue 2 years ago • 8 comments

What happened:

We sometimes observe that metrics stop flowing for a Pod when one of its containers starts reporting 0 as working_set_bytes.

The container in question does nothing but sleep for hours and then check on some files to do some work. We do not have direct access to the repro.

The check that discards Pod metrics in this case is here:

https://github.com/kubernetes-sigs/metrics-server/blob/ffbdb6fe2f56848dd022ba4efc9367c57006363c/pkg/scraper/client/resource/decode.go#L195-L198

This code was introduced by https://github.com/kubernetes-sigs/metrics-server/pull/759 to filter out situations where the container has already terminated. I would question the reasoning for throwing the entire Pod's stats away when a single container is terminated. How would it work for Jobs with two non-restartable containers where one has already terminated?
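For reference, the effect of that check can be sketched like this. This is a simplified illustration with hypothetical types, not the actual decode.go code: one zero reading in any container causes the entire pod sample to be dropped.

```go
package main

import "fmt"

// containerSample is a hypothetical stand-in for the per-container
// values metrics-server parses from the kubelet.
type containerSample struct {
	name            string
	cumulativeCPU   uint64 // cumulative CPU usage, nanoseconds
	workingSetBytes uint64
}

// podSampleValid mirrors the all-or-nothing behavior discussed above:
// a zero CPU or zero working-set reading in ANY container invalidates
// the whole pod's metrics.
func podSampleValid(containers []containerSample) bool {
	for _, c := range containers {
		if c.cumulativeCPU == 0 || c.workingSetBytes == 0 {
			return false
		}
	}
	return true
}

func main() {
	pod := []containerSample{
		{name: "app", cumulativeCPU: 5_000_000, workingSetBytes: 64 << 20},
		{name: "sidecar", cumulativeCPU: 1_000, workingSetBytes: 0}, // sleeping helper
	}
	// The healthy "app" container's metrics are dropped along with the pod.
	fmt.Println(podSampleValid(pod))
}
```

This is exactly the Jobs scenario above: one terminated (or merely idle) container takes the whole pod's metrics with it.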

I also found similar logic and this comment in the heapster repo: https://github.com/kubernetes-retired/heapster/pull/1708#discussion_r126522877 which says (as expected) that CPU alone is a bad indicator of an unhealthy container. Both memory and CPU need to be checked, while this PR: https://github.com/kubernetes-sigs/metrics-server/pull/759 ignores a container if either is zero.

What you expected to happen:

0 workingSetBytes should not result in missing Pod measurements.

Anything else we need to know?:

One question I still cannot confirm or repro with confidence is whether 0 workingSetBytes is a legitimate situation. I tried to create a small container under memory pressure, but wasn't able to drive workingSetBytes = usage - total_inactive_memory to 0. It went quite low, but not to zero. If we have a repro for this, then the behavior in this repo is definitely incorrect.

If this is impossible, then the logic may be legitimate and there is a bug somewhere else. I am opening this issue for discussion, in case community wisdom can help understand this better.

/sig node

Environment:

  • Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.):

  • Container Network Setup (flannel, calico, etc.):

  • Kubernetes version (use kubectl version):

  • Metrics Server manifest:

  • Kubelet config:

  • Metrics server logs:

  • Status of Metrics API:
kubectl describe apiservice v1beta1.metrics.k8s.io

/kind bug

SergeyKanzhelev avatar Sep 15 '23 20:09 SergeyKanzhelev

This is how working set bytes are calculated: https://github.com/google/cadvisor/blob/fbd519ba03978d54cb54ea7ed8ab9d6e3dd64590/container/libcontainer/handler.go#L831-L844
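The linked cadvisor code boils down to roughly the following. This is a hedged sketch of the derivation, not the exact handler.go code: working set is total usage minus inactive memory, clamped at zero, so 0 is at least arithmetically reachable whenever the inactive counter catches up with usage.

```go
package main

import "fmt"

// workingSetBytes sketches the derivation in the linked cadvisor handler:
// usage minus inactive memory, clamped so it never goes negative.
// Note the clamp means 0 is a reachable value whenever the inactive
// counter is at least as large as usage.
func workingSetBytes(usage, inactive uint64) uint64 {
	if inactive >= usage {
		return 0
	}
	return usage - inactive
}

func main() {
	fmt.Println(workingSetBytes(100<<20, 30<<20)) // typical case: positive working set
	fmt.Println(workingSetBytes(10<<20, 10<<20))  // clamps to 0
}
```

So the question from the issue body becomes: can the kernel actually report inactive memory greater than or equal to usage for a live (e.g. long-sleeping) container, or does a 0 here only ever mean the container is gone?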

SergeyKanzhelev avatar Sep 15 '23 20:09 SergeyKanzhelev

/assign @dgrisonnet
/triage accepted

Thanks @SergeyKanzhelev

dashpole avatar Sep 21 '23 16:09 dashpole

I'd add the question of whether metrics-server should be the one deciding when a pod is terminated. I would have assumed it determines that based on kubelet container status, not based on metrics. I'm not sure whether that's just not possible, or why it wasn't chosen back then.

kwiesmueller avatar Sep 29 '23 22:09 kwiesmueller

I think it is because the experience is based on parsing Prometheus metrics. I'm not sure if pod status is available.

We had a short discussion with @mrunalp. We still need to understand exactly how this happens. Once we know, and we confirm that this is expected, we may need to expose additional metrics to help understand what that 0 means. But again, we will need to fully understand this first.

SergeyKanzhelev avatar Sep 29 '23 23:09 SergeyKanzhelev

/cc @serathius

pacoxu avatar Oct 04 '23 11:10 pacoxu

Metrics server treats 0 in WSS as undefined behavior and skips reporting the pod, to avoid the autoscaler making an incorrect decision. Better not to make a decision than to make a bad one.

serathius avatar Oct 04 '23 12:10 serathius

@serathius I think that makes sense in principle. However, in practice we have seen a case where a single container with workingSetBytes = 0 (a container that is not part of the main application and really is not relevant to the overall pod performance) prevents the entire pod from being reported and therefore prevents autoscaling.

I'm curious if any progress has been made on the investigation here or if we have a theory as to how this can happen?

Maybe a middle ground here is to just drop the single bad container's metrics?
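That middle ground could look something like the following. This is purely a sketch of the proposal, not existing metrics-server behavior, using hypothetical types: keep the pod sample and drop only the containers with zero readings.

```go
package main

import "fmt"

// containerSample is a hypothetical stand-in for per-container metrics.
type containerSample struct {
	name            string
	cumulativeCPU   uint64
	workingSetBytes uint64
}

// dropZeroContainers keeps the pod's sample but filters out only the
// containers whose CPU or working set reads as zero, instead of
// discarding the whole pod as metrics-server does today.
func dropZeroContainers(containers []containerSample) []containerSample {
	kept := make([]containerSample, 0, len(containers))
	for _, c := range containers {
		if c.cumulativeCPU != 0 && c.workingSetBytes != 0 {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	pod := []containerSample{
		{name: "app", cumulativeCPU: 5_000_000, workingSetBytes: 64 << 20},
		{name: "sidecar", cumulativeCPU: 1_000, workingSetBytes: 0},
	}
	// Only "app" survives; the pod is still reported.
	for _, c := range dropZeroContainers(pod) {
		fmt.Println(c.name)
	}
}
```

The trade-off is that the pod's reported total would undercount the filtered container, which matters for pod-level resource metrics but arguably less than dropping the pod entirely.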

raywainman avatar Oct 11 '23 19:10 raywainman

It's just the status quo. If someone can bring a well-documented case showing that the kernel can report workingSetBytes=0 as a correct state, we can switch it. We could also consider leaving the decision up to the user.

On the other hand, what will the HPA do if there are 10 pods under it and one of them has an issue? Can the HPA make a decision based on the remaining 9 pods?

serathius avatar Oct 12 '23 13:10 serathius