
OOM metrics undetected

Open mikelo opened this issue 3 years ago • 8 comments

What happened: I wanted to deliberately create a pod that would go "out of memory", but it seems to run fine.

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl patch deployment metrics-server -n kube-system -p '{"spec":{"template":{"spec":{"containers":[{"name":"metrics-server","args":["--cert-dir=/tmp", "--secure-port=4443", "--kubelet-insecure-tls","--kubelet-preferred-address-types=InternalIP"]}]}}}}'
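Before testing, metrics-server can be confirmed healthy with a quick sanity check (a sketch, assuming the default deployment name in kube-system):

kubectl -n kube-system rollout status deployment/metrics-server
kubectl top nodes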

Then apply the following deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: oomkilled
spec:
  replicas: 1
  selector:
    matchLabels:
      app: oomkilled
  template:
    metadata:
      labels:
        app: oomkilled
    spec:
      containers:
      - image: gcr.io/google-containers/stress:v1
        name: stress
        command: [ "/stress"]
        args: 
          - "--mem-total"
          - "104858000"
          - "--logtostderr"
          - "--mem-alloc-size"
          - "10000000"
        resources:
          requests:
            memory: 1Mi
            cpu: 5m
          limits:
            memory: 20Mi

What you expected to happen: the pod should switch to status "OOMKilled" right after starting, but instead it runs fine
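One way to check whether the kernel actually OOM-killed the container is to look at its last terminated state (a sketch; the selector matches the deployment above):

kubectl get pods -l app=oomkilled -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.reason}'

If the container had been killed and restarted, this should print OOMKilled; here it prints nothing because the container just keeps running.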

Anything else we need to know?: I created a sister issue https://github.com/kubernetes-sigs/kind/issues/2848 which I should close soon.

Environment:

  • Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.): kind v0.14.0 go1.18.2 linux/amd64

  • Container Network Setup (flannel, calico, etc.):

  • Kubernetes version (use kubectl version):

    WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
    Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3", GitCommit:"aef86a93758dc3cb2c658dd9657ab4ad4afc21cb", GitTreeState:"clean", BuildDate:"2022-07-13T14:30:46Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
    Kustomize Version: v4.5.4
    Server Version: version.Info{Major:"1", Minor:"25+", GitVersion:"v1.25.0-alpha.0.881+7c127b33dafc53", GitCommit:"7c127b33dafc530f7ca0c165ddb47db86eb45880", GitTreeState:"clean", BuildDate:"2022-07-26T08:01:01Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}

  • Metrics Server manifest:

  • Kubelet config:

  • Metrics Server logs:

I0801 14:51:58.441090 1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0801 14:51:59.193821 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0801 14:51:59.193841 1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0801 14:51:59.193885 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0801 14:51:59.193914 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0801 14:51:59.193916 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0801 14:51:59.193930 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0801 14:51:59.194245 1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0801 14:51:59.194357 1 secure_serving.go:266] Serving securely on [::]:4443
I0801 14:51:59.194397 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
W0801 14:51:59.194531 1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
I0801 14:51:59.294869 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0801 14:51:59.294910 1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0801 14:51:59.294938 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file

  • Status of Metrics API:
kubectl describe apiservice v1beta1.metrics.k8s.io

Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       k8s-app=metrics-server
Annotations:
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2022-08-01T14:50:58Z
  Managed Fields:
    API Version:  apiregistration.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
        f:labels:
          .:
          f:k8s-app:
      f:spec:
        f:group:
        f:groupPriorityMinimum:
        f:insecureSkipTLSVerify:
        f:service:
          .:
          f:name:
          f:namespace:
          f:port:
        f:version:
        f:versionPriority:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2022-08-01T14:50:58Z
    API Version:  apiregistration.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          .:
          k:{"type":"Available"}:
            .:
            f:lastTransitionTime:
            f:message:
            f:reason:
            f:status:
            f:type:
    Manager:      kube-apiserver
    Operation:    Update
    Subresource:  status
    Time:         2022-08-01T14:52:17Z
  Resource Version:  124754
  UID:               ecaf9501-794b-4841-9f33-fd25e7c2cd71
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:       metrics-server
    Namespace:  kube-system
    Port:       443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2022-08-01T14:52:17Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:

/kind bug

mikelo avatar Aug 02 '22 08:08 mikelo

Sorry, I don't understand how this issue is related to metrics-server. Do you mean that when the pod uses more memory than its limit, its status should be OOMKilled? That is not a function of metrics-server.

yangjunmyfm192085 avatar Aug 02 '22 08:08 yangjunmyfm192085

/kind support /remove-kind bug

yangjunmyfm192085 avatar Aug 02 '22 08:08 yangjunmyfm192085

/cc @sanwishe See if it is related to kubelet?

yangjunmyfm192085 avatar Aug 02 '22 08:08 yangjunmyfm192085

@mikelo Could you please provide the actual memory utilization of the stress container?

sanwishe avatar Aug 02 '22 10:08 sanwishe

kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/" | jq

{
  "kind": "PodMetricsList",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "metadata": {
        "name": "oomkilled-85d9cf68b6-wfrt9",
        "namespace": "default",
        "creationTimestamp": "2022-08-02T15:16:21Z",
        "labels": {
          "app": "oomkilled",
          "pod-template-hash": "85d9cf68b6"
        }
      },
      "timestamp": "2022-08-02T15:16:00Z",
      "window": "15.953s",
      "containers": [
        {
          "name": "stress",
          "usage": {
            "cpu": "0",
            "memory": "19732Ki"
          }
        }
      ]
    }
  ]
}

mikelo avatar Aug 02 '22 15:08 mikelo

I think the memory usage reported here is about 19Mi, which is still under the 20Mi limit, so there is no OOM. In any case, this behavior is not controlled by metrics-server.
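For reference, a quick way to put the reported usage next to the configured limit (a sketch using the same raw metrics endpoint and the deployment's app=oomkilled label):

kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/" | jq -r '.items[].containers[] | "\(.name): \(.usage.memory)"'
kubectl get pods -l app=oomkilled -o jsonpath='{.items[0].spec.containers[0].resources.limits.memory}'

The first command should show roughly 19732Ki of usage and the second the 20Mi limit, so the container sits just below the threshold instead of crossing it.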

yangjunmyfm192085 avatar Aug 02 '22 15:08 yangjunmyfm192085

Hi @mikelo!

I wanted to deliberately create a pod that would go "out of memory"

I used a classic example from the Kubernetes docs, ran it both on minikube and on kind, and I reproduced your issue.

While minikube shows the OOMKilled status, kind somehow keeps the pod in the Running state.

1/ minikube - oomkilled as expected

$ minikube start --driver=kvm2
😄  minikube v1.25.2 on Fedora 36
- snip -

$ kubectl create namespace mem-example
namespace/mem-example created

$ kubectl apply -f https://k8s.io/examples/pods/resource/memory-request-limit-2.yaml --namespace=mem-example
pod/memory-demo-2 created

$ kubectl get pods -n mem-example
NAME            READY   STATUS      RESTARTS      AGE
memory-demo-2   0/1     OOMKilled   2 (23s ago)   30s

2/ kind - the pod is running

$ kind create cluster
enabling experimental podman provider
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.24.0) 🖼 
- snip -
 
$ kubectl create namespace mem-example
namespace/mem-example created

$ kubectl apply -f https://k8s.io/examples/pods/resource/memory-request-limit-2.yaml --namespace=mem-example
pod/memory-demo-2 created

$ kubectl get pods -n mem-example
NAME            READY   STATUS    RESTARTS   AGE
memory-demo-2   1/1     Running   0          12s

I believe that the issue is not related to metrics-server but probably related to kind, so you could close this issue and continue the discussion in the kind repository.
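To double-check on the kind side, one thing worth comparing (a sketch, assuming the same memory-demo-2 pod) is the restart count and last terminated reason, which on minikube should come back as OOMKilled and on kind stays empty:

$ kubectl get pod memory-demo-2 -n mem-example \
    -o jsonpath='{.status.containerStatuses[0].restartCount} {.status.containerStatuses[0].lastState.terminated.reason}'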

tkrishtop avatar Aug 02 '22 16:08 tkrishtop

Yes, I agree, but in theory the application should allocate about 104 MB and hence go OOM... if this is a kind issue, I should close this one.
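For reference: 104858000 bytes / 1024 / 1024 ≈ 100 MiB, roughly five times the 20Mi limit, so the allocation should blow past the limit almost immediately.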

mikelo avatar Aug 04 '22 13:08 mikelo

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 02 '22 14:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Dec 02 '22 14:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jan 01 '23 15:01 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 01 '23 15:01 k8s-ci-robot