
VPA doesn't provide any recommendations when a Pod is in an OOMKill CrashLoopBackOff right after start

Open voelzmo opened this issue 2 years ago • 8 comments

Which component are you using?: vertical-pod-autoscaler

What version of the component are you using?:

Component version: 0.10.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version --short
Client Version: v1.24.2
Kustomize Version: v4.5.4
Server Version: v1.23.4

What environment is this in?:

What did you expect to happen?: VPA should be able to help with Pods which are in an OOMKill CrashLoopBackOff and raise Limits/Requests until the workload is running.

What happened instead?: VPA did not provide a single recommendation for a Pod that goes into an OOMKill CrashLoopBackOff right from the start

How to reproduce it (as minimally and precisely as possible): Create a deployment that will be OOMKilled right after starting

apiVersion: apps/v1
kind: Deployment
metadata:
  name: oomkilled
spec:
  replicas: 1
  selector:
    matchLabels:
      app: oomkilled
  template:
    metadata:
      labels:
        app: oomkilled
    spec:
      containers:
      - image: gcr.io/google-containers/stress:v1
        name: stress
        command: [ "/stress"]
        args: 
          - "--mem-total"
          - "104858000"
          - "--logtostderr"
          - "--mem-alloc-size"
          - "10000000"
        resources:
          requests:
            memory: 1Mi
            cpu: 5m
          limits:
            memory: 20Mi
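
(For context: --mem-total is 104858000 bytes, i.e. roughly 100Mi, while the memory limit is only 20Mi, so the container blows past its limit almost immediately after starting and is OOMKilled on every restart – see the container status below.)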

Look at the container

(...)
    State:          Waiting                                                                                                                                                                                                                                                                    
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 20 Jun 2022 16:56:47 +0200
      Finished:     Mon, 20 Jun 2022 16:56:48 +0200
    Ready:          False
    Restart Count:  5
(...)

Create a VPA object for this deployment

apiVersion: "autoscaling.k8s.io/v1"
kind: VerticalPodAutoscaler
metadata:
  name: oomkilled-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: oomkilled
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: 5m
          memory: 10Mi
        maxAllowed:
          cpu: 1
          memory: 5Gi
        controlledResources: ["cpu", "memory"]

VPA does observe the corresponding OOMKill events in the Recommender logs

I0620 14:55:04.340502       1 cluster_feeder.go:465] OOM detected {Timestamp:2022-06-20 14:53:52 +0000 UTC Memory:1048576 ContainerID:{PodID:{Namespace:default PodName:oomkilled-6868f896d6-6vfqm} ContainerName:stress}}
I0620 14:55:04.340545       1 cluster_feeder.go:465] OOM detected {Timestamp:2022-06-20 14:54:08 +0000 UTC Memory:1048576 ContainerID:{PodID:{Namespace:default PodName:oomkilled-6868f896d6-6vfqm} ContainerName:stress}}
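
To illustrate what these log lines correspond to conceptually, here is a minimal, self-contained Go sketch of the general idea: an observed OOMKill is converted into a synthetic memory sample, bumped up beyond the memory the container died at, and fed into the memory aggregation. All names and constants below are illustrative assumptions, not the actual VPA API.

package main

import (
    "fmt"
    "time"
)

// Illustrative types only – NOT the actual VPA model package.
type ContainerID struct {
    Namespace, PodName, ContainerName string
}

type MemorySample struct {
    Time  time.Time
    Bytes int64
    IsOOM bool
}

// Hypothetical bump-up policy: assume the killed container needs noticeably
// more memory than the amount it died at, so the next recommendation has a
// chance to escape the OOM loop.
const (
    oomBumpUpRatio = 1.2               // grow the observed memory by 20%...
    oomMinBumpUp   = 100 * 1024 * 1024 // ...but by at least 100MiB
)

// recordOOM turns an OOMKill observation into a synthetic memory sample.
func recordOOM(ts time.Time, memoryAtKill int64) MemorySample {
    bumped := int64(float64(memoryAtKill) * oomBumpUpRatio)
    if floor := memoryAtKill + oomMinBumpUp; bumped < floor {
        bumped = floor
    }
    return MemorySample{Time: ts, Bytes: bumped, IsOOM: true}
}

func main() {
    id := ContainerID{Namespace: "default", PodName: "oomkilled-6868f896d6-6vfqm", ContainerName: "stress"}
    s := recordOOM(time.Now(), 1048576) // 1Mi, as in the log lines above
    fmt.Printf("OOM detected for %s/%s (%s): synthetic memory sample of %d bytes\n",
        id.Namespace, id.PodName, id.ContainerName, s.Bytes)
}

Note that the Memory value in the logs above, 1048576 bytes, is exactly the 1Mi memory request from the deployment manifest.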

VPA Status is empty

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"autoscaling.k8s.io/v1","kind":"VerticalPodAutoscaler","metadata":{"annotations":{},"name":"oomkilled-vpa","namespace":"default"},"spec":{"resourcePolicy":{"containerPolicies":[{"containerName":"*","controlledResources":["cpu","memory"],"maxAllowed":{"cpu":1,
"memory":"5Gi"},"minAllowed":{"cpu":"5m","memory":"10Mi"}}]},"targetRef":{"apiVersion":"apps/v1","kind":"Deployment","name":"oomkilled"}}}
  creationTimestamp: "2022-06-20T14:54:16Z"
  generation: 2
  name: oomkilled-vpa
  namespace: default
  resourceVersion: "299374"
  uid: f47d84a8-aa6e-4042-b0a4-723888720a9d
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      controlledResources:
      - cpu
      - memory
      maxAllowed:
        cpu: 1
        memory: 5Gi
      minAllowed:
        cpu: 5m
        memory: 10Mi
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: oomkilled
  updatePolicy:
    updateMode: Auto
status:
  conditions:
  - lastTransitionTime: "2022-06-20T14:55:04Z"
    status: "False"
    type: RecommendationProvided
  recommendation: {}

VPACheckpoint doesn't record any measurements

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscalerCheckpoint
metadata:
  creationTimestamp: "2022-06-20T14:55:04Z"
  generation: 24
  name: oomkilled-vpa-stress
  namespace: default
  resourceVersion: "304997"
  uid: 127a6331-7d1d-4ea6-b56a-63db3ee07a51
spec:
  containerName: stress
  vpaObjectName: oomkilled-vpa
status:
  cpuHistogram:
    referenceTimestamp: null
  firstSampleStart: null
  lastSampleStart: null
  lastUpdateTime: "2022-06-20T15:18:04Z"
  memoryHistogram:
    referenceTimestamp: "2022-06-22T00:00:00Z"
  version: v3
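
(Note that the memory histogram does get a referenceTimestamp – presumably because the OOM-derived samples land in the memory peaks aggregation – while firstSampleStart, lastSampleStart and the CPU histogram stay empty, consistent with no regular usage sample ever being recorded for this container.)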

The Pod in CrashLoopBackOff doesn't have any PodMetrics, whereas other Pods do have metrics

k get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/" | jq
{
  "kind": "PodMetricsList",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "metadata": {
        "name": "hamster-96d4585b7-b9tl9",
        "namespace": "default",
        "creationTimestamp": "2022-06-20T15:24:37Z",
        "labels": {
          "app": "hamster",
          "pod-template-hash": "96d4585b7"
        }
      },
      "timestamp": "2022-06-20T15:24:01Z",
      "window": "56s",
      "containers": [
        {
          "name": "hamster",
          "usage": {
            "cpu": "498501465n",
            "memory": "512Ki"
          }
        }
      ]
    },
    {
      "metadata": {
        "name": "hamster-96d4585b7-c44j7",
        "namespace": "default",
        "creationTimestamp": "2022-06-20T15:24:37Z",
        "labels": {
          "app": "hamster",
          "pod-template-hash": "96d4585b7"
        }
      },
      "timestamp": "2022-06-20T15:24:04Z",
      "window": "57s",
      "containers": [
        {
          "name": "hamster",
          "usage": {
            "cpu": "501837091n",
            "memory": "656Ki"
          }
        }
      ]
    }
  ]
}

Anything else we need to know?: On the same cluster, the hamster example works perfectly fine and gets recommendations as expected, so this is not a general issue with the VPA.

Just for fun, I applied the patch below, which increments TotalSamplesCount whenever a memory sample (and therefore also an OOMKill sample) is added. With this change, the Pod above gets a recommendation and can run normally, as expected. I understand the real fix can't be this simple – it would count two samples for every regular PodMetric (which contains both CPU and memory), and existing code presumably assumes otherwise – but it does show that TotalSamplesCount seems to be the blocker in this situation.

diff --git a/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go b/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go
index 3facbe37e..7accd072e 100644
--- a/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go
+++ b/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go
@@ -184,6 +184,7 @@ func (a *AggregateContainerState) AddSample(sample *ContainerUsageSample) {
        case ResourceCPU:
                a.addCPUSample(sample)
        case ResourceMemory:
+               a.TotalSamplesCount++
                a.AggregateMemoryPeaks.AddSample(BytesFromMemoryAmount(sample.Usage), 1.0, sample.MeasureStart)
        default:
                panic(fmt.Sprintf("AddSample doesn't support resource '%s'", sample.Resource))
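
To make the effect of that one-line change more tangible, here is a small, self-contained Go sketch – with illustrative names, not the actual recommender code – of an aggregation that counts only CPU samples towards its total sample count. Since the crash-looping Pod has no PodMetrics (see above), it never produces CPU samples, the count stays at zero, and a gate of this kind would keep RecommendationProvided at False; the exact gating logic in the recommender may of course differ.

package main

import "fmt"

// Illustrative aggregation – NOT the actual AggregateContainerState.
type aggregation struct {
    totalSamplesCount int       // bumped only for CPU samples (pre-patch behaviour)
    cpuSamples        []float64 // stands in for the CPU histogram
    memoryPeaks       []int64   // stands in for the memory peaks histogram
}

func (a *aggregation) addCPUSample(cores float64) {
    a.cpuSamples = append(a.cpuSamples, cores)
    a.totalSamplesCount++
}

func (a *aggregation) addMemorySample(bytes int64) {
    a.memoryPeaks = append(a.memoryPeaks, bytes)
    // NOTE: totalSamplesCount is deliberately NOT incremented here – this
    // mirrors the behaviour that the patch above changes.
}

// hasRecommendation stands in for whatever gate decides that the aggregation
// has seen "real" usage data and may produce a recommendation.
func (a *aggregation) hasRecommendation() bool {
    return a.totalSamplesCount > 0
}

func main() {
    a := &aggregation{}
    // A crash-looping Pod never shows up in PodMetrics, so the only input the
    // recommender gets for it are OOM-derived memory samples:
    a.addMemorySample(1048576)
    a.addMemorySample(1048576)
    fmt.Println("recommendation provided:", a.hasRecommendation()) // prints: false
}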

voelzmo avatar Jun 20 '22 15:06 voelzmo

ping @jbartosik that's what I was mentioning in today's SIG call

voelzmo avatar Jun 20 '22 15:06 voelzmo

maybe a.TotalSamplesCount++ should run when OOM is detected...

mikelo avatar Jun 23 '22 14:06 mikelo

I think I saw this problem some time ago, when I was implementing OOM tests for VPA.

The test didn't work if memory usage grew too quickly – pods were OOMing but VPA wasn't increasing its recommendation.

My plan is:

  • Locally modify the e2e to grow memory usage very quickly, verify that VPA doesn't grow the recommendation,
  • Add logging to VPA recommender to see if it's getting information about OOMs (I think here)
  • If we get information but it doesn't affect recommendation then debug why (I think this is the most likely case),
  • If we don't get the information read up / ask about how we could get it,
  • If the test passes even when it grows memory usage very quickly then figure out how it's different from your situation.

I'll be away for the next 2 weeks. I'll only be able to start doing this when I'm back.

jbartosik avatar Jul 01 '22 09:07 jbartosik

Ah, it's good to hear you already saw something similar!

My plan is:

  • Locally modify the e2e to grow memory usage very quickly, verify that VPA doesn't grow the recommendation,
  • Add logging to VPA recommender to see if it's getting information about OOMs (I think here)
  • If we get information but it doesn't affect recommendation then debug why (I think this is the most likely case),
  • If we don't get the information read up / ask about how we could get it,
  • If the test passes even when it grows memory usage very quickly then figure out how it's different from your situation.

I'll be away for the next 2 weeks. I'll only be able to start doing this when I'm back.

I can also take some time to do this – I don't think the scenario should be too far away from my repro case above. The modifications to the existing OOMObserver make sense to verify that the correct information is really there. In my repro case above, I took the logs here as sufficient evidence that the VPA sees the OOM events with the right amount of memory, and the fact that adding TotalSamplesCount++ led to the correct recommendation showed that the information in the OOM events was as expected.

voelzmo avatar Jul 04 '22 11:07 voelzmo

Adapted the existing OOMKill test so that the Pods run into OOMKills more quickly and eventually end up in a CrashLoopBackOff: https://github.com/kubernetes/autoscaler/pull/5028

voelzmo avatar Jul 14 '22 08:07 voelzmo

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 12 '22 09:10 k8s-triage-robot

/remove-lifecycle stale

jbartosik avatar Oct 12 '22 09:10 jbartosik

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 10 '23 10:01 k8s-triage-robot

/remove-lifecycle stale

voelzmo avatar Jan 11 '23 16:01 voelzmo

/remove-lifecycle stale

runningman84 avatar Aug 29 '23 10:08 runningman84