
VPA Recommender fails to record `OOM` and increase resources due to `KeyError`

Open nikimanoledaki opened this issue 2 years ago • 24 comments

Which component are you using?: vertical-pod-autoscaler

What version of the component are you using?: Component version: 1.0.0 | https://artifacthub.io/packages/helm/fairwinds-stable/vpa/4.0.1

What k8s version are you using (kubectl version)?: 1.27

What environment is this in?: AKS

What did you expect to happen?: Expected VPA to recover from an OOM issue by raising the resource limits.

What happened instead?:

We had a rather disruptive OOM loop that lasted for an hour (until the VPA CR's auto updates were disabled). It involved the VPA Updater & Recommender repeatedly trying and failing to react to a VPA target Pod being OOMKilled: VPA recommended a very low memory request and limit for the Pod, which was immediately OOM'ed, evicted, OOM'ed again, evicted again, and so on. VPA should have been able to end this loop by raising the resource limit, but it couldn't, since by then the Pod was non-existent.

Here is the sequence of events:

  1. Updater accepts Pod for update. update_priority_calculator.go:143] pod accepted for update <ns>/<pod> with priority 0.2350253824929428

  2. Updater evicts Pod to apply resource recommendation. updater.go:215] evicting pod <pod>-69856fc5f7-m848k

  3. Recommender deletes Pod. cluster_feeder.go:401] Deleting Pod {<ns> <pod>-69856fc5f7-m848k}

  4. Recommender detects OOM in Pod. cluster_feeder.go:445] OOM detected {Timestamp:<date> 11:08:54 +0000 UTC Memory:104857600 ContainerID:{PodID:{Namespace:<ns> PodName:<pod>} ContainerName:<container>}}

  5. Updater detects quick OOM in Pod. update_priority_calculator.go:114] quick OOM detected in pod <ns>/<pod>, container <container>

  6. Recommender deletes Pod (again). cluster_feeder.go:401] Deleting Pod {<ns> <pod>-69856fc5f7-d8hvv}

  7. Recommender detects OOM in Pod (again). cluster_feeder.go:445] OOM detected {Timestamp:<date> 11:08:56 +0000 UTC Memory:104857600 ContainerID:{PodID:{Namespace:<ns> PodName:<pod>} ContainerName:<container>}}

  8. Recommender immediately fails to record OOM - Reason: KeyError. cluster_feeder.go:447] Failed to record OOM {Timestamp:<date> 11:08:56 +0000 UTC Memory:104857600 ContainerID:{PodID:{Namespace:<ns> PodName:<pod>} ContainerName:<container>}}. Reason: KeyError: {<ns> <pod>-69856fc5f7-d8hvv}

  9. repeat loop

From the codebase, it looks like that KeyError is returned when processing a non-existent pod (here). The pod no longer existed by the time the VPA Recommender tried to record the OOM. Because of this, the backup mechanism, the custom memory bump-up after OOMKill, wasn't able to increase the resources: VPA returned early with a KeyError (here) and never reached that point.

We took the following steps to stop the bleeding:

  • disabled the VPA targets by changing updatePolicy.updateMode from Auto to "Off".
  • deleted the VPA targets manually.
  • increased controlledResources.minAllowed.memory to a more suitable number to avoid the initial OOM error.
  • followed VPA docs and increased the custom memory bump-up after OOMKill, although I don't believe that VPA reached that point since it exited by returning a KeyError earlier (here):
    • 'oom-bump-up-ratio': 2.0
    • 'oom-min-bump-up-bytes': 524288000
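For context, my understanding of the bump-up rule behind these two flags (a simplified sketch based on the flag documentation, not a verbatim copy of the VPA code): after an OOMKill at a given memory usage, the next sample is the larger of usage times the ratio and usage plus the minimum bump.

```go
package main

import "fmt"

// bumpedMemory sketches the recommender's OOM bump-up rule: the post-OOM
// memory sample is max(memoryUsed * oom-bump-up-ratio,
// memoryUsed + oom-min-bump-up-bytes).
func bumpedMemory(memoryUsed int64, bumpUpRatio float64, minBumpUpBytes int64) int64 {
	byRatio := int64(float64(memoryUsed) * bumpUpRatio)
	byBytes := memoryUsed + minBumpUpBytes
	if byRatio > byBytes {
		return byRatio
	}
	return byBytes
}

func main() {
	// With the values above (ratio 2.0, min bump 524288000 bytes) and the
	// 100MiB (104857600-byte) usage from our logs, the additive bump wins:
	fmt.Println(bumpedMemory(104857600, 2.0, 524288000)) // 629145600
}
```

In our case none of this mattered, because the KeyError path exited before any sample could be recorded.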

How to reproduce it (as minimally and precisely as possible):

The error could be reproduced by having a memory-intensive workload that OOMs very quickly.
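Something like the following should trigger the loop (illustrative only; the image, names, and sizes are made up, the point is a limit below the workload's real allocation plus a VPA with a too-low minAllowed):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: oom-repro
spec:
  replicas: 1
  selector:
    matchLabels: {app: oom-repro}
  template:
    metadata:
      labels: {app: oom-repro}
    spec:
      containers:
      - name: stress
        image: polinux/stress
        # Allocates 256M, far above the 100Mi limit, so the container
        # OOMs almost immediately after starting.
        command: ["stress", "--vm", "1", "--vm-bytes", "256M", "--vm-hang", "0"]
        resources:
          requests: {memory: 100Mi}
          limits: {memory: 100Mi}
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: oom-repro
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: oom-repro
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
    - containerName: stress
      minAllowed: {memory: 50Mi}
```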

nikimanoledaki avatar Apr 12 '24 10:04 nikimanoledaki

Hey @nikimanoledaki thanks for filing this and the great investigation! We've been observing similar behavior where the recommender drops down to the minAllowed values. So far, most occurrences seem to have been connected to the vpa-recommender restarting right before the recommendation dropped, maybe this is something you can also look at in your environments. I'd love to compare notes about the investigations at some point!

I tried fixing some part of this a while ago with https://github.com/kubernetes/autoscaler/pull/5326 but that doesn't help if a KeyError prevents actually adding reasonable memory samples to increase the recommendations ;(

voelzmo avatar Apr 19 '24 08:04 voelzmo

We've been using 1.1.1 for a while, and although it isn't causing us explicit trouble, we still get KeyError on OOM kill events pretty regularly, maybe 5 times daily.

akloss-cibo avatar May 22 '24 14:05 akloss-cibo

Thank you @voelzmo for your feedback!

most occurrences seem to have been connected to the vpa-recommender restarting right before the recommendation dropped

The VPA Recommender had been running for at least 24h before the update occurred, so unfortunately I'm not sure it was due to this. A few other pieces of info that we found in the meantime:

  • The trigger of the quick OOM loop was the VPA Updater capping the memory request of the given workload down to minAllowed, which had been set to 100MiB, far too low for this memory-intensive workload. This is what started the quick OOM loop.
  • Before this, the workload had been alive for exactly 12 hours, which is also exactly the VPA Updater's podLifetimeUpdateThreshold. This means the VPA Updater had been holding off from updating this Pod because it was deemed "short-lived."
  • It remains unclear what led to the capping decision.
  • I've noticed a few other quick OOM loops occur that get self-resolved after an hour or two. This is okay for some workloads but goes against SLX for others.
  • Since then, we've implemented an alert based on the quick OOM log so that we're alerted when this happens, to be able to disable VPA, investigate, fix any misconfiguration, and re-enable it.

Thank you @akloss-cibo! Do you also see a log that these are quick OOM kill events? e.g. update_priority_calculator.go:114] quick OOM detected in pod <ns>/<pod>, container <container>? Do these self-resolve after some time, or else, what do you do?

nikimanoledaki avatar May 23 '24 07:05 nikimanoledaki

/area vertical-pod-autoscaler

adrianmoisey avatar May 23 '24 09:05 adrianmoisey

No, we don't see quick OOM detected messages at all. We see messages like:

W0522 23:52:40.499263       1 cluster_feeder.go:447] Failed to record OOM {Timestamp:2024-05-22 23:50:42 +0000 UTC Memory:2621440000 ContainerID:{PodID:{Namespace:thingy-dev PodName:thingy-dev-745886787d-9rhv9} ContainerName:thingy}}. Reason: KeyError: {thingy-dev thingy-dev-745886787d-9rhv9}

Right now, these don't appear to be creating a problem for us. Most of the time, it is logged (unhelpfully) for a pod that doesn't have a VPA covering it at all, but we do see it occasionally for pods that do have a VPA.

akloss-cibo avatar May 23 '24 10:05 akloss-cibo

Interesting. Have you considered using VPA in memory-saver mode? I'm looking into this now. It would skip the KeyError log - see codebase here. At first I didn't want to skip this error since it pointed to this issue, but now I'm not convinced of the value of this error after all - curious to hear what you think 🤔

Sidenote - memory-saver could also solve the VPA memory leak and reduce the resource utilization of VPA itself: https://github.com/kubernetes/autoscaler/issues/6368

nikimanoledaki avatar May 23 '24 11:05 nikimanoledaki

I'm not deeply familiar with the VPA code, but my general-purpose programming instinct is not to ignore a KeyError like this; something that should be populating clusterState isn't. If clusterState is populated by something scheduled, this is just a race condition, in which case it is "normal" for an OOM to occur for something not in clusterState. If that's true, I think the better solution is for clusterState.AddSample to handle this gracefully and add the appropriate entry to clusterState. It definitely shouldn't return a KeyError; even returning a "PodNotKnownError" would be better than KeyError.

I may try memory saver mode anyway though; we do a fair amount of batch processing in Kubernetes which creates a lot of pods without VPA on occasion. (I have a VPA for vpa-recommender itself to handle this.) This leads me to wonder, why does the VPA track pods that don't have a VPA (aka why isn't memory-saver the only mode for VPA)?

akloss-cibo avatar May 23 '24 14:05 akloss-cibo

This leads me to wonder, why does the VPA track pods that don't have a VPA (aka why isn't memory-saver the only mode for VPA)?

VPA's standard implementation for historical storage is VPACheckpoints, which means that if you decide to start tracking a certain workload with VPA, it takes ~8 days to get the most accurate recommendation. If you don't run in memory-saver mode, the VPA already gathers data about the containers in memory, so if you decided to enable VPA for a new container and the recommender had already been running for 8 days, you would get accurate recommendations straight away.

I absolutely think --memory-saver=true should be the default, but depending on your use-cases, not running in memory-saver mode can be a useful thing.
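For reference, enabling it is a single recommender flag. A hypothetical container-args snippet (the exact location depends on your install method or Helm chart):

```yaml
# vpa-recommender container spec (placement varies by chart/manifests)
containers:
- name: recommender
  image: registry.k8s.io/autoscaling/vpa-recommender:1.1.1
  args:
  - --memory-saver=true   # only track pods that have a matching VPA object
```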

voelzmo avatar May 28 '24 11:05 voelzmo

This is great feedback, @akloss-cibo & @voelzmo! My gut feeling was also not to suppress the KeyErrors, which is why we have not enabled memory-saver yet. However, the use case that @voelzmo describes makes sense to me. It sounds like memory-saver could make VPA easier to maintain. Currently the main issue with maintaining VPA in our large Kubernetes clusters is the frequent OOMKill errors, which may indicate that VPA's Updater/Recommender is losing state, and which could be causing the bug described earlier in this issue.

See this feature request for HA VPA + my comment here: https://github.com/kubernetes/autoscaler/issues/6846#issuecomment-2136860926

nikimanoledaki avatar May 29 '24 08:05 nikimanoledaki

My gut feeling was also to not avoid the KeyErrors which is why we have not enabled memory-saver yet.

This is true: enabling memory-saver mode will mask the KeyErrors which arise for other reasons. So if you don't have this mode enabled and still see KeyErrors (as described in this bug's OP), then this might be an indication that something else is going wrong. So for investigating this bug, it will probably be helpful to leave memory-saver=false.

voelzmo avatar May 29 '24 12:05 voelzmo

Hi folks! I have some updates and questions regarding the original issue.

We've seen this issue occur when VPA caps the memory req/lim to minAllowed. This value was set too low so the workload was OOMKilled almost immediately after spinning up and VPA did not have time to update its resources. The Pod is not found, hence the KeyError. Increasing minAllowed fixes the OOMKill issue.

However, some workloads have a memory utilization that can vary widely e.g. 100MiB - 1GiB+. Increasing minAllowed for all of them means that some are overprovisioned. This increases the cost significantly.

How does the VPA Updater decide to cap to minAllowed? The circumstances for this recommendation are unclear to me.

Secondly, for some workloads, this capping decision seems incorrect based on their previous utilization data. I'm not sure how to investigate this in the codebase. Do you think this could be due to the wide range of memory utilization for this type of workload?

nikimanoledaki avatar Jun 18 '24 11:06 nikimanoledaki

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 16 '24 12:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Oct 16 '24 12:10 k8s-triage-robot

/remove-lifecycle rotten

BojanZelic avatar Nov 05 '24 19:11 BojanZelic

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 03 '25 19:02 k8s-triage-robot

Hi folks! I have some updates and questions regarding the original issue.

We've seen this issue occur when VPA caps the memory req/lim to minAllowed. This value was set too low so the workload was OOMKilled almost immediately after spinning up and VPA did not have time to update its resources. The Pod is not found, hence the KeyError. Increasing minAllowed fixes the OOMKill issue.

However, some workloads have a memory utilization that can vary widely e.g. 100MiB - 1GiB+. Increasing minAllowed for all of them means that some are overprovisioned. This increases the cost significantly.

How does the VPA Updater decide to cap to minAllowed? The circumstances for this recommendation are unclear to me.

Secondly, for some workloads, this capping decision seems incorrect based on their previous utilization data. I'm not sure how to investigate this in the codebase. Do you think this could be due to the wide range of memory utilization for this type of workload?

@nikimanoledaki My assumption is that you're hitting this bug https://github.com/kubernetes/autoscaler/issues/7726, which leads to a memory recommendation of 0. Capping to minAllowed only happens if the calculated recommendation is below the minAllowed setting. What you're describing regarding differences in minAllowed values across your workloads makes sense; my suggestion is to ensure that each VPA has the correct minAllowed value set for its specific workload, while the global minAllowed flag is the absolute minimum below which no workload in your cluster would be able to run (e.g. for us this is something around 10MiB).
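A sketch of that suggestion (names and sizes are illustrative): keep the global floor tiny, and give each memory-hungry workload its own floor via the VPA's resourcePolicy.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: memory-hungry-workload   # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: memory-hungry-workload
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        memory: 200Mi   # floor chosen for this specific workload
```

That way, a bogus 0-byte recommendation gets capped to a per-workload value the workload can actually survive on, without overprovisioning every other workload in the cluster.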

We hope to get https://github.com/kubernetes/autoscaler/issues/7726 fixed soon now that we think we understand how it can happen – let's see whether these KeyErrors still occur afterwards.

voelzmo avatar Feb 04 '25 08:02 voelzmo

/remove-lifecycle stale

voelzmo avatar Feb 04 '25 08:02 voelzmo

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 05 '25 09:05 k8s-triage-robot

/remove-lifecycle stale

BojanZelic avatar May 06 '25 00:05 BojanZelic

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 04 '25 01:08 k8s-triage-robot

/remove-lifecycle stale

I think there's still potentially a bug in the area of how we're recording OOMKill samples.

voelzmo avatar Aug 04 '25 08:08 voelzmo

Also cross-linking issue https://github.com/kubernetes/autoscaler/issues/7867 and in particular the investigations I did in https://github.com/kubernetes/autoscaler/issues/7867#issuecomment-2724657062 describing how OOM samples should get into the histogram. Tracing this in a bit more detail could help understand if there's an issue with how recommender tries to record OOMkill samples.

voelzmo avatar Aug 07 '25 09:08 voelzmo

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 05 '25 10:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Dec 05 '25 11:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jan 04 '26 11:01 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jan 04 '26 11:01 k8s-ci-robot

/remove-lifecycle rotten /reopen

adrianmoisey avatar Jan 04 '26 12:01 adrianmoisey

@adrianmoisey: Reopened this issue.

In response to this:

/remove-lifecycle rotten /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jan 04 '26 12:01 k8s-ci-robot