[KEP-2400] Node swap updates, GA criterias and clarifications
- One-line PR description: Add updates, GA criterias and clarifications
- Issue link: https://github.com/kubernetes/enhancements/issues/2400
- Other comments:
This PR updates the KEP in the following ways:
Emphasize that this KEP is about basic swap enablement. The original KEP indicated that pod-level swap APIs are out of scope: https://github.com/kubernetes/enhancements/blob/155a949378fe85d4ca936176ad48103bf9567402/keps/sig-node/2400-node-swap/README.md?plain=1#L163-L166 https://github.com/kubernetes/enhancements/blob/155a949378fe85d4ca936176ad48103bf9567402/keps/sig-node/2400-node-swap/README.md?plain=1#L142-L144
However, the lack of APIs and the implicit nature of the current implementation sometimes bring suggestions to extend the API under this KEP.
This KEP focuses on basic swap enablement. Follow-up KEPs on several topics (e.g. customization, zram/zswap support, and more) will be introduced in the near future, in which we will be able to design and implement each extension in a focused way.
To ensure we're on the same page, this topic was recently raised in a sig-node meeting. In this meeting there was a very broad consensus that this approach makes sense, especially since the NodeSwap feature is important to many different parties which want it to "just work".
This PR updates the KEP to emphasize this approach.
GA criteria. The PR adds GA criteria, alongside the intent to GA in version 1.32.
Make sure PRR is ready
Updates. Since the last KEP update, many improvements have been made and many concerns addressed. For example:
- Memory-backed volumes
- Added metrics
- Kubelet Configuration examples (see the illustrative sketch below)
- and more
This PR updates the KEP to reflect these updates.
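As a quick illustration of what basic swap enablement looks like at the kubelet level (a hedged sketch based on the current beta behavior, not an excerpt from the KEP; values are examples only):

```yaml
# Illustrative kubelet configuration for basic swap enablement.
# Assumes swap is already provisioned on the host and that the NodeSwap
# feature gate is still required (i.e. before the GA graduation this PR targets).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  NodeSwap: true
failSwapOn: false            # let the kubelet start on a node that has swap enabled
memorySwap:
  swapBehavior: LimitedSwap  # Burstable pods get an automatically calculated swap limit
```

Note that pods never request swap explicitly; with LimitedSwap the kubelet derives a per-container swap limit automatically, which is why pod-level swap APIs remain out of scope for this KEP.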
/cc @SergeyKanzhelev @haircommander @harche
Please update https://github.com/kubernetes/enhancements/blob/master/keps/prod-readiness/sig-node/2400.yaml and update missing bits of the PRR questionnaire.
Thanks @deads2k!
Please update https://github.com/kubernetes/enhancements/blob/master/keps/prod-readiness/sig-node/2400.yaml and update missing bits of the PRR questionnaire.
I see you're the assigned approver for alpha/beta. Is it OK to also assign you as the approver for GA?
Done! PTAL :)
/retitle [KEP-2400] Node swap ppdates, GA criterias and clarifications
D'oh
/retitle [KEP-2400] Node swap updates, GA criterias and clarifications
@sftim @deads2k @haircommander @SergeyKanzhelev @mrunalp
Would you kindly have another look at this? Is there anything missing? I believe that most of this PR has already been agreed upon in previous sig-node calls. Thanks!
We have been using this feature for a while in production and it generally performs well. We still need to set --experimental-allocatable-ignore-eviction on the kubelet to allow setting the memory eviction threshold to 0. Otherwise, the kubelet might evict pods before their memory can be swapped out. With that option it has been pretty solid.
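For readers hitting the same issue, here is a hedged sketch of what this workaround looks like in practice (illustrative only; --experimental-allocatable-ignore-eviction is a kubelet command-line flag rather than a config field, and it is deprecated/experimental):

```yaml
# Illustrative sketch of the workaround described above, not a recommendation:
# disable the memory.available hard-eviction signal so the kernel gets a chance
# to swap before the kubelet evicts. Per the comment above, the kubelet only
# accepts this when started with --experimental-allocatable-ignore-eviction.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
memorySwap:
  swapBehavior: LimitedSwap
evictionHard:
  memory.available: "0"   # effectively disables memory-based hard eviction
```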
/milestone 1.32
@haircommander: The provided milestone is not valid for this repository. Milestones in this repository: [v1.25, v1.27, v1.28, v1.29, v1.30, v1.31, v1.32, v1.33, v1.34]
Use /milestone clear to clear the milestone.
In response to this:
/milestone 1.32
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/milestone v1.32
I'd like to share an update: we agreed [1] that NFD support for swap brings enough debuggability for this KEP. We'll definitely circle back to improving this as part of the follow-up API extension KEP. Thanks @kannon92 for bringing swap support to NFD, and thanks @dchen1107 for helping push the discussion forward!
With this, I believe that this PR is entirely ready to get in. @sftim @deads2k @haircommander @SergeyKanzhelev @mrunalp - can you please have another look so we can push this forward? Thanks in advance!
[1] The discussion started at the sig-node meeting, then continued to this comment.
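As a side note on how NFD support helps here: once NFD labels swap-enabled nodes, swap-dependent workloads can be steered to them with an ordinary nodeSelector. A hypothetical sketch; the label key and image below are placeholders, not the actual keys NFD exposes:

```yaml
# Hypothetical example: schedule a swap-dependent workload onto nodes that NFD
# has labeled as swap-enabled. The label key is a placeholder; check the NFD
# documentation for the actual feature label.
apiVersion: v1
kind: Pod
metadata:
  name: swap-friendly-workload
spec:
  nodeSelector:
    feature.node.kubernetes.io/memory-swap: "true"   # placeholder label key
  containers:
  - name: app
    image: registry.example.com/app:latest           # placeholder image
    resources:
      requests:
        memory: "8Gi"   # with LimitedSwap, the swap limit is derived from this request
```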
Small nits but otherwise this looks great!
Small nits but otherwise this looks great!
Thank you very much for your review!
/lgtm
/assign @dchen1107
@SergeyKanzhelev thanks a lot for taking a look!
There's one general clarification I want to make w.r.t. the scope of this KEP vs follow-up KEPs. As written in the PR description, this KEP focuses on basic swap enablement. There are many ways to extend the swap feature, but since many different parties are interested in using swap in production (and already do), we'd rather tackle this in an incremental fashion: bring basic swap support first, and then extend it in many ways.
The two main follow-ups that we're already actively discussing and that would be presented to the community very soon are:
- Extending customizability: to let users customize how swap is allocated to pods, instead of relying only on the automatic limit calculation.
- Handling evictions: adapt the eviction mechanism to be more swap-aware. Further down the road, we're thinking of extending the eviction manager to deal with IO pressure, and maybe even introducing a plugin system.
These are still work in progress and under heavy discussion, but we are already working hard on carving them out, and we'll present them to the community very soon.
The intention to scope this KEP to basic swap enablement only and tackle the other issues in follow-up KEPs has been discussed in many places already, including the sig-node meeting, where it seemed to gain broad consensus.
We have been using this in production for more than a year now. It generally works well and helps with performance (mostly due to more available cache memory when unused Java stuff is paged out).
There are limitations if you want to use memory reservations that exceed physical memory. This is very useful with Java workloads (especially in CI environments). Here you might see evictions for pods that did not exceed their memory reservation due to (in my opinion) false memory pressure detected by kubelet. There is a workaround (which requires an experimental flag in kubelet) to disable memory evictions. This has worked well for us since nodes are able to reclaim memory by swapping.
Overall it's a feature that works well and is definitely ready for use. It won't be enabled automatically anyway since you have to allow swap explicitly.
There are limitations if you want to use memory reservations that exceed physical memory. This is very useful with Java workloads (especially in CI environments). Here you might see evictions for pods that did not exceed their memory reservation due to (in my opinion) false memory pressure detected by kubelet. There is a workaround (which requires an experimental flag in kubelet) to disable memory evictions. This has worked well for us since nodes are able to reclaim memory by swapping.
This is concerning. False positive evictions sound like a non-ideal user experience. @iholder101 should we fix something to improve the experience? @jabdoa2 what flag are you referring to?
There are limitations if you want to use memory reservations that exceed physical memory. This is very useful with Java workloads (especially in CI environments). Here you might see evictions for pods that did not exceed their memory reservation due to (in my opinion) false memory pressure detected by kubelet. There is a workaround (which requires an experimental flag in kubelet) to disable memory evictions. This has worked well for us since nodes are able to reclaim memory by swapping.
This is concerning. False positive evictions sound like a non-ideal user experience. @iholder101 should we fix something to improve the experience? @jabdoa2 what flag are you referring to?
It is called --experimental-allocatable-ignore-eviction and allows you to set the eviction threshold to 0. I explained the background here: https://github.com/kubernetes/kops/issues/15821
- Handling evictions: adapt the eviction mechanism to be more swap-aware. Further down the road, we're thinking of extending the eviction manager to deal with IO pressure, and maybe even introducing a plugin system.
If we are moving evictions out of scope for this KEP, we need to update the KEP readme and list the results of the investigations mentioned there, update documentation to explain how things work today, suggest some workarounds, etc. We cannot just say that, with a GA feature, we know false positive evictions are expected.
The intention to scope this KEP to basic swap enablement only and tackle the other issues in follow-up KEPs has been discussed in many places already, including the sig-node meeting, where it seemed to gain broad consensus.
I am also getting feedback, but as I mentioned, mostly from the Unlimited scenario. I am not against moving this forward, I just want to make sure we are not misleading customers who want to use it and that we are warning about all the limitations we have in this feature. Also, I haven't looked deeply into eviction, so if you can post more details on what today's behavior is, and why it is OK to GA with it, it will help.
@jabdoa2 Hi! Thank you for the insight. Can you please share your experience working w/ swap and w/o memory hard eviction in production? How do you manage global node memory pressure situations in terms of monitoring and mitigation? (e.g. swap on NVMe, zswap, zram, swap-related OS tuning, user-space tools such as oomd)
@jabdoa2 Hi! Thank you for the insight. Can you please share your experience working w/ swap and w/o memory hard eviction in production? How do you manage global node memory pressure situations in terms of monitoring and mitigation? (e.g. swap on NVMe, zswap, zram, swap-related OS tuning, user-space tools such as oomd)
In most cases we actually run swap on EBS (gp3). That limits us to 3k IOPS. We monitor for that but hardly ever hit it (maybe 3-4 times across 15 clusters in almost two years). (We use NVMe on nodes which have it, but it's rather rare in our setup.) We add 50% swap on top (i.e. on nodes with 32GB we add 16GB swap). We also make sure that pods cannot use a limit which exceeds their (more or less) guaranteed swap slice. In our case that means that you cannot set a memory limit above 150% of your reservation (this KEP makes sure that you get the same swap percentage as your reservation). This way it is virtually impossible to exceed swap on a node. All pods would have to use all their memory at the same time just to fill it to 100%. Often we still see 100% swap usage and nodes with quite a bit of free memory. Linux will happily use that memory as cache, which improves performance by a lot.
On top of IOPS we monitor the usual Prometheus metrics (eviction rate, unready pods/deployments, etc.).
Just to make sure: disabling eviction will not stop the Linux OOM killer, so those events still happen (though very rarely). Also, if you hit your memory limit, cgroups will terminate your container.
I guess if you ran your workload without a (reasonable) memory limit you would be in pain. However, you would also be in pain without swap. With swap it will just take a bit longer until stuff hits the fan ;-).
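To spell out the arithmetic behind this sizing, here is a hedged worked example using the KEP's LimitedSwap proportional limit; the node numbers come from the comment above, while the 8 GiB request is an illustrative value:

$$
\text{swap limit} = \frac{\text{container memory request}}{\text{node memory capacity}} \times \text{node swap capacity}
$$

$$
\frac{8\ \text{GiB}}{32\ \text{GiB}} \times 16\ \text{GiB} = 4\ \text{GiB},
\qquad
1.5 \times 8\ \text{GiB} = 12\ \text{GiB} = 8\ \text{GiB} + 4\ \text{GiB}
$$

With swap sized at 50% of RAM and limits capped at 150% of requests, each container's limit is covered by its request plus its proportional swap share, so RAM plus swap cannot be oversubscribed even if every pod hits its limit at once, which matches the observation above.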
@jabdoa2 Thank you very much for your feedback! It is absolutely valuable and important!
This is concerning. False positive evictions sound like a non-ideal user experience.
I explained the background here: https://github.com/kubernetes/kops/issues/15821
@SergeyKanzhelev the term "false positive" is inaccurate. In the linked issue, the author writes:
when we use more memory than RAM on the node, pods get evicted by Kubernetes even though the node has a lot of memory available (due to being able to use swap).
In other words, the author expects that the eviction manager would treat swap memory as "regular" RAM memory, which is not how the eviction manager works today. Since the eviction manager is out-of-scope for this KEP, it remains swap-unaware and evicts on the basis of RAM memory alone while ignoring swap memory.
If we are moving evictions out of scope for this KEP, we need to update the KEP readme and list the results of the investigations mentioned there, update documentation to explain how things work today, suggest some workarounds, etc. We cannot just say that, with a GA feature, we know false positive evictions are expected.
I am not against moving this forward, I just want to make sure we are not misleading customers who want to use it and that we are warning about all the limitations we have in this feature. Also, I haven't looked deeply into eviction, so if you can post more details on what today's behavior is, and why it is OK to GA with it, it will help.
@SergeyKanzhelev You're absolutely right.
I've added the following commit that adds a section for evictions under Risks and Mitigations: b5b281d (#4701). In this commit I explain the current limitations, why they are acceptable, and some best practices regarding them. In addition, I've updated 45b1bc9 (#4701) so that adding documentation regarding eviction limitations is a GA criterion.
To summarize the added sections in one sentence: the worst-case scenario with this KEP is that either the kubelet would start evicting before the kernel has a chance to swap memory, or the kernel would swap memory before the kubelet has a chance to evict pods. Both of these scenarios are acceptable. In the best scenario, the memory eviction threshold would be tuned according to the kernel's swap watermarks, as now explained in the KEP.
I emphasize again that there are no false positive evictions taking place. The eviction manager's logic remains as-is; it simply does not take swap into account. This is an acceptable and expected behavior that can (and will!) be improved in a follow-up KEP.
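As an illustration only of where that tuning lives (the value below is a placeholder, not a recommendation; the actual guidance is in the section added in b5b281d):

```yaml
# Placeholder sketch: the knob being discussed is the kubelet's memory.available
# hard-eviction threshold. The KEP discusses choosing it in relation to the
# kernel's swap watermarks so that swapping and kubelet eviction don't race.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"   # placeholder value; tune per the KEP's guidance
```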
Hi @iholder101 @SergeyKanzhelev,
When talking about eviction, I think it's important to mention the eviction order, which monitors pod resource usage, including memory. Currently, the working set calculation doesn't consider swap usage. It might be worth adding this detail to the documentation. What do you think?
Thanks!
Hi @iholder101 @SergeyKanzhelev,
When talking about eviction, I think it's important to mention the eviction order, which monitors pod resource usage, including memory. Currently, the working set calculation doesn't consider swap usage. It might be worth adding this detail to the documentation. What do you think?
Thanks!
That is why it feels a bit like false positives when you start using swap. This clearly only affects pods with swap. But for those you will see evictions even though they exceeded neither their reserved memory nor their "guaranteed" swap. As I wrote earlier, you can work around that with the (deprecated) experimental flag to disable memory evictions. In my opinion, the limitations should clearly state this and also explain how to apply the workaround.
Hi @iholder101 @SergeyKanzhelev, When talking about eviction, I think it's important to mention the eviction order, which monitors pod resource usage, including memory. Currently, the working set calculation doesn't consider swap usage. It might be worth adding this detail to the documentation. What do you think? Thanks!
That is why it feels a bit like false positives when you start using swap. This clearly only affects pods with swap. But for those you will see evictions even though they exceeded neither their reserved memory nor their "guaranteed" swap. As I wrote earlier, you can work around that with the (deprecated) experimental flag to disable memory evictions. In my opinion, the limitations should clearly state this and also explain how to apply the workaround.
Yes, disabling hard memory eviction is one option (not a bad one in my opinion), but I was referring to this commit that suggests a better configuration for the hard eviction threshold. I think it should also warn that the eviction order doesn't consider swap usage. Apologies if I wasn't clear.
@dchen1107 @SergeyKanzhelev PTAL
Thank you for specifying the metrics. PRR looks good.
/approve
Thank you @deads2k!
This is the current status for this KEP:
- [X] Test a wide variety of scenarios that may be affected by swap support, including tests with aggressive memory stress.
- [X] Address memory-backed volumes, which should not have access to swap.
- [ ] Remove feature gate: a draft PR is ready: https://github.com/kubernetes/kubernetes/pull/127756.
- [X] Exclude high-priority, static and mirrored pods from gaining access to swap.
- [ ] Add documentation regarding encrypted swap. A PR to add a blog post about swap GA with these details is ready: https://github.com/kubernetes/website/pull/48099. After it's merged, I'll add another PR to add general docs.
- [ ] Add documentation regarding limitations around evictions. Same as above.
So things look pretty good :+1:
There have been several discussions back and forth, and we couldn't reach consensus. Honestly, I tried to convince folks to move this forward to GA without disabling the eviction manager by default, and instead document the risk and potential workaround, but I failed to convince even myself, let alone others.
I think we should build a swap-aware eviction manager before GA'ing the feature. A swap-aware eviction manager would:
- Prevent unnecessary evictions: Only evict pods when both physical memory and swap space are exhausted.
- Improve resource utilization: Allow pods to fully utilize swap space without the risk of premature eviction.
- Enhance predictability: Make the system's behavior under memory pressure more predictable and easier to manage.
@SergeyKanzhelev also pointed out that there isn't enough user feedback on the feature. Sergey, can you share more on this?