runc
cgroup v1/v2 compatibility issue when setting memory below the current usage
With cgroup v1, when we set the memory limit to below the current usage (runc update on a running container), the kernel returns EBUSY and runc fails with a nice error message:
ERRO[0000] unable to set memory limit to 27033 (current usage: 270336, peak usage: 6082560)
With cgroup v2, when we do this, the kernel OOM killer just kills the container. This makes the behavior incompatible with cgroup v1.
One (imperfect) workaround is to add a flag to the OCI spec that disallows setting the memory limit to a value lower than the current usage. This is borderline ugly, but at least in most cases we'll return an error instead of letting the container be OOM killed.
(The other, much less serious part of the problem is that when the container disappears in the middle of runc update, we get all sorts of ugly messages.)
could we use memory.high instead of memory.max?
I don't have a complete understanding at this point, but are we talking about the cgroup memory limit applied at the time of container creation? And if that's the case, is the difference that in cgroup v2 the kernel no longer returns EBUSY?
add a flag to OCI spec
And then have runc parse it and fail early instead of the container being OOMKilled?
This is when we try to update the memory limit of an already running container to a value that is less than what it is currently using. In v1, we got EBUSY, but in v2, the kernel applies the value and, if it is below the current usage, the container is OOM killed.
could we use memory.high instead of memory.max?
From the vertical pod autoscaler POV -- yes. That means it will still have to distinguish between v1 and v2, so it does not make sense to add the flag I proposed in the description.
could we use memory.high instead of memory.max
I think that will have to be phase 2 with cgroups v2 in k8s. Phase 1 is just a direct mapping to v1.
Is it possible to get the current memory usage from memory.current and, if the new limit is lower than that, skip the update and return an error? This may be too much help for an OCI runtime...?
Is there a similar problem with other configurations other than memory?
Is there a similar problem with other configurations other than memory?
Not that I know of.