Resource consumption by Python is not limited
Description
I'm building a sandbox service with gVisor, but Python seems to be able to allocate unlimited memory, while a bash script that tries to allocate unlimited memory is marked Error in the Pod status.
Steps to reproduce
- Set up a Kubernetes cluster with a gVisor RuntimeClass.
- Apply the Deployment below. It will try to allocate ~100GB of memory (100,000 appends of a 1MB string) against a 512Mi limit.
cat << 'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: memory-eater-python
  name: memory-eater-python
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: memory-eater-python
  template:
    metadata:
      labels:
        app: memory-eater-python
    spec:
      containers:
      - command:
        - python
        args: ["-c", "import sys; big_list = []; print('Attempting to allocate 100GB of memory...'); [big_list.append(' ' * 10**6) for _ in range(100000)]"]
        image: python
        name: ubuntu
        securityContext:
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 999
        resources:
          limits:
            cpu: "1"
            memory: 512Mi
          requests:
            cpu: 200m
            ephemeral-storage: 200M
            memory: "214748364"
      dnsPolicy: Default
      hostNetwork: true
      restartPolicy: Always
      runtimeClassName: gvisor
EOF
- After a while, run kubectl top
kubectl top pod -n default <pod-name>
I got the result below. The memory shown is ~62GiB at this point; the pod goes on to allocate ~100GiB, which is why I'm investigating how it manages to OOM our machine. (A node-side check of the pod's cgroup limit is sketched after the manifests below.)
NAME                                  CPU(cores)   MEMORY(bytes)
memory-eater-python-887b744f9-2snvs   984m         62654Mi
- As a negative case, the bash script below will be limited and the pod will fail.
cat << 'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: memory-eater-bash
  name: memory-eater-bash
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: memory-eater-bash
  template:
    metadata:
      labels:
        app: memory-eater-bash
    spec:
      containers:
      - command:
        - bash
        - -c
        - big_var=data; while true; do big_var="$big_var$big_var"; done
        image: python
        name: ubuntu
        securityContext:
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 999
        resources:
          limits:
            cpu: "1"
            memory: 512Mi
          requests:
            cpu: 200m
            ephemeral-storage: 200M
            memory: "214748364"
      dnsPolicy: Default
      hostNetwork: true
      restartPolicy: Always
      runtimeClassName: gvisor
EOF
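To see whether the 512Mi limit is actually applied on the node, a node-side check can help. This is a sketch with assumed paths; the cgroup layout differs between the cgroupfs and systemd kubelet drivers and between cgroup v1 and v2, so adjust accordingly:

# From a machine with kubectl access:
POD_UID=$(kubectl get pod -n default <pod-name> -o jsonpath='{.metadata.uid}')
# Then, as root on the node running the pod (the systemd driver replaces '-' with '_' in the UID):
POD_CG=$(find /sys/fs/cgroup -type d \( -name "*${POD_UID}*" -o -name "*${POD_UID//-/_}*" \) 2>/dev/null | head -n1)
echo "pod cgroup: $POD_CG"
cat "$POD_CG/memory.max" "$POD_CG/memory.current" 2>/dev/null    # cgroup v2: limit and current usage
# On cgroup v1, check memory.limit_in_bytes / memory.usage_in_bytes under /sys/fs/cgroup/memory/... instead.

If the limit file reads "max" (v2) or a huge number (v1), no memory limit was set for the sandbox, which would match the behavior above.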
runsc version
runsc version release-20231009.0
spec: 1.1.0-rc.1
docker version (if using docker)
No response
uname
Linux 3090-k8s-node029 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:20:54Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.2", GitCommit:"fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b", GitTreeState:"clean", BuildDate:"2023-02-22T13:32:22Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}
repo state (if built from source)
No response
runsc debug logs (if available)
Haven't done this in the cluster.
Hi; I can't seem to reproduce this, at least on GKE.
gVisor doesn't do memory limiting by itself; instead, it relies on the host Linux kernel to do this. It is set up here as part of container startup, which eventually ends up here to control memory. This way, a single limit covers the total memory usage of the gVisor kernel plus the processes within it. If that total goes over the limit, the sandbox should be killed by the Linux OOM killer, and this should be visible in dmesg on the machine.
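A quick way to look for such a kill on the node is to grep the kernel log; a minimal sketch (exact log wording varies by kernel version):

# On the node where the pod ran:
dmesg -T | grep -i -E 'out of memory|oom-kill|killed process'
# or, if the ring buffer has rotated, via journald:
journalctl -k | grep -i oom

If nothing shows up while the pod's usage clearly exceeds its limit, the cgroup limit was most likely never applied.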
The enforcement mechanism depends on many moving parts, so I suggest checking all of them (a rough sketch of these checks follows the list):
- The OOM killer must be enabled on the host Linux kernel.
- cgroupfs must be mounted on the host (typically at /sys/fs/cgroup).
- Note that cgroupfs comes in two versions (v1 and v2), which changes things quite a bit.
- Make sure runsc's --ignore-cgroups flag is not specified.
- If you use runsc's --systemd-cgroup, make sure you have systemd >= v244.
- Linux.CgroupsPath may need to be set properly in the OCI spec. It is probably incorrect (but I need debug logs to check).
- The gVisor shim can set the dev.gvisor.spec.cgroup-parent annotation to set the cgroups path as well (this would show up in debug logs).
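A rough sketch of those checks from a shell on the node; the OOM-killer sysctl and the assumption that runsc flags are visible in ps are mine, so adapt to your setup (the last two list items really need debug logs):

# Run as root on the node:
cat /proc/sys/vm/panic_on_oom                        # 0 => the OOM killer kills tasks instead of panicking
grep cgroup /proc/mounts                             # cgroupfs mounted? "cgroup" entries => v1, "cgroup2" => v2
cat /sys/fs/cgroup/cgroup.controllers 2>/dev/null    # cgroup v2 only: "memory" must be listed
ps axww | grep '[r]unsc' | grep -c -- --ignore-cgroups   # should print 0
ps axww | grep '[r]unsc' | grep -c -- --systemd-cgroup   # if non-zero, also check: systemctl --version | head -n1 (needs >= 244)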
If all of this is in place, please provide runsc debug logs, details on how you installed gVisor within the Kubernetes cluster (runsc flags etc.), the systemd version (systemd --version), the cgroup version (output of cat /proc/mounts), and which cgroup controllers are enabled (cat /sys/fs/cgroup/cgroup.controllers).
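For the debug logs specifically: when runsc is driven by the containerd shim, one way to enable them is a runsc.toml referenced by the shim's ConfigPath option in /etc/containerd/config.toml. The paths below are assumptions taken from the gVisor containerd configuration docs; adapt them to your install:

# Assumes ConfigPath = "/etc/containerd/runsc.toml" is set for the runsc runtime in containerd:
cat << 'EOF' | sudo tee /etc/containerd/runsc.toml
log_path = "/var/log/runsc/%ID%/shim.log"
log_level = "debug"
[runsc_config]
  debug = "true"
  debug-log = "/var/log/runsc/%ID%/gvisor.%COMMAND%.log"
EOF
sudo mkdir -p /var/log/runsc
sudo systemctl restart containerd

After recreating the pod, the per-sandbox logs under /var/log/runsc/ can be attached to the issue.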
Also please check #10371 which was filed recently after this issue and looks quite similar.
@EtiennePerot Thanks for replying! We have found the problem, thanks to @charlie0129.
It turns out we didn't configure gVisor to use systemd-cgroup, which is the cgroup manager in our cluster. After adding systemd-cgroup and upgrading gVisor to the latest version, the OOM pod is properly killed by Linux. If I understand correctly, the default is to use cgroupfs, which is not the mainstream choice. Would it be better to move to systemd-cgroup as a default?
I also can't seem to find any documentation or FAQ about the cgroup manager. Forgive me if I missed it, but if there truly isn't any, it would be helpful to mention it somewhere in the documentation.
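For reference, a minimal sketch of how the flag can be passed when runsc runs under the containerd shim, assuming the ConfigPath wiring mentioned above and that kubelet/containerd already use the systemd cgroup driver; if the file already has a [runsc_config] section, add the key there instead of overwriting:

# Assumed path; systemd-cgroup maps to runsc's --systemd-cgroup flag:
cat << 'EOF' | sudo tee /etc/containerd/runsc.toml
[runsc_config]
  systemd-cgroup = "true"
EOF
sudo systemctl restart containerd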
Would it be better to move to systemd-cgroup as a default?
See the discussion on https://github.com/google/gvisor/issues/10371 about this. Apparently runc's default behavior is also systemd-cgroup=false, and runsc needs to match runc's behavior in order to remain a drop-in replacement for it. But +1 on the need for documentation.