cri-dockerd

BestEffort pods are using swap

Open · robertbotez opened this issue 3 months ago • 5 comments

What happened?

I already opened a ticket on the Kubernetes repo, which led me here.

I was testing swap support and ran into unexpected behavior. The documentation states that only pods in the Burstable QoS class may use the host's swap memory. I created two single-replica Ubuntu deployments, one Burstable and one BestEffort, and in each pod I ran stress --vm 1 --vm-bytes 6G --vm-hang 0 to observe memory consumption. The host has 4GB of RAM and 5GB of swap. In both cases the pod started using swap once it had exhausted the available RAM. Wasn't the BestEffort pod supposed to be restarted when it hit the host's RAM limit? Note that the kubelet is configured with swapBehavior=LimitedSwap. I attached two screenshots showing the host's normal consumption and the consumption after running the stress command inside the pod (Screenshot 2024-03-27 at 12 02 30, Screenshot 2024-03-27 at 12 37 02).
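
For reference, my understanding of the documented LimitedSwap behavior (taken from the NodeSwap docs, so treat the exact formula as my assumption rather than something I verified in code) is that a Burstable container gets a swap limit proportional to its memory request, while BestEffort containers get no swap at all. Roughly, for this node:

# Assumed formula from the NodeSwap docs:
#   swap limit = (container memory request / node memory capacity) * node swap capacity
# For this node (4 GiB RAM, 5 GiB swap) and a hypothetical 1 GiB memory request:
$ request_mib=1024; ram_mib=4096; swap_mib=5120
$ echo "expected Burstable swap limit: $(( request_mib * swap_mib / ram_mib )) MiB"
expected Burstable swap limit: 1280 MiB
# A BestEffort container has no memory request, so I would expect its swap limit to be 0.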

What did you expect to happen?

I expected the BestEffort pod to be killed once it consumed more memory than the host has available in RAM.
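
Concretely, under LimitedSwap I would have expected the BestEffort container's swap limit to be zero (this is my assumption of the intended state, shown here only for contrast with the actual output further down):

# inside the BestEffort container, the value I expected (not what I observed):
$ cat /sys/fs/cgroup/memory.swap.max
0

With no swap available to it, the stress process should then be OOM-killed once it exhausts the node's RAM.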

How can we reproduce it (as minimally and precisely as possible)?

  • set up a VM running Ubuntu 22.04 with 4GB of RAM
  • set the swap partition to 5GB
  • install docker, cri-dockerd and the kubernetes packages using the versions listed below
  • configure the kubelet with the config provided below
  • install the Calico CNI
  • after the cluster is bootstrapped, apply the following deployment
$ cat test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ubuntu-deployment
  labels:
    app: ubuntu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ubuntu
  template:
    metadata:
      labels:
        app: ubuntu
    spec:
      containers:
      - name: ubuntu
        image: ubuntu:22.04
        resources:
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "while true; do sleep 30; done;" ]
  • this should deploy a BestEffort pod; you can verify that by running kubectl get pod <pod-name> --output=yaml
  • exec into the pod, run apt update && apt install stress, then run stress --vm 1 --vm-bytes 6G --vm-hang 0
  • find the node the pod is running on with kubectl get po -o wide, then ssh to that node and run htop. You should now see the deployed BestEffort pod consuming swap memory, which according to the docs it shouldn't be able to do.
  • if you exec into the pod and check memory.swap.max, it is set to max. From what I understand, even though swapBehavior is set to LimitedSwap in the kubelet, cri-dockerd may be setting the cgroup's memory.swap.max to max; the equivalent host-side checks I used are sketched right after the output below.
$ cat /sys/fs/cgroup/memory.swap.max 
max
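
The same thing can be checked from the node itself. These are the host-side checks I used (the cgroup path assumes the default slice layout the systemd cgroup driver creates on cgroup v2, so adjust if yours is named differently):

$ kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
BestEffort
$ # the container-level memory.swap.max values here show the same "max" I see inside the pod
$ find /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice -name memory.swap.max -exec grep -H . {} +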

Anything else we need to know?

I am using cgroup v2.
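
A quick way to confirm the unified hierarchy is mounted, in case it matters:

$ stat -fc %T /sys/fs/cgroup
cgroup2fs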

Here is my kubelet config.

$ cat /var/lib/kubelet/config.yaml 
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
cgroupDriver: systemd
clusterDNS:
- 10.96.0.10
clusterDomain: cluster.local
containerRuntimeEndpoint: ""
cpuManagerReconcilePeriod: 0s
enableServer: true
evictionPressureTransitionPeriod: 0s
failSwapOn: false
featureGates:
  NodeSwap: true
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMaximumGCAge: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging:
  flushFrequency: 0
  options:
    json:
      infoBufferSize: "0"
  verbosity: 0
memorySwap:
  swapBehavior: LimitedSwap
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
resolvConf: /run/systemd/resolve/resolv.conf
rotateCertificates: true
runtimeRequestTimeout: 0s
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
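
To rule out the kubelet silently ignoring these settings, the running configuration can also be read back from the node's configz endpoint (replace <node-name>; the grep is just a rough filter and the exact JSON layout may differ):

$ kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | grep -io '"swapbehavior":"[^"]*"'
# expected: "swapBehavior":"LimitedSwap" (with NodeSwap: true under featureGates)

If the kubelet reports LimitedSwap here, the problem is more likely on the runtime side.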

Kubernetes version

$ kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.3

Cloud provider

Hetzner Cloud, but Kubernetes was deployed using `kubeadm`.

OS version

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux fs-kube-dev-1 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Install tools

$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"29", GitVersion:"v1.29.3", GitCommit:"6813625b7cd706db5bc7388921be03071e1a492d", GitTreeState:"clean", BuildDate:"2024-03-15T00:06:16Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}

Container runtime (CRI) and version (if applicable)

$ cri-dockerd --version
cri-dockerd 0.3.11 (9a8a9fe)

Related plugins (CNI, CSI, ...) and versions (if applicable)

calico: version: 3.27.2

robertbotez · Apr 01 '24 07:04