
Leaking containerd.sock mounts exponentially on host node

Open brettmorien opened this issue 10 months ago • 5 comments

Running a cluster with heavy shared developer use, we've noticed Kubernetes nodes getting into a state where pods can't come up or terminate. We've determined that the nodes that get into this state have a pathological number (32,766) of tmpfs /run/containerd/containerd.sock mounts. We've isolated this to spinning up buildkit pods.

Steps to reproduce

  • Run a k8s cluster
  • Shell into a k8s node running buildkit and run cat /proc/mounts | grep containerd.sock | wc -l
  • Cycle buildkit pods repeatedly and observe the count roughly doubling each time a new pod starts on that node (see the watch one-liner below)
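
A quick way to keep an eye on that count from the node (a sketch, assuming shell access to the node and that the watch utility is installed there):

# re-count host containerd.sock mounts every 5 seconds
watch -n 5 'grep -c containerd.sock /proc/mounts'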

What actually happens

The host node roughly doubles the current number of containerd.sock mounts every time a new buildkit pod starts.

The node becomes unusable when the cat /proc/mounts | grep containerd.sock count hits 32,766 rows that look like this:

tmpfs /run/containerd/containerd.sock tmpfs rw,nosuid,nodev,mode=755 0 0

We have repro'd this issue on v0.13.1 with --containerd-worker=true as well as v0.12.5 with --containerd-worker=false.

Versions

  • Kubelet: Kubernetes v1.24.17-eks-5e0fdde
  • Docker: Docker version 20.10.25, build b82b9f3
  • Containerd: github.com/containerd/containerd 1.7.11 64b8a811b07ba6288238eefc14d898ee0b5b99ba

brettmorien · Mar 28 '24

How do you run BuildKit and containerd? What makes the mount for containerd.sock? Unlike docker.sock, containerd.sock is not expected to be bind-mounted currently.

AkihiroSuda · Mar 29 '24

Since this Buildkit is shared by several developers, we've built a chart that creates a Buildkit deployment that everybody connects to. I had an HPA attached, which exacerbated this problem because the aggressive scaling created lots of new mounts. The contents of the deployment.yaml are based on the deployment put out by the buildkit CLI.

The behavior we see: on every scale-up, a new pod is scheduled and the number of mounts becomes 2^X - 1, where X is the number of pods ever deployed to that node during its lifetime.
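
To make that growth concrete (my own arithmetic from the 2^X - 1 pattern above, not additional measurements): after 1, 2, 3, 4, 5 pods the node would carry 1, 3, 7, 15, 31 containerd.sock mounts, and by the 15th pod it reaches 2^15 - 1 = 32,767, which lines up with the ~32,766 rows observed before the node became unusable.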

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    proxy.istio.io/config: '{"holdApplicationUntilProxyStarts":true}'
  labels:
    {{- include "buildkit.labels" . | nindent 4 }}
    app: buildkit
    rootless: "false"
    runtime: containerd
    worker: containerd
  name: buildkit
spec:
  progressDeadlineSeconds: 600
  replicas: {{ .Values.replicas }}
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      {{- include "buildkit.selectorLabels" . | nindent 6 }}
      app: buildkit
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: buildkit
        rootless: "false"
        runtime: containerd
        worker: containerd
        {{- include "buildkit.selectorLabels" . | nindent 8 }}
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: buildkit
                  rootless: "false"
                  runtime: containerd
                  worker: containerd
              topologyKey: kubernetes.io/hostname
      containers:
        - args:
            - --oci-worker=true
            - --containerd-worker=true
            - --root
            - /var/lib/buildkit/buildkit
          image: docker.io/moby/buildkit:buildx-stable-1
          imagePullPolicy: IfNotPresent
          name: buildkitd
          readinessProbe:
            exec:
              command:
                - buildctl
                - debug
                - workers
            failureThreshold: 3
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources: {{- toYaml .Values.resources | nindent 12 }}
          securityContext:
            privileged: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /etc/buildkit/
              name: buildkitd-config
            - mountPath: /run/containerd/containerd.sock
              name: containerd-sock
            - mountPath: /var/lib/buildkit/buildkit
              mountPropagation: Bidirectional
              name: var-lib-buildkit
            - mountPath: /var/lib/containerd
              mountPropagation: Bidirectional
              name: var-lib-containerd
            - mountPath: /run/containerd
              mountPropagation: Bidirectional
              name: run-containerd
            - mountPath: /var/log
              mountPropagation: Bidirectional
              name: var-log
            - mountPath: /tmp
              mountPropagation: Bidirectional
              name: tmp
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - configMap:
            defaultMode: 420
            name: buildkit
          name: buildkitd-config
        - hostPath:
            path: /run/containerd/containerd.sock
            type: Socket
          name: containerd-sock
        - hostPath:
            path: /var/lib/buildkit/buildkit
            type: DirectoryOrCreate
          name: var-lib-buildkit
        - hostPath:
            path: /var/lib/containerd
            type: Directory
          name: var-lib-containerd
        - hostPath:
            path: /run/containerd
            type: Directory
          name: run-containerd
        - hostPath:
            path: /var/log
            type: Directory
          name: var-log
        - hostPath:
            path: /tmp
            type: Directory
          name: tmp

brettmorien · Mar 29 '24

mountPath: /run/containerd/containerd.sock

This is not supported

AkihiroSuda · Mar 29 '24
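
A minimal sketch of what dropping that mount could look like, assuming the daemon can still open the socket through the /run/containerd hostPath mount that is already present in the manifest above (a hypothetical edit, not an official example):

          volumeMounts:
            # the dedicated containerd.sock bind mount and its containerd-sock
            # volume are removed; the socket remains reachable at
            # /run/containerd/containerd.sock through this directory mount
            - mountPath: /run/containerd
              mountPropagation: Bidirectional
              name: run-containerd
            # (other volumeMounts unchanged)
      volumes:
        - hostPath:
            path: /run/containerd
            type: Directory
          name: run-containerd
        # (other volumes unchanged; the containerd-sock hostPath entry is dropped)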

I definitely believe the issue is related to that mount. I copied the deployment resource basically as-is from a k8s environment after running a buildkit command, so I'm not sure how this mount gets in there (or generally how the deployment resource is constructed).

Editing to add: I think this is starting to look like a kubectl-build issue, which I should probably be working to eliminate from the process now that it's redundant. It wasn't clear to me where the boundary was between these two projects.

brettmorien · Mar 29 '24

OK to close this issue then, @brettmorien?

thompson-shaun · May 16 '24