
document how to run kind in a kubernetes pod

Open BenTheElder opened this issue 6 years ago • 50 comments

NOTE: We do NOT recommend doing this if it is at all avoidable. We don't have another option so we do it ourselves, but it has many footguns.

xref: #284. Additionally, these mounts are known to be needed:

    volumeMounts:
      # not strictly necessary in all cases
      - mountPath: /lib/modules
        name: modules
        readOnly: true
      - mountPath: /sys/fs/cgroup
        name: cgroup
  volumes:
    - name: modules
      hostPath:
        path: /lib/modules
        type: Directory
    - name: cgroup
      hostPath:
        path: /sys/fs/cgroup
        type: Directory

thanks to @maratoid

/kind documentation
/priority important-longterm

We probably need a new page in the user guide for this.

EDIT: Additionally, for any docker-in-docker usage the docker storage (typically /var/lib/docker) should be a volume. A lot of attempts at using kind in Kubernetes seem to miss this one. Typically an emptyDir is suitable for this, as sketched below.
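
A minimal sketch of the relevant pod spec fragment (the volume name here is arbitrary):

    volumeMounts:
      - name: dind-storage
        mountPath: /var/lib/docker
  volumes:
    - name: dind-storage
      emptyDir: {}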

EDIT2: You also probably want to set a pod DNS config to some upstream resolvers, so as not to have your inner cluster pods trying to talk to the outer cluster's DNS, which is probably on a clusterIP and not necessarily reachable:

  dnsPolicy: "None"
  dnsConfig:
    nameservers:
      - 1.1.1.1
      - 1.0.0.1

EDIT3: Loop devices are not namespaced; follow #1248 to find our current workaround.

BenTheElder avatar Feb 15 '19 00:02 BenTheElder

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar May 16 '19 01:05 fejta-bot

/remove-lifecycle stale

BenTheElder avatar May 16 '19 01:05 BenTheElder

This came up again in #677 and again today in another deployment.

/assign

BenTheElder avatar Jul 02 '19 18:07 BenTheElder

See this comment about possible inotify watch limits on the host and a workaround: https://github.com/kubernetes-sigs/kind/issues/717#issuecomment-513070836

This issue may also apply to other Linux hosts (non-Kubernetes).
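
As a rough sketch of that workaround (values are illustrative; see the linked comment for details), raising the host's inotify limits looks like:

# run on the host (or via a privileged DaemonSet); values are illustrative
sysctl -w fs.inotify.max_user_watches=524288
sysctl -w fs.inotify.max_user_instances=512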

BenTheElder avatar Jul 19 '19 21:07 BenTheElder

For future reference, here's a working pod spec for running kind in a pod: (Add your own image) (cc @BenTheElder - is this a sane pod spec for kind?)

That being said, there should also be documentation for:

  • why kind needs the volume mounts and what impact they have on the underlying node infrastructure
  • what happens when the pod is terminated before deleting the cluster (in the context of https://github.com/kubernetes-sigs/kind/issues/658#issuecomment-505704699)
  • configuring garbage collection for unused images to avoid node disk pressure (https://github.com/kubernetes-sigs/kind/pull/663)
  • anything else?
apiVersion: v1
kind: Pod
metadata:
  name: dind-k8s
spec:
  containers:
    - name: dind
      image: <image>
      securityContext:
        privileged: true
      volumeMounts:
        - mountPath: /lib/modules
          name: modules
          readOnly: true
        - mountPath: /sys/fs/cgroup
          name: cgroup
        - name: dind-storage
          mountPath: /var/lib/docker
  volumes:
  - name: modules
    hostPath:
      path: /lib/modules
      type: Directory
  - name: cgroup
    hostPath:
      path: /sys/fs/cgroup
      type: Directory
  - name: dind-storage
    emptyDir: {}
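
For reference, one hedged way to exercise this pod once it is running (assuming the image has docker and kind installed and the docker daemon started; the manifest filename is illustrative):

kubectl apply -f dind-k8s.yaml
kubectl exec -it dind-k8s -- kind create cluster
# ... run tests against the inner cluster ...
kubectl exec -it dind-k8s -- kind delete cluster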

radu-matei avatar Aug 06 '19 09:08 radu-matei

Make sure you do kind delete cluster! See https://github.com/kubernetes-sigs/kind/issues/759

howardjohn avatar Aug 08 '19 18:08 howardjohn

That's pretty sane. As @howardjohn notes, please make sure you clean up the top level containers in that pod (i.e. kind delete cluster in an exit trap or similar). DNS may also give you issues.

why kind needs the volume mounts and what impact they have on the underlying node infrastructure

  • /lib/modules is not strictly necessary, but a number of things want to probe these contents, and it's harmless to mount them. For clarity I would make this mount read-only. No impact.
  • cgroups are mounted because cgroups v1 containers don't exactly nest. If we were just doing docker in docker we wouldn't need this.

what happens when the pod is terminated before deleting the cluster (in the context of #658 (comment))

It depends on your setup; with these mounts, IIRC, the processes / containers can leak. Don't do this. Have an exit handler; deleting the containers should happen within the grace period.
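
A minimal sketch of such an exit handler in a CI entrypoint script (assumes bash and kind on PATH; this is not the exact script used in kind's own CI):

#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

cleanup() {
  # best-effort: delete the kind cluster so nested containers and cgroups are cleaned up
  kind delete cluster || true
}
# run cleanup on normal exit and on SIGTERM (pod termination)
trap cleanup EXIT TERM

kind create cluster
# ... run tests against the cluster ...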

configuring garbage collection for unused images to avoid node disk pressure (#663)

You shouldn't need this in CI, kind clusters should be ephemeral. Please, please use them ephemerally. There are a number of ways kind is not optimized for production long lived clusters. For temporary clusters used during a test this is a non-issue.

Also note that turning on disk eviction risks your pods being evicted based on the disk usage of the host. There's a reason this is off by default. Eventually we will ship an alternative to make long lived clusters better, but for now it's best to not depend on long lived clusters or image GC.

anything else?

DNS (see above). Your outer cluster's in-cluster DNS servers are typically on a clusterIP which won't necessarily be visible to the containers in the inner cluster. Ideally configure the "host machine" Pod's DNS to your preferred upstream DNS provider (see above).

BenTheElder avatar Aug 14 '19 19:08 BenTheElder

@BenTheElder thank you for pointing me to this issue - I am trying to see how we would fit @radu-matei's example into the testing automation we are introducing for our kubernetes project. Right now we want to trigger the creation of the cluster and the commands within that cluster from within a pod. I've tried creating a container that has docker and kind installed.

I've tried creating a pod with the instructions provided above, but I still can't seem to run the kind create cluster command provided - I get the error:

root@k8s-builder-7b5cc87566-fnz5b:/work# kind create cluster
Error: could not list clusters: failed to list nodes: exit status 1

For testing I am currently creating the container, running kubectl exec into it and running kind create cluster.

The current pod specification I have is the following:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: k8s-builder
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: k8s-101
    spec:
      containers:
      - name: k8s-docker-builder
        image: seldonio/core-builder:0.4
        imagePullPolicy: Always
        command: 
        - tail 
        args:
        - -f 
        - /dev/null
        volumeMounts:
        - mountPath: /lib/modules
          name: modules
          readOnly: true
        - mountPath: /sys/fs/cgroup
          name: cgroup
        - name: dind-storage
          mountPath: /var/lib/docker
        securityContext:
          privileged: true
      volumes:
      - name: modules
        hostPath:
          path: /lib/modules
          type: Directory
      - name: cgroup
        hostPath:
          path: /sys/fs/cgroup
          type: Directory
      - name: dind-storage
        emptyDir: {}

For explicitness, the way that I am installing Kind in the Dockerfile is as follows:

# Installing KIND
RUN wget https://github.com/kubernetes-sigs/kind/releases/download/v0.5.1/kind-linux-amd64 && \
    chmod +x kind-linux-amd64 && \
    mv ./kind-linux-amd64 /bin/kind

For explicitness, the way that I am installing Kubectl in the Dockerfile is as follows:

# Installing Kubectl
RUN wget https://storage.googleapis.com/kubernetes-release/release/v1.16.2/bin/linux/amd64/kubectl && \
    chmod +x ./kubectl && \
    mv ./kubectl /bin

For explicitness, the way that I am installing Docker in the Dockerfile is as follows:

# install docker
RUN \
    apt-get update && \
    apt-get install -y \
         apt-transport-https \
         ca-certificates \
         curl \
         gnupg2 \
         software-properties-common && \
    curl -fsSL https://download.docker.com/linux/$(. /etc/os-release; echo "$ID")/gpg | apt-key add - && \
    add-apt-repository \
       "deb [arch=amd64] https://download.docker.com/linux/$(. /etc/os-release; echo "$ID") \
       $(lsb_release -cs) \
       stable" && \
    apt-get update && \
    apt-get install -y docker-ce

What should I take into account to make sure this works?

axsaucedo avatar Oct 22 '19 19:10 axsaucedo

@axsaucedo can you verify that you started docker successfully? Failing to list clusters means docker ps does not work.

BenTheElder avatar Oct 22 '19 19:10 BenTheElder

That is correct; currently I am getting the usual Cannot connect to the Docker daemon at unix:///var/run/docker.sock. What would be the way to make it work in the pod without mounting the Node's socket? Is there a cleaner/better way to do this?

axsaucedo avatar Oct 22 '19 19:10 axsaucedo

@BenTheElder I was able to successfully create a kind cluster by starting an internal docker service inside of the pod, which is a fantastic step forward, but I am not sure whether this is the intended use. I did have a look at the response you made in #997, where you pointed to wrapper.sh, which actually does start the service itself, so I assume that is the correct/expected usage?

For sake of explicitness here is the comment you provided in #997 (very useful): https://github.com/kubernetes-sigs/kind/issues/997#issuecomment-545102002

axsaucedo avatar Oct 22 '19 19:10 axsaucedo

Yes -- you need to start docker. For our CI image we handle this in the entrypoint.
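
A rough sketch of what such an entrypoint can look like (an illustrative sketch, not the actual CI image entrypoint; see the krte wrapper linked below for the real thing):

#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# start the docker daemon in the background and capture its logs
dockerd > /var/log/dockerd.log 2>&1 &

# wait until the daemon answers API calls before proceeding
until docker info > /dev/null 2>&1; do
  sleep 1
done

# hand off to the actual test command; kind can now talk to docker
exec "$@"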

BenTheElder avatar Oct 23 '19 17:10 BenTheElder

Note that we appeared to experience a leak in the Istio CI; it is important that you ensure that on exit all containers are deleted. kind delete cluster should be sufficient, but we also recommend force-removing all docker containers.

This WIP image is what kind's own Kubernetes-based CI will be using. https://github.com/kubernetes/test-infra/tree/master/images/krte

Note this part https://github.com/kubernetes/test-infra/blob/4696b77f4ee7cfffe8e86a8b8e84c797d6846bfd/images/krte/wrapper.sh#L125
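
The force-removal step is roughly the following (a sketch, not the exact wrapper code; kind nodes are plain docker containers, so this catches anything kind delete cluster missed):

# best-effort: remove any containers that are still around
docker ps -aq | xargs -r docker rm -f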

BenTheElder avatar Oct 25 '19 16:10 BenTheElder

Thank you very much @BenTheElder - we have managed to successfully run our e2e tests using the krte as base example, really appreciate the guidance (currently sitting at https://github.com/SeldonIO/seldon-core/pull/994). It's pretty mind blowing that it's now possible to run kubernetes in kubernetes to test kubernetes components in containerised kubernetes 🤯

axsaucedo avatar Oct 27 '19 15:10 axsaucedo

(quoting @radu-matei's pod spec from above)

Using this pod configuration I am still getting an error from containerd:

Dec 11 04:31:22 kind-control-plane containerd[48]: time="2019-12-11T04:31:22.617994096Z" level=error msg="copy shim log" error="reading from a closed fifo"
Dec 11 04:31:22 kind-control-plane containerd[48]: time="2019-12-11T04:31:22.626806247Z" level=error msg="copy shim log" error="reading from a closed fifo"
Dec 11 04:31:22 kind-control-plane containerd[48]: time="2019-12-11T04:31:22.643253105Z" level=error msg="copy shim log" error="reading from a closed fifo"
Dec 11 04:31:22 kind-control-plane containerd[48]: time="2019-12-11T04:31:22.644244344Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:kube-controller-manager-kind-control-plane,Uid:7a42efc8ddc98f327b58e75d0d6078b7,Namespace:kube-system,Attempt:0,} failed, error" error="failed to create containerd task: io.containerd.runc.v1: failed to adjust OOM score for shim: set shim OOM score: write /proc/368/oom_score_adj: invalid argument\n: exit status 1: unknown"
Dec 11 04:31:22 kind-control-plane containerd[48]: time="2019-12-11T04:31:22.645505766Z" level=error msg="copy shim log" error="reading from a closed fifo"
Dec 11 04:31:22 kind-control-plane containerd[48]: time="2019-12-11T04:31:22.646301955Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:etcd-kind-control-plane,Uid:051f0a138da15840d511b8f1d90c5bbf,Namespace:kube-system,Attempt:0,} failed, error" error="failed to create containerd task: io.containerd.runc.v1: failed to adjust OOM score for shim: set shim OOM score: write /proc/345/oom_score_adj: invalid argument\n: exit status 1: unknown"

This is the main issue; it is also reported by the kubelet while trying to start kube-apiserver.

Kind version: master

Local docker version:

Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.40
 Go version:        go1.12.12
 Git commit:        633a0ea
 Built:             Wed Nov 13 07:22:34 2019
 OS/Arch:           darwin/amd64
 Experimental:      true

Any help greatly appreciated.

danielfbm avatar Dec 11 '19 04:12 danielfbm

Hello,

I read the thread and applied all your recommendations. I added the cluster deletion to my kind docker image.

Just to be sure, and to add another protection (for an eventual kill -9), do you have a /sys/fs/cgroup cleanup script? Not sure if it's technically possible, I guess not, but maybe I'm missing something.

quentin9696 avatar Feb 25 '20 20:02 quentin9696

We do not; we do not expect a kill -9 to occur ... we certainly do not do so manually, and that's not how Kubernetes terminates pods normally AFAIK.

If it did, though, our host machines are regularly replaced (k8s upgrades and node auto-repair in GKE involve replacing the node VMs entirely).

BenTheElder avatar Feb 25 '20 20:02 BenTheElder

However, this problem is one of the reasons I do NOT recommend doing this if you can avoid it. On a host like, say, CircleCI, Google Cloud Build, Travis, ... you will not have this problem, as the underlying VM only exists for the test invocation.

IF your CI infrastructure must be Kubernetes based (instead of just your app infra), privileged containers and kind can let you run Kubernetes end-to-end tests, but it is not without issues.

BenTheElder avatar Feb 25 '20 20:02 BenTheElder

For some reason, docker can send a SIGKILL after the grace period.

I run kind on a CI (Jenkins) that runs on GKE. It's not a big issue to lose Jenkins while we wait for the new pod on another worker.

Thanks for your reply

quentin9696 avatar Feb 25 '20 21:02 quentin9696

To be clearer: we will have cleaned up before the grace period normally. We trap SIGTERM with cleanup.


BenTheElder avatar Feb 26 '20 05:02 BenTheElder

We moved our cgroup mount to read only a few months back and haven't had any issues. It removed any risk of things not cleaning up properly (I think? We still clean up and now our nodes restart often for other reasons, so maybe it doesn't and I just don't notice)
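
For reference, the read-only variant is just the same hostPath mount with readOnly set (a sketch; whether this works for your workloads depends on your setup, as discussed above):

        - mountPath: /sys/fs/cgroup
          name: cgroup
          readOnly: true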

howardjohn avatar Feb 26 '20 16:02 howardjohn

An additional note for why this can be problematic: /dev/loop* devices are NOT namespaced / are shared with the host. This is a problem if you're trying to do blockfs testing (like we do in kubernetes). AIUI minikube does not support block devices at all but, if for some reason you're trying to test block devices with local clusters, you're going to need to work around this by preallocating sufficient block devices.

https://github.com/kubernetes/test-infra/blob/dfe2d0f383c8f6df6cc2e53ca253d048e18dcfe2/prow/cluster/create-loop-devs_daemonset.yaml
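
The linked DaemonSet essentially pre-creates loop device nodes on each host; a rough sketch of the idea (not the exact script):

#!/usr/bin/env bash
# pre-create a fixed number of loop devices on the host
# loop devices use block major number 7; the minor number selects the device
for i in $(seq 0 19); do
  if [ ! -e "/dev/loop$i" ]; then
    mknod -m 0660 "/dev/loop$i" b 7 "$i"
  fi
done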

BenTheElder avatar Mar 15 '20 20:03 BenTheElder

FYI, wrote a blog post about this to share our experiences: https://d2iq.com/blog/running-kind-inside-a-kubernetes-cluster-for-continuous-integration

jieyu avatar May 13 '20 20:05 jieyu

Could the Dockerfile for jieyu/kind-cluster-buster:v0.1.0 be open sourced by any chance?

deiwin avatar May 14 '20 06:05 deiwin

@deiwin it’s open sourced. There’s a link in the blog post to the repo. https://github.com/jieyu/docker-images

jieyu avatar May 14 '20 15:05 jieyu

Oh, didn't notice that. Thank you!

deiwin avatar May 14 '20 16:05 deiwin

@jieyu thanks for spending the time to refactor into an OSS repo and write that comprehensible blog post! We also have a production cluster running with KIND in kubernetes, but we'll be looking to refactor it using some of your techniques. I have a question: how come none of your scripts actually delete the KIND cluster? In previous posts/implementations that Ben has covered, one of the main points emphasized is to ensure the KIND cluster is deleted, otherwise there may be dangling resources. In our implementation we do remove the KIND cluster, and then we run service docker stop; however, it sometimes hangs, and we were thinking of just running the KIND delete without the service docker stop, hence why I am also curious about your implementation. Thanks again!

axsaucedo avatar Jul 06 '20 18:07 axsaucedo

@axsaucedo I believe that one of the main reasons previous implementations require deleting the KIND cluster is to make sure cgroups are cleaned up (thus not leaked) on the host cgroup filesystem. The way we solved it is to place the docker daemon's root cgroup nested underneath the corresponding pod cgroup (i.e., https://github.com/jieyu/docker-images/blob/master/dind/entrypoint.sh#L64). Thus, when the pod (with KIND running inside it) is terminated by Kubernetes, all the associated resources are cleaned up, including the cgroups used by the KIND cluster.
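
The core of that approach, as a simplified sketch of the linked entrypoint (assuming a cgroups v1 host; the real script does more), is to discover the pod's own cgroup and start dockerd with it as the cgroup parent:

#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# find the cgroup this container was placed in by the kubelet,
# e.g. /kubepods/burstable/pod<uid>/<container-id>
POD_CGROUP="$(grep -E '^[0-9]+:memory:' /proc/self/cgroup | cut -d: -f3)"

# nest docker's container cgroups under the pod cgroup so that everything
# kind creates is torn down when the pod is deleted
dockerd --cgroup-parent="${POD_CGROUP}/docker" > /var/log/dockerd.log 2>&1 &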

There might be other "shared" global kernel resources used by KIND that are not properly "nested" under the pod (e.g., devices), which means that they might get leaked if the KIND cluster is not cleaned up properly in the pod. However, we don't have such workloads in our CI, thus no need to worry about those in our case.

jieyu avatar Jul 06 '20 18:07 jieyu

Right, I see @jieyu, that makes sense. OK, fair enough, that's quite a solid approach. We'll be looking to refactor our implementation using the approach you outlined in the blog post as a base. Thanks again for this!

axsaucedo avatar Jul 06 '20 18:07 axsaucedo

There might be other "shared" global kernel resources used by KIND that are not properly "nested" under the pod (e.g., devices), which means that they might get leaked if the KIND cluster is not cleaned up properly in the pod. However, we don't have such workloads in our CI, thus no need to worry about those in our case.

Right, in Kubernetes's CI (and others) this is not the case. I still strongly recommend at least best-effort attempting to shut things down gracefully. I also strongly suggest reconsidering trying to run Kubernetes inside Kubernetes for anything serious.

BenTheElder avatar Jul 08 '20 06:07 BenTheElder