[BUG] Pod network failing to start when installing calico operator with k3d v5.2.1
What did you do
- How was the cluster created?
  - k3d cluster create "k3s-default" --k3s-arg '--flannel-backend=none@server:*'
- What did you do afterwards?
  - I tried to install the Calico/Tigera operator onto the cluster with containerIPForwarding enabled:
    kubectl apply -f https://docs.projectcalico.org/manifests/tigera-operator.yaml
    curl -L https://docs.projectcalico.org/manifests/custom-resources.yaml > k3d-custom-res.yaml
    yq e '.spec.calicoNetwork.containerIPForwarding="Enabled"' -i k3d-custom-res.yaml
    kubectl apply -f k3d-custom-res.yaml
  - k3d commands? -
  - docker commands? docker ps to check running containers, docker exec -ti <node> /bin/sh to ssh into a container
  - OS operations (e.g. shutdown/reboot)? Ran Linux system commands (ls, cat, etc.) inside pods and containers
What did you expect to happen
The pod network should come up successfully in all namespaces, with all pods in the Running state.
Screenshots or terminal output
The calico-node pods are able to run without issue, but the other pods are stuck in the ContainerCreating state (coredns, metrics-server, calico-kube-controllers):
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
tigera-operator tigera-operator-7dc6bc5777-h5sp7 1/1 Running 0 106s
calico-system calico-typha-9b59bcc69-w2ml8 1/1 Running 0 83s
calico-system calico-kube-controllers-78cc777977-8xf5v 0/1 ContainerCreating 0 83s
kube-system coredns-7448499f4d-8pwtf 0/1 ContainerCreating 0 106s
kube-system metrics-server-86cbb8457f-h26x4 0/1 ContainerCreating 0 106s
kube-system helm-install-traefik-h6qhh 0/1 ContainerCreating 0 106s
kube-system helm-install-traefik-crd-8xsxm 0/1 ContainerCreating 0 106s
kube-system local-path-provisioner-5ff76fc89d-ql55s 0/1 ContainerCreating 0 106s
calico-system calico-node-6xbq7 1/1 Running 0 83s
When describing the stuck pods, I see this in their events:
$ kubectl describe pod/calico-kube-controllers-78cc777977-8xf5v -n calico-system
Warning FailedCreatePodSandBox 3s kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b474a530f7b8727fc101404ebb551135059f5aa359beb50bae176fd05cf2c20d": netplugin failed with no error message: fork/exec /opt/cni/bin/calico: no such file or directory
Based on the error above, I went to check /opt/cni/bin/calico
to see if the calico binary existed in the container, which it does:
glen@glen-tigera: $ docker exec -ti k3d-k3s-default-server-0 /bin/sh
/ # ls
bin dev etc k3d lib opt output proc run sbin sys tmp usr var
/ # cd /opt/cni/bin/
/opt/cni/bin # ls -a
. .. bandwidth **calico** calico-ipam flannel host-local install loopback portmap tags.txt tuning
CNI config YAML:
kubectl get cm cni-config -n calico-system -o yaml
apiVersion: v1
data:
  config: |-
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "calico",
          "datastore_type": "kubernetes",
          "mtu": 0,
          "nodename_file_optional": false,
          "log_level": "Info",
          "log_file_path": "/var/log/calico/cni/cni.log",
          "ipam": { "type": "calico-ipam", "assign_ipv4": "true", "assign_ipv6": "false"},
          "container_settings": {
            "allow_ip_forwarding": true
          },
          "policy": {
            "type": "k8s"
          },
          "kubernetes": {
            "k8s_api_root": "https://10.43.0.1:443",
            "kubeconfig": "__KUBECONFIG_FILEPATH__"
          }
        },
        {
          "type": "bandwidth",
          "capabilities": {"bandwidth": true}
        },
        {"type": "portmap", "snat": true, "capabilities": {"portMappings": true}}
      ]
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2021-12-17T18:02:24Z"
  name: cni-config
  namespace: calico-system
  ownerReferences:
  - apiVersion: operator.tigera.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Installation
    name: default
    uid: c53d18b5-efc6-4155-879b-6097a8c2c14c
  resourceVersion: "675"
  uid: 003c9cdc-0ef5-4d63-8d30-d6e1ed79d4c0
Which OS & Architecture
OS: GNU/Linux
Kernel Version: 20.04.2-Ubuntu SMP
Kernel Release: 5.11.0-40-generic
Processor/HW Platform/Machine Architecture: x86_64
Which version of k3d
k3d version v5.2.1
k3s version v1.21.7-k3s1 (default)
Which version of docker
docker version:
Client: Docker Engine - Community
Version: 20.10.11
API version: 1.41
Go version: go1.16.9
Git commit: dea9396
Built: Thu Nov 18 00:37:06 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.11
API version: 1.41 (minimum version 1.12)
Go version: go1.16.9
Git commit: 847da18
Built: Thu Nov 18 00:35:15 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.12
GitCommit: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
runc:
Version: 1.0.2
GitCommit: v1.0.2-0-g52b36a2
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info:
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Build with BuildKit (Docker Inc., v0.6.3-docker)
scan: Docker Scan (Docker Inc., v0.9.0)
Server:
Containers: 20
Running: 0
Paused: 0
Stopped: 20
Images: 22
Server Version: 20.10.11
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
runc version: v1.0.2-0-g52b36a2
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.11.0-40-generic
Operating System: Ubuntu 20.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 31.09GiB
Name: glen-tigera
ID: 6EZ7:QGFF:Z2KK:Q7K3:YKGI:6FIS:X2UP:JX5W:UGXA:FIZW:CYV6:RDDU
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
@Glen-Tigera and I work for Tigera on Calico - we're trying to get operator install working with k3d (so we can add it to our overnight test runs).
We're baffled by this. From our side, we can see the calico binaries and config being written to the node, but still kubelet is complaining that it can't find the files.
We tried the operator install with k3s, and that works fine, so I don't think it's the OS.
Just wondered if you had any tips for what to try next.
The instructions in the k3d docs seem to work fine, i.e. https://k3d.io/v5.0.0/usage/advanced/calico.yaml works.
Comparing https://k3d.io/v5.0.0/usage/advanced/calico.yaml with https://docs.projectcalico.org/archive/v3.15/manifests/calico.yaml, we see:
lance@lwr20:~/scratch$ diff k3d_calico.yaml orig_calico.yaml
37,39d36
< "container_settings": {
< "allow_ip_forwarding": true
< },
398a396,405
> allowIPIPPacketsFromWorkloads:
> description: 'AllowIPIPPacketsFromWorkloads controls whether Felix
> will add a rule to drop IPIP encapsulated traffic from workloads
> [Default: false]'
> type: boolean
> allowVXLANPacketsFromWorkloads:
> description: 'AllowVXLANPacketsFromWorkloads controls whether Felix
> will add a rule to drop VXLAN encapsulated traffic from workloads
> [Default: false]'
> type: boolean
2095c2102
< If not specified, then this is defaulted to "Never" (i.e. IPIP tunneling
---
> If not specified, then this is defaulted to "Never" (i.e. IPIP tunelling
2115c2122
< tunneling is disabled).
---
> tunelling is disabled).
3451,3452d3457
< - key: node-role.kubernetes.io/master
< effect: NoSchedule
3463c3468
< image: calico/cni:v3.15.0
---
> image: calico/cni:v3.15.5
3485c3490
< image: calico/cni:v3.15.0
---
> image: calico/cni:v3.15.5
3521c3526
< image: calico/pod2daemon-flexvol:v3.15.0
---
> image: calico/pod2daemon-flexvol:v3.15.5
3532c3537
< image: calico/node:v3.15.0
---
> image: calico/node:v3.15.5
3586,3591d3590
< # Set MTU for the Wireguard tunnel device.
< - name: FELIX_WIREGUARDMTU
< valueFrom:
< configMapKeyRef:
< name: calico-config
< key: veth_mtu
3725c3724
< image: calico/kube-controllers:v3.15.0
---
> image: calico/kube-controllers:v3.15.5
Re. the differences:
- The FELIX_WIREGUARDMTU setting looks like it's duplicated in the k3d manifest, so I'd hope that isn't the issue.
- There's an extra toleration in the k3d manifest. But we see calico-node running in the output above, so I don't think that's it?
- "allow_ip_forwarding": true is set in the k3d manifest, which we've enabled in the custom-resources by setting spec.calicoNetwork.containerIPForwarding="Enabled" (see the sketch below).
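For reference, the curl + yq steps earlier in the thread are roughly equivalent to applying an Installation like this (a sketch only; the ipPools values are the stock custom-resources defaults and may differ between Calico versions):
# Sketch -- equivalent of editing custom-resources.yaml with yq as above.
cat <<'EOF' | kubectl apply -f -
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    containerIPForwarding: Enabled
    ipPools:
    - cidr: 192.168.0.0/16
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
EOF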
Hi @Glen-Tigera & @lwr20 , thanks for moving the issue over here from Slack :+1: I just gave the setup a try myself and obviously see the same issues as you. FWIW, I checked the logs of the node and see hundreds of lines like the following:
E1220 09:59:37.232636 7 plugins.go:748] Error dynamically probing plugins: Error creating Flexvolume plugin from directory nodeagent~uds, skipping. Error: unexpected end of JSON input
E1220 09:59:37.232851 7 driver-call.go:266] Failed to unmarshal output for command: init, output: "", error: unexpected end of JSON input
W1220 09:59:37.232855 7 driver-call.go:149] FlexVolume: driver call failed: executable: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds, args: [init], error: fork/exec /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds: no such file or directory, output: ""
Awesome, thank you, that gives us a thread to pull on.
Googling for that error message, this issue in rke2 popped up: https://github.com/rancher/rke2/issues/234
From https://projectcalico.docs.tigera.io/reference/installation/api, I think this all means we need to set spec.flexVolumePath: "/usr/local/bin/" in the Installation resource in custom-resources.
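A sketch of one way to try that on a running cluster (assuming the usual cluster-scoped Installation named "default"):
# Sketch: merge-patch the operator's Installation so the flexvolume driver is
# written to a path the k3s kubelet actually scans.
kubectl patch installation default --type=merge \
  -p '{"spec":{"flexVolumePath":"/usr/local/bin/"}}'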
I can actually confirm that the file is where it belongs:
docker exec -it k3d-k3s-default-server-0 stat /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds
File: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds
Size: 4987070 Blocks: 9744 IO Block: 4096 regular file
Device: 68h/104d Inode: 31471848 Links: 1
Access: (0550/-r-xr-x---) Uid: ( 0/ UNKNOWN) Gid: ( 0/ UNKNOWN)
Access: 2021-12-20 12:09:10.946061593 +0000
Modify: 2021-12-20 12:09:10.762060830 +0000
Change: 2021-12-20 12:09:10.766060847 +0000
Birth: -
Do you have any idea what https://github.com/rancher/rke2-charts/pull/20/files actually does?
@iwilltry42 When I ran the command you posted earlier, there was no such file or directory on my setup:
$ docker exec -it k3d-k3s-default-server-0 stat /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds
stat: cannot stat '/usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds': No such file or directory
There is no nodeagent~uds directory when I try to look inside the container:
$ docker exec -ti k3d-k3s-default-server-0 ls -a /usr/libexec/kubernetes/kubelet-plugins/volume/exec
. ..
Do you have any idea what https://github.com/rancher/rke2-charts/pull/20/files actually does?
No idea, but since that's canal, I guess it's not valid for K3s, which usually runs flannel.
There is no nodeagent~uds directory when I try to look inside the container:
$ docker exec -ti k3d-k3s-default-server-0 ls -a /usr/libexec/kubernetes/kubelet-plugins/volume/exec
. ..
False alarm, this is because I specified spec.flexVolumePath: "/usr/local/bin/" in the Installation.
@lwr20 and I have looked into the /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/ directory inside the container. It contains the uds file:
/ # ls -l /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/
total 4872
-r-xr-x--- 1 0 0 4987070 Dec 20 15:25 uds
But it seems like we can't run it:
/ # /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds
/bin/sh: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds: not found
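One thing that might be worth ruling out (not verified here): a shell reporting "not found" for a file that clearly exists is often a missing dynamic loader rather than the file itself. A hypothetical check from the host:
# Copy the binary out of the node container and inspect it on the host, since
# the node image ships neither file nor readelf. An interpreter that doesn't
# exist inside the node image would explain "not found" despite the file being there.
docker cp k3d-k3s-default-server-0:/usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds /tmp/uds
file /tmp/uds
readelf -l /tmp/uds | grep -i interpreter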
Do you have any idea what rancher/rke2-charts#20 (files) actually does?
Sorry, just understood how you got there. This is the script that's being executed: https://github.com/projectcalico/calico/blob/master/pod2daemon/flexvol/docker/flexvol.sh
I'm checking the installation variants (the one from the k3d docs and yours) now with regard to the uds:
Via Operator:
/ # ls -lah /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
total 4.8M
drwxr-xr-x 2 0 0 4.0K Dec 21 06:49 .
drwxr-xr-x 3 0 0 4.0K Dec 21 06:49 ..
-r-xr-x--- 1 0 0 4.8M Dec 21 06:49 uds
/ # stat /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds
File: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds
Size: 4987070 Blocks: 9744 IO Block: 4096 regular file
Device: 37h/55d Inode: 43271409 Links: 1
Access: (0550/-r-xr-x---) Uid: ( 0/ UNKNOWN) Gid: ( 0/ UNKNOWN)
Access: 2021-12-21 06:49:35.595982143 +0000
Modify: 2021-12-21 06:49:35.531982019 +0000
Change: 2021-12-21 06:49:35.531982019 +0000
Birth: -
/ # /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds
sh: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds: not found
Without Operator:
/ # ls -lah /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
total 5.4M
drwxr-xr-x 2 0 0 4.0K Dec 21 06:52 .
drwxr-xr-x 3 0 0 4.0K Dec 21 06:52 ..
-r-xr-x--- 1 0 0 5.4M Dec 21 06:52 uds
/ # stat /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds
File: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds
Size: 5602363 Blocks: 10944 IO Block: 4096 regular file
Device: 37h/55d Inode: 42735669 Links: 1
Access: (0550/-r-xr-x---) Uid: ( 0/ UNKNOWN) Gid: ( 0/ UNKNOWN)
Access: 2021-12-21 06:52:46.752353250 +0000
Modify: 2021-12-21 06:52:46.092351969 +0000
Change: 2021-12-21 06:52:46.100351984 +0000
Birth: -
/ # /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds
Usage:
flexvoldrv [command]
Available Commands:
help Help about any command
init Flex volume init command.
mount Flex volume unmount command.
unmount Flex volume unmount command.
version Print version
Flags:
-h, --help help for flexvoldrv
Use "flexvoldrv [command] --help" for more information about a command.
Regarding the differences between the deployed manifests: the DaemonSet is handled by the operator, which rewrites the image tags from v3.15.5 to v3.21.2.
I see that quite a few things changed there, also around the flexvol part, especially since pod2daemon was included in the monorepo :thinking:
I tried to use the ImageSet to get back to v3.15.0 for testing with the operator, but then the expected path of e.g. the install-cni script is wrong :thinking:
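For anyone following along, an ImageSet is shaped roughly like this (a sketch only -- the digests are placeholders, every image the operator deploys needs an entry, and as noted above this approach didn't lead to a working v3.15 anyway):
# Sketch of the operator.tigera.io/v1 ImageSet shape; digests are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: operator.tigera.io/v1
kind: ImageSet
metadata:
  name: calico-v3.15.0
spec:
  images:
  - image: calico/node
    digest: sha256:<digest of the v3.15.0 image>
  - image: calico/cni
    digest: sha256:<digest of the v3.15.0 image>
EOF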
Ah - the way that the tigera-operator works, there's a version of operator that maps to a version of calico (since the manifests are baked into it). For v3.15, you'll want to apply: https://docs.projectcalico.org/archive/v3.15/manifests/tigera-operator.yaml
(the intent is to make the upgrade experience better - in an operator managed cluster, you upgrade calico by simply applying the uplevel tigera-operator.yaml and it takes care of everything). In the old manifest install, you'd have customised your install in various ways directly in the yaml, so to upgrade you have to get the new yaml, then make the same edits as you did before, then apply and hope you did it right. Whereas in an operator setup, you have configured all your customisations in the Installation resource. The new operator reads that and does "the right thing" to apply those customisations.
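So in an operator-managed cluster the upgrade step really is just re-applying the uplevel operator manifest, e.g.:
# Upgrade sketch: apply the newer operator manifest; the existing Installation
# (with your customisations) stays in place and the operator reconciles the rest.
kubectl apply -f https://docs.projectcalico.org/manifests/tigera-operator.yaml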
I tried installing an older version of the operator and CRDs (v3.15) on the k3d cluster. That was working, so it's possible the issue could be on our side.
k3d cluster create "k3d-test-cluster-3-15" --k3s-arg "--flannel-backend=none@server:*" --k3s-arg "--no-deploy=traefik@server:*"
kubectl apply -f https://docs.projectcalico.org/archive/v3.15/manifests/tigera-operator.yaml
kubectl apply -f https://docs.projectcalico.org/archive/v3.15/manifests/custom-resources.yaml
$ kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
tigera-operator tigera-operator-5bf967b87f-8528g 1/1 Running 0 4m 192.168.160.3 k3d-test-cluster-3-15-server-0 <none> <none>
calico-system calico-typha-fb8798b8f-kmpqf 1/1 Running 0 3m44s 192.168.160.2 k3d-test-cluster-3-15-agent-0 <none> <none>
calico-system calico-kube-controllers-c8496f5c-67ljz 1/1 Running 0 3m44s 192.168.48.65 k3d-test-cluster-3-15-server-0 <none> <none>
kube-system local-path-provisioner-5ff76fc89d-rfsrf 1/1 Running 0 4m7s 192.168.48.66 k3d-test-cluster-3-15-server-0 <none> <none>
kube-system coredns-7448499f4d-9zfh6 1/1 Running 0 4m7s 192.168.48.67 k3d-test-cluster-3-15-server-0 <none> <none>
kube-system metrics-server-86cbb8457f-j9rs4 1/1 Running 0 4m7s 192.168.48.68 k3d-test-cluster-3-15-server-0 <none> <none>
calico-system calico-node-rg6tv 1/1 Running 0 3m44s 192.168.160.2 k3d-test-cluster-3-15-agent-0 <none> <none>
calico-system calico-node-66fzx 1/1 Running 0 3m44s 192.168.160.3 k3d-test-cluster-3-15-server-0 <none> <none>
calico-system calico-typha-fb8798b8f-pb2wj 1/1 Running 0 105s 192.168.160.3 k3d-test-cluster-3-15-server-0 <none> <none>
Upon further testing, our v3.21 (latest release) operator install seems to no longer be compatible with k3d clusters. I tested the operator starting from v3.15 and every version worked until v3.21. I've followed up with the larger team to discuss further.
Ah at least you could track it down to a specific version already 👍 Fingers crossed you'll figure out the root cause.
We ran into this with K3s as well. The same exact issue using the operator.
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:16:20Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.5+k3s1", GitCommit:"405bf79da97831749733ad99842da638c8ee4802", GitTreeState:"clean", BuildDate:"2021-12-18T00:43:30Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
kube-system svclb-traefik-7wk8q 0/2 CrashLoopBackOff 10 (59s ago) 4m
[root@k3s-master ~]# kubectl logs svclb-traefik-7wk8q
Error from server (NotFound): pods "svclb-traefik-7wk8q" not found
[root@k3s-master ~]# kubectl logs -n kube-system svclb-traefik-7wk8q
error: a container name must be specified for pod svclb-traefik-7wk8q, choose one of: [lb-port-80 lb-port-443]
[root@k3s-master ~]# kubectl logs -n kube-system svclb-traefik-7wk8q -c lb-port-80
+ trap exit TERM INT
+ echo 10.43.233.214
+ grep -Eq :
+ cat /proc/sys/net/ipv4/ip_forward
+ '[' 0 '!=' 1 ]
+ exit 1
With the operator, we found no way to set the
"container_settings": { "allow_ip_forwarding": true }
setting. We changed it via vi /etc/cni/net.d/10-calico.conflist and changed it in the cni-config CM, but the value kept getting changed back, we assume by the operator.
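For what it's worth, the operator-level knob that corresponds to allow_ip_forwarding seems to be the containerIPForwarding field used earlier in this thread, set on the Installation resource rather than on the generated conflist, e.g. (sketch):
# Sketch: set the forwarding option where the operator reads it, so it isn't
# reverted on the next reconcile.
kubectl patch installation default --type=merge \
  -p '{"spec":{"calicoNetwork":{"containerIPForwarding":"Enabled"}}}'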
Henlo friends.
TLDR: I fixed my problem by pulling down fresh calico and calico-ipam executables, as per the Calico CNI install docs.
Ok, so I've been trialling k3os as a platform replacement, which led me here as I was working with Calico; my findings may be of assistance.
I'd disabled flannel (and Traefik and servicelb) when k3os installed. Next I added Calico using the operator as per the Calico instructions. EXACTLY the same problem as above: the /opt/cni/bin/calico was there, but no one was happy about it.
So I pulled the two executables listed in the basic Calico install instructions* here, and tada: up came the calico-system controller and moments later the two Calico API servers!
Look for the binary paths to download under the block of text 'Install the CNI plugin Binaries'.
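A rough sketch of what that binary pull looks like for a k3d node (the release URL and version below are assumptions -- take the actual paths from the 'Install the CNI plugin Binaries' section of the Calico docs, and repeat for every node):
# Download the CNI binaries on the host and copy them into the node container;
# the URLs/version here are placeholders following the Calico CNI install docs pattern.
curl -L -o calico https://github.com/projectcalico/cni-plugin/releases/download/v3.20.0/calico-amd64
curl -L -o calico-ipam https://github.com/projectcalico/cni-plugin/releases/download/v3.20.0/calico-ipam-amd64
chmod 755 calico calico-ipam
docker cp calico k3d-k3s-default-server-0:/opt/cni/bin/calico
docker cp calico-ipam k3d-k3s-default-server-0:/opt/cni/bin/calico-ipam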
I tried a bunch of different configurations after I solved the problem.
- Originally I had added the containerIPForwarding: Enabled fellow. Turns out I could remove this, and adding the operator and CRs seemed ok (note: when I say ok, the containers are running; that's as far as I've gone).
- calicoNetwork.ipPools.cidr can overlap or not. It doesn't seem to impact the pods coming up.
I hope this is of help to all or some of you, as your work above really helped me get through.