talosctl: connection errors when waiting for API while provisioning a local qemu cluster
Bug Report
Description
When I try to provision a local QEMU cluster on Fedora 35, I get network connection errors while talosctl waits for the API. If I disable the firewall with sudo systemctl stop firewalld and then restart docker with sudo systemctl restart docker, the problem disappears.
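For reference, the workaround as a copy-pasteable sketch (stopping the firewall is a blunt instrument rather than a fix):

sudo systemctl stop firewalld
sudo systemctl restart docker   # docker rebuilds its iptables chains on restart

With those two commands the provisioning succeeds; the logs below show the failing case with firewalld active.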
Logs
> sudo -E _out/talosctl-linux-amd64 cluster create --provisioner=qemu --cidr=172.20.0.0/24 --registry-mirror docker.io=http://172.20.0.1:5000 --registry-mirror k8s.gcr.io=http://172.20.0.1:5001 --registry-mirror quay.io=http://172.20.0.1:5002 --registry-mirror gcr.io=http://172.20.0.1:5003 --registry-mirror ghcr.io=http://172.20.0.1:5004 --registry-mirror 127.0.0.1:5005=http://172.20.0.1:5005 --install-image=127.0.0.1:5005/siderolabs/installer:v1.1.0-alpha.1-54-g129f3e6e2-dirty --masters 1 --workers 0 --with-bootloader=true --wait --config-patch '[{"op":"add","path":"/machine/install/extensions","value":[{"image":"127.0.0.1:5005/siderolabs/hello-world-service:v1.0.0-5-g3ccc1b5-dirty"}]}]'
validating CIDR and reserving IPs
generating PKI and tokens
creating state directory in "/root/.talos/clusters/talos-default"
creating network talos-default
creating load balancer
creating dhcpd
creating master nodes
creating worker nodes
renamed talosconfig context "talos-default" -> "talos-default-23"
waiting for API
bootstrap error: 4 error(s) occurred:
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.20.0.2:50000: connect: connection refused"
rpc error: code = DeadlineExceeded desc = context deadline exceeded
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.20.0.2:50000: connect: no route to host"
timeout
Environment
- Talos Installer: v1.1.0-alpha.1-54
- Platform: Fedora release 35 (Thirty Five)
Do you have any networks claiming the same CIDR? ip addr show
I would also check docker network ls
Both of those look good to me.
I see this in journalctl -x -e while provisioning the cluster:
May 13 11:37:22 robert kernel: IN_public_REJECT: IN=talos3d3a9d82 OUT= MAC= SRC=172.20.0.1 DST=172.20.0.255 LEN=542 TOS=0x00 PREC=0x00 TTL=64 ID=28060 DF PROTO=UDP SPT=44842 DPT=21027 LEN=522
:shrug: talosctl is using standard CNI plugins to set up the bridge and networking...
this packet looks to be a DHCP packet, probably? It's still local to the node (or at least it should be)
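The IN_public_REJECT prefix in that journal line suggests firewalld is treating the bridge interface as part of the public zone. A quick way to confirm (a diagnostic sketch; talos3d3a9d82 is the interface name from the log above and will differ per cluster):

sudo firewall-cmd --get-default-zone
sudo firewall-cmd --get-active-zones
sudo firewall-cmd --get-zone-of-interface=talos3d3a9d82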
When I run sudo -E ./_out/talosctl-linux-amd64 cluster destroy --provisioner=qemu
I see this in the systemd logs of firewalld:
firewalld[902]: ERROR: UNKNOWN_SOURCE: '172.20.0.2/32' is not in any zone
I got one small step further. Previously the log of the master node would repeat this forever:
[ 72.414966] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on 8.8.8.8:53: dial udp 8.8.8.8:53: connect: network is unreachable"}
When I do this, it gets past that:
sudo firewall-cmd --permanent --zone=trusted --add-interface=talos3d3a9d82
systemctl restart firewalld
Now it repeats this a lot:
[ 325.284751] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on 1.1.1.1:53: read udp 172.20.0.2:42657->1.1.1.1:53: i/o timeout"}
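Speculating a bit: the change from "network is unreachable" to an i/o timeout suggests the VM can now reach the host, but forwarded traffic from the bridge subnet out to the internet is still being dropped. A few host-side checks, purely a sketch and not a confirmed fix (replace public with whatever firewall-cmd --get-default-zone reports):

sysctl net.ipv4.ip_forward                        # must be 1 for VM traffic to be routed out
sudo iptables -t nat -S | grep 172.20.0           # is the CNI masquerade rule for the bridge subnet present?
sudo firewall-cmd --zone=public --add-masquerade  # speculative: enable masquerading on the zone that handles outbound traffic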
I am having a similar issue: talosctl never gets past waiting for API with QEMU. The QEMU nodes refuse to communicate with each other; 10.5.0.1 pings perfectly fine, but 10.5.0.2-5 don't.
Docker initially didn't work either, but after disabling firewalld and restarting docker as suggested here, it worked.
journalctl didn't really give any clear errors.
Running Alma Linux.
I have the same issue on openSUSE Tumbleweed. Starting the cluster with QEMU quits with an error message:
» sudo -E talosctl cluster create --provisioner=qemu --with-uefi=false
validating CIDR and reserving IPs
generating PKI and tokens
creating state directory in "/root/.talos/clusters/talos-default"
creating network talos-default
creating load balancer
creating dhcpd
creating master nodes
creating worker nodes
renamed talosconfig context "talos-default" -> "talos-default-4"
waiting for API
bootstrap error: 3 error(s) occurred:
rpc error: code = DeadlineExceeded desc = context deadline exceeded
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.5.0.2:50000: connect: no route to host"
timeout
Checking the network with nmap shows only 10.5.0.1 being reachable:
» nmap -sn 10.5.0.1/24
Starting Nmap 7.92 ( https://nmap.org ) at 2022-06-20 22:24 CEST
Nmap scan report for 10.5.0.1
Host is up (0.00018s latency).
Nmap done: 256 IP addresses (1 host up) scanned in 3.10 seconds
While the logs of both the master and the worker node show this error:
[ 3.610283] [talos] fetching machine config from: "http://10.5.0.1:45685/config.yaml"
[ 3.610953] [talos] retrying error: Get "http://10.5.0.1:45685/config.yaml": dial tcp 10.5.0.1:45685: connect: network is unreachable
[ 8.264910] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on [::1]:53: read udp [::1]:45773->[::1]:53: read: connection refused"}
[ 9.270952] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on 8.8.8.8:53: dial udp 8.8.8.8:53: connect: network is unreachable"}
When I destroy the cluster, I can see this error with journalctl -x -e:
firewalld[1057]: ERROR: UNKNOWN_SOURCE: '10.5.0.3/32' is not in any zone
This is likely the same error that @sauterp faces. I would speculate that the problem lies in setting up the VM's network interface or connecting it to the network bridge.
Setting up a cluster with docker works without problems.
With the firewall-cmd trusted-zone fix from the earlier comment, I get past fetching the config, but the VMs are stuck pulling the installer image from ghcr and then time out getting NTP time from the next IP:
[ 4.854735] [talos] task install (1/1): starting
[ 5.027136] [talos] pulling "ghcr.io/siderolabs/installer:v1.0.5"
INFO[00[ 5.030125] [talos] retrying error: failed to pull image "ghcr.io/siderolabs/installer:v1.0.5": failed to resolve reference "ghcr.io/siderolabs/installer:v1.0.5": failed to do request: Head "https://ghcr.io/v2/siderolabs/installer/
manifests/v1.0.5": dial tcp: lookup ghcr.io on [::1]:53: read udp [::1]:43372->[::1]:53: read: connection refused
03] trying next host error="failed to do request: Head \"https://ghcr.io/v2/siderolabs/installer/manifests/v1.0.5\": dial tcp: lookup ghcr.io on [::1]:53: read udp [::1]:43372->[::1]:53: read: connection refused"
host=ghcr.io
INFO[0003] trying next host error="failed to do request: Head \"https://ghcr.io/v2/siderolabs/installer/manifests/v1.0.5\": dial tcp: lookup ghcr.io on [::1]:53: read udp [::1]:44487->[::1]:53: read: connection r
efused" host=ghcr.io
[ 5.044540] [talos] retrying error: failed to pull image "ghcr.io/siderolabs/installer:v1.0.5": failed to resolve reference "ghcr.io/siderolabs/installer:v1.0.5": failed to do request: Head "https://ghcr.io/v2/siderolabs/installer/manifes
ts/v1.0.5": dial tcp: lookup ghcr.io on [::1]:53: read udp [::1]:44487->[::1]:53: read: connection refused
[ 5.280540] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on [::1]:53: read udp [::1]:57565->[::1]:53: read: connection refused"}
[ 6.285757] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on [::1]:53: read udp [::1]:45597->[::1]:53: read: connection refused"}
[ 7.290855] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on [::1]:53: read udp [::1]:39912->[::1]:53: read: connection refused"}
[ 28.300310] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on 1.1.1.1:53: read udp 10.5.0.2:36277->1.1.1.1:53: i/o timeout"}
@hobyte How did you get the log output from the qemu vms?
My .log files only contain the startup command, and talosctl --talosconfig talos-config -n 10.5.0.2 dmesg fails with "failed to determine endpoints".
See the Development Guide
Mine only contain the startup command:
❯ tail -F talos-alderson-*.log
==> talos-alderson-master-1.log <==
starting /usr/libexec/qemu-kvm with args:
-m 2048 -smp cpus=1 -cpu max -nographic -netdev tap,id=net0,ifname=tap0,script=no,downscript=no -device virtio-net-pci,netdev=net0,mac=d2:98:94:66:6a:1c -device virtio-rng-pci -device virtio-balloon,deflate-on-oom=on -monitor unix:/root/.talos/clusters/talos-alderson/talos-alderson-master-1.monitor,server,nowait -no-reboot -boot order=cn,reboot-timeout=5000 -smbios type=1,uuid=d253bb1f-f785-443a-9aab-bacbec385485 -drive format=raw,if=virtio,file=/root/.talos/clusters/talos-alderson/talos-alderson-master-1-0.disk,cache=unsafe -machine q35,accel=kvm -drive file=/root/.talos/clusters/talos-alderson/talos-alderson-master-1-flash0.img,format=raw,if=pflash -kernel /home/amidala/talos/images/v1.1.0-beta2/vmlinuz-amd64 -initrd /home/amidala/talos/images/v1.1.0-beta2/initramfs-amd64.xz -append init_on_alloc=1 slab_nomerge pti=on consoleblank=0 nvme_core.io_timeout=4294967295 random.trust_cpu=on printk.devkmsg=on ima_template=ima-ng ima_appraise=fix ima_hash=sha512 console=ttyS0 reboot=k panic=1 talos.shutdown=halt talos.platform=metal talos.config=http://10.5.0.1:39589/config.yaml
==> talos-alderson-worker-1.log <==
starting /usr/libexec/qemu-kvm with args:
-m 2048 -smp cpus=1 -cpu max -nographic -netdev tap,id=net0,ifname=tap0,script=no,downscript=no -device virtio-net-pci,netdev=net0,mac=7a:fc:f9:e2:56:6f -device virtio-rng-pci -device virtio-balloon,deflate-on-oom=on -monitor unix:/root/.talos/clusters/talos-alderson/talos-alderson-worker-1.monitor,server,nowait -no-reboot -boot order=cn,reboot-timeout=5000 -smbios type=1,uuid=42f7131e-cf02-44bf-9d0e-c3afeda7d542 -drive format=raw,if=virtio,file=/root/.talos/clusters/talos-alderson/talos-alderson-worker-1-0.disk,cache=unsafe -machine q35,accel=kvm -drive file=/root/.talos/clusters/talos-alderson/talos-alderson-worker-1-flash0.img,format=raw,if=pflash -kernel /home/amidala/talos/images/v1.1.0-beta2/vmlinuz-amd64 -initrd /home/amidala/talos/images/v1.1.0-beta2/initramfs-amd64.xz -append init_on_alloc=1 slab_nomerge pti=on consoleblank=0 nvme_core.io_timeout=4294967295 random.trust_cpu=on printk.devkmsg=on ima_template=ima-ng ima_appraise=fix ima_hash=sha512 console=ttyS0 reboot=k panic=1 talos.shutdown=halt talos.platform=metal talos.config=http://10.5.0.1:45587/config.yaml
Interesting, how long do you wait for the logs? It might take some time for Talos to start up.
I just discovered that the load balancer cannot reach the VMs:
$ cat clusters/talos-default/lb.log
2022/06/26 00:00:07 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:08 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:09 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:10 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:11 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:12 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:13 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:14 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:15 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:16 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:17 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:18 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:19 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:20 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:21 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:22 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:23 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:24 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
.
.
.
Is this normal behavior, @smira, when the API is not reachable?
I think it's expected to happen in the beginning, before the cluster has bootstrapped, but after a few minutes the kubelets should be running and it should resolve. << my qemu cluster never gets to that point
I'm just out of ideas on how to find the problem. I think it's the CNI, but I see no logs and don't know how to debug it.
Yeah, I think so too. I moved on to playing with Talos on virtual private servers. It's distro-related, though; it works fine on other distros.
Yeah, I tested it on Ubuntu and it works fine. It just bothers me that I cannot get to the bottom of the problem. It would be nice not to have to use Ubuntu to develop Talos.
What you could try, instead of using the qemu provisioner, is to manually create qemu-kvm VMs using the Talos ISO.
I have identified one problem: deactivating firewalld solves the connection problem on my openSUSE system. Now on to finding a solution.
I tried that; it only allowed me to use docker, not qemu. Does qemu fully work for you now?
Yes, the installation finished without problems. I quickly tested running a hello-world example and installing the Kubernetes dashboard, and they both work 😀
I'm not an expert in RedHat-based distros, talosctl cluster create works just fine on Debian-based distros.
What we do is use CNI plugins to bring up the bridge on the host, assign IP to it, and bring up a pair of veth devices for each node. One end of the pair is connected to the bridge, another end is in separate network namespace, connected to the tap device which is connected directly to QEMU.
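For anyone following along, that topology can be inspected on the host with standard iproute2 tooling (a sketch; <namespace> is a placeholder, and whether the per-node namespaces appear in ip netns list depends on how they are mounted):

ip link show type bridge        # the talos<id> bridge created on the host
bridge link show                # the veth ends attached to that bridge
ip netns list                   # per-node network namespaces, if mounted under /run/netns
ip -n <namespace> link show     # the other veth end and the tap device inside a node's namespace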
Okay, I might investigate the CNI more. Is there a configuration for it, especially for the routing?
It should be pretty much default bridge CNI config: https://github.com/siderolabs/talos/blob/master/pkg/provision/providers/vm/network.go#L153-L193
Nothing fancy here, and the CNI plugin is supposed to set things up, handling the quirks of each OS.
It's not the CNI. But I had some problems with the caching registries, and the problem has the same source:
firewalld prevents talosctl from connecting to the VMs.
-> Workaround: stop firewalld
But the caching registries don't work when firewalld is stopped.
Creating the registry proxies with firewalld stopped:
» hack/start-registry-proxies.sh simon@edora
b5cab21e0024c71c4d69c269b9e3d5f0d843611b1542f8de27fcd51a73cd10b6
docker: Error response from daemon: driver failed programming external connectivity on endpoint registry-docker.io (e2f979cf543dd532e9211bcc28e548bd0fa44cd264d8d7f27c4bde41ca6592cb): (iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 5000 -j DNAT --to-destination 172.17.0.2:5000 ! -i docker0: iptables: No chain/target/match by that name.
(exit status 1)).
-> firewalld must be active for the registry proxies to work
Summary
Add firewalld rules allowing connections from talosctl to the VMs and vice versa. Ideally, talosctl cluster create would set them up itself.
For what it's worth... bridge CNI plugin injects iptables rules to do the NAT. It feels the issue is communication between the VMs and the host IPs.
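Those CNI-injected NAT rules can be checked on the host; the chains created by the standard plugins are typically prefixed with CNI- (a quick sketch):

sudo iptables -t nat -S POSTROUTING
sudo iptables -t nat -S | grep -i cni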
Workaround
(tested on openSUSE and Fedora)
Stop firewalld:
systemctl stop firewalld
Run talosctl cluster create without the caching registries:
sudo --preserve-env=HOME _out/talosctl-linux-amd64 cluster create --provisioner=qemu --cidr=172.20.0.0/24 --registry-mirror 127.0.0.1:5005=http://172.20.0.1:5005 --install-image=127.0.0.1:5005/siderolabs/installer:v1.1.0-alpha.2-61-g87e7de30c --masters 3 --workers 2 --with-bootloader=false
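An alternative to stopping firewalld entirely, based on the trusted-zone trick earlier in the thread, would be to trust the bridge interface or its subnet (a sketch only, not verified beyond the comments above; the interface name and CIDR are examples from this thread and depend on your cluster):

sudo firewall-cmd --permanent --zone=trusted --add-interface=talos3d3a9d82
sudo firewall-cmd --permanent --zone=trusted --add-source=172.20.0.0/24
sudo firewall-cmd --reload
sudo systemctl restart docker    # so docker re-creates its iptables chains after the firewall change

The docker restart matters for the caching registries: as noted above, the registry proxies fail to publish their ports once docker's iptables chains have been flushed.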
@smira I'm trying to understand the CreateNetwork function, especially where the bridge and the network are created. Do you know which line the configs are applied, I can only see the creation of the configs.
The CreateNetwork function creates the bridge before creating the nodes, as we need something to bind the nodes to before they are created. The CNI config is generated and applied closer to the end of the function via the CNI libraries.
I just wanted to add to this thread. I am running into this on an Ubuntu derivative as well.
❯ bat /etc/lsb-release
───────┬────────────────────────────────────────────
│ File: /etc/lsb-release
───────┼────────────────────────────────────────────
1 │ DISTRIB_ID=Pop
2 │ DISTRIB_RELEASE=22.04
3 │ DISTRIB_CODENAME=jammy
4 │ DISTRIB_DESCRIPTION="Pop!_OS 22.04 LTS"
Once I disabled ufw, I could get past this error.
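For ufw-based setups like this one, an alternative to disabling it completely might be to allow traffic on the talos bridge interface (an untested sketch; talos<id> is a placeholder, find the real name with ip link show type bridge):

sudo ufw allow in on talos<id>
sudo ufw reload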