talosctl: connection errors when waiting for API while provisioning a local qemu cluster
Bug Report
Description
When I try to provision a local QEMU cluster on Fedora 35, I get network connection errors while talosctl waits for the API. If I disable the firewall with sudo systemctl stop firewalld and then restart docker with sudo systemctl restart docker, the problem disappears.
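For reference, the workaround as a copy-pasteable sketch (stopping the firewall is a blunt instrument rather than a fix):

sudo systemctl stop firewalld
sudo systemctl restart docker   # docker rebuilds its iptables chains on restart

With those two commands the provisioning succeeds; the logs below show the failing case with firewalld active.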
Logs
> sudo -E _out/talosctl-linux-amd64 cluster create --provisioner=qemu --cidr=172.20.0.0/24 --registry-mirror docker.io=http://172.20.0.1:5000 --registry-mirror k8s.gcr.io=http://172.20.0.1:5001 --registry-mirror quay.io=http://172.20.0.1:5002 --registry-mirror gcr.io=http://172.20.0.1:5003 --registry-mirror ghcr.io=http://172.20.0.1:5004 --registry-mirror 127.0.0.1:5005=http://172.20.0.1:5005 --install-image=127.0.0.1:5005/siderolabs/installer:v1.1.0-alpha.1-54-g129f3e6e2-dirty --masters 1 --workers 0 --with-bootloader=true --wait --config-patch '[{"op":"add","path":"/machine/install/extensions","value":[{"image":"127.0.0.1:5005/siderolabs/hello-world-service:v1.0.0-5-g3ccc1b5-dirty"}]}]'
validating CIDR and reserving IPs
generating PKI and tokens
creating state directory in "/root/.talos/clusters/talos-default"
creating network talos-default
creating load balancer
creating dhcpd
creating master nodes
creating worker nodes
renamed talosconfig context "talos-default" -> "talos-default-23"
waiting for API
bootstrap error: 4 error(s) occurred:
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.20.0.2:50000: connect: connection refused"
rpc error: code = DeadlineExceeded desc = context deadline exceeded
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.20.0.2:50000: connect: no route to host"
timeout
Environment
- Talos Installer: v1.1.0-alpha.1-54
- Platform: Fedora release 35 (Thirty Five)
Do you have any networks claiming the same CIDR? ip addr show
I would also check docker network ls
Both of those look good to me.
I see this in journalctl -x -e while provisioning the cluster:
May 13 11:37:22 robert kernel: IN_public_REJECT: IN=talos3d3a9d82 OUT= MAC= SRC=172.20.0.1 DST=172.20.0.255 LEN=542 TOS=0x00 PREC=0x00 TTL=64 ID=28060 DF PROTO=UDP SPT=44842 DPT=21027 LEN=522
:shrug: talosctl is using standard CNI plugins to set up the bridge and networking...
this packet looks to be a DHCP packet, probably? It's still local to the node (or at least it should be)
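The IN_public_REJECT prefix in that journal line suggests firewalld is treating the bridge interface as part of the public zone. A quick way to confirm (a diagnostic sketch; talos3d3a9d82 is the interface name from the log above and will differ per cluster):

sudo firewall-cmd --get-default-zone
sudo firewall-cmd --get-active-zones
sudo firewall-cmd --get-zone-of-interface=talos3d3a9d82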
When I run sudo -E ./_out/talosctl-linux-amd64 cluster destroy --provisioner=qemu
I see this in the systemd logs of firewalld:
firewalld[902]: ERROR: UNKNOWN_SOURCE: '172.20.0.2/32' is not in any zone
I got one small step further. Previously the log of the master node would repeat this forever:
[ 72.414966] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on 8.8.8.8:53: dial udp 8.8.8.8:53: connect: network is unreachable"}
When I do this, it gets past that:
sudo firewall-cmd --permanent --zone=trusted --add-interface=talos3d3a9d82
systemctl restart firewalld
Now it repeats this a lot:
[ 325.284751] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on 1.1.1.1:53: read udp 172.20.0.2:42657->1.1.1.1:53: i/o timeout"}
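Speculating a bit: the change from "network is unreachable" to an i/o timeout suggests the VM can now reach the host, but forwarded traffic from the bridge subnet out to the internet is still being dropped. A few host-side checks, purely a sketch and not a confirmed fix (replace public with whatever firewall-cmd --get-default-zone reports):

sysctl net.ipv4.ip_forward                        # must be 1 for VM traffic to be routed out
sudo iptables -t nat -S | grep 172.20.0           # is the CNI masquerade rule for the bridge subnet present?
sudo firewall-cmd --zone=public --add-masquerade  # speculative: enable masquerading on the zone that handles outbound traffic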
I am having a similar issue: talosctl never gets past waiting for API with QEMU. The QEMU nodes refuse to communicate with each other; 10.5.0.1 pings perfectly fine, but 10.5.0.2-5 don't.
Docker initially didn't work either, but after disabling firewalld and restarting docker as suggested here, it worked.
journalctl didn't really give any clear errors.
Running Alma Linux.
I have the same issue on openSUSE Tumbleweed. Starting the cluster with QEMU quits with an error message:
» sudo -E talosctl cluster create --provisioner=qemu --with-uefi=false
validating CIDR and reserving IPs
generating PKI and tokens
creating state directory in "/root/.talos/clusters/talos-default"
creating network talos-default
creating load balancer
creating dhcpd
creating master nodes
creating worker nodes
renamed talosconfig context "talos-default" -> "talos-default-4"
waiting for API
bootstrap error: 3 error(s) occurred:
rpc error: code = DeadlineExceeded desc = context deadline exceeded
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.5.0.2:50000: connect: no route to host"
timeout
Checking the network with nmap shows only 10.5.0.1 being reachable:
» nmap -sn 10.5.0.1/24
Starting Nmap 7.92 ( https://nmap.org ) at 2022-06-20 22:24 CEST
Nmap scan report for 10.5.0.1
Host is up (0.00018s latency).
Nmap done: 256 IP addresses (1 host up) scanned in 3.10 seconds
While the logs of both the master and the worker node show this error:
[ 3.610283] [talos] fetching machine config from: "http://10.5.0.1:45685/config.yaml"
[ 3.610953] [talos] retrying error: Get "http://10.5.0.1:45685/config.yaml": dial tcp 10.5.0.1:45685: connect: network is unreachable
[ 8.264910] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on [::1]:53: read udp [::1]:45773->[::1]:53: read: connection refused"}
[ 9.270952] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on 8.8.8.8:53: dial udp 8.8.8.8:53: connect: network is unreachable"}
When I destroy the cluster, I can see this error with journalctl -x -e:
firewalld[1057]: ERROR: UNKNOWN_SOURCE: '10.5.0.3/32' is not in any zone
This is likely the same error that @sauterp faces. I would speculate that the problem lies in setting up the VM's network interface or connecting it to the network bridge.
Setting up a cluster with docker works without problems.
With the firewall-cmd trusted-zone fix from the earlier comment, I get past fetching the config, but the VMs are stuck pulling the installer image from ghcr and then time out getting NTP time from the next IP:
[ 4.854735] [talos] task install (1/1): starting
[ 5.027136] [talos] pulling "ghcr.io/siderolabs/installer:v1.0.5"
INFO[00[ 5.030125] [talos] retrying error: failed to pull image "ghcr.io/siderolabs/installer:v1.0.5": failed to resolve reference "ghcr.io/siderolabs/installer:v1.0.5": failed to do request: Head "https://ghcr.io/v2/siderolabs/installer/
manifests/v1.0.5": dial tcp: lookup ghcr.io on [::1]:53: read udp [::1]:43372->[::1]:53: read: connection refused
03] trying next host error="failed to do request: Head \"https://ghcr.io/v2/siderolabs/installer/manifests/v1.0.5\": dial tcp: lookup ghcr.io on [::1]:53: read udp [::1]:43372->[::1]:53: read: connection refused"
host=ghcr.io
INFO[0003] trying next host error="failed to do request: Head \"https://ghcr.io/v2/siderolabs/installer/manifests/v1.0.5\": dial tcp: lookup ghcr.io on [::1]:53: read udp [::1]:44487->[::1]:53: read: connection r
efused" host=ghcr.io
[ 5.044540] [talos] retrying error: failed to pull image "ghcr.io/siderolabs/installer:v1.0.5": failed to resolve reference "ghcr.io/siderolabs/installer:v1.0.5": failed to do request: Head "https://ghcr.io/v2/siderolabs/installer/manifes
ts/v1.0.5": dial tcp: lookup ghcr.io on [::1]:53: read udp [::1]:44487->[::1]:53: read: connection refused
[ 5.280540] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on [::1]:53: read udp [::1]:57565->[::1]:53: read: connection refused"}
[ 6.285757] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on [::1]:53: read udp [::1]:45597->[::1]:53: read: connection refused"}
[ 7.290855] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on [::1]:53: read udp [::1]:39912->[::1]:53: read: connection refused"}
[ 28.300310] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on 1.1.1.1:53: read udp 10.5.0.2:36277->1.1.1.1:53: i/o timeout"}
@hobyte How did you get the log output from the qemu vms?
My .log files only contain the startup command, and talosctl --talosconfig talos-config -n 10.5.0.2 dmesg fails with "failed to determine endpoints".
See the Development Guide
Mine only contain the startup command:
❯ tail -F talos-alderson-*.log
==> talos-alderson-master-1.log <==
starting /usr/libexec/qemu-kvm with args:
-m 2048 -smp cpus=1 -cpu max -nographic -netdev tap,id=net0,ifname=tap0,script=no,downscript=no -device virtio-net-pci,netdev=net0,mac=d2:98:94:66:6a:1c -device virtio-rng-pci -device virtio-balloon,deflate-on-oom=on -monitor unix:/root/.talos/clusters/talos-alderson/talos-alderson-master-1.monitor,server,nowait -no-reboot -boot order=cn,reboot-timeout=5000 -smbios type=1,uuid=d253bb1f-f785-443a-9aab-bacbec385485 -drive format=raw,if=virtio,file=/root/.talos/clusters/talos-alderson/talos-alderson-master-1-0.disk,cache=unsafe -machine q35,accel=kvm -drive file=/root/.talos/clusters/talos-alderson/talos-alderson-master-1-flash0.img,format=raw,if=pflash -kernel /home/amidala/talos/images/v1.1.0-beta2/vmlinuz-amd64 -initrd /home/amidala/talos/images/v1.1.0-beta2/initramfs-amd64.xz -append init_on_alloc=1 slab_nomerge pti=on consoleblank=0 nvme_core.io_timeout=4294967295 random.trust_cpu=on printk.devkmsg=on ima_template=ima-ng ima_appraise=fix ima_hash=sha512 console=ttyS0 reboot=k panic=1 talos.shutdown=halt talos.platform=metal talos.config=http://10.5.0.1:39589/config.yaml
==> talos-alderson-worker-1.log <==
starting /usr/libexec/qemu-kvm with args:
-m 2048 -smp cpus=1 -cpu max -nographic -netdev tap,id=net0,ifname=tap0,script=no,downscript=no -device virtio-net-pci,netdev=net0,mac=7a:fc:f9:e2:56:6f -device virtio-rng-pci -device virtio-balloon,deflate-on-oom=on -monitor unix:/root/.talos/clusters/talos-alderson/talos-alderson-worker-1.monitor,server,nowait -no-reboot -boot order=cn,reboot-timeout=5000 -smbios type=1,uuid=42f7131e-cf02-44bf-9d0e-c3afeda7d542 -drive format=raw,if=virtio,file=/root/.talos/clusters/talos-alderson/talos-alderson-worker-1-0.disk,cache=unsafe -machine q35,accel=kvm -drive file=/root/.talos/clusters/talos-alderson/talos-alderson-worker-1-flash0.img,format=raw,if=pflash -kernel /home/amidala/talos/images/v1.1.0-beta2/vmlinuz-amd64 -initrd /home/amidala/talos/images/v1.1.0-beta2/initramfs-amd64.xz -append init_on_alloc=1 slab_nomerge pti=on consoleblank=0 nvme_core.io_timeout=4294967295 random.trust_cpu=on printk.devkmsg=on ima_template=ima-ng ima_appraise=fix ima_hash=sha512 console=ttyS0 reboot=k panic=1 talos.shutdown=halt talos.platform=metal talos.config=http://10.5.0.1:45587/config.yaml
Interesting, how long do you wait for the logs? It might take some time for Talos to start up.
I just discovered that the load balancer cannot reach the VMs:
$ cat clusters/talos-default/lb.log
2022/06/26 00:00:07 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:08 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:09 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:10 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:11 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:12 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:13 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:14 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:15 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:16 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:17 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:18 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:19 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:20 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:21 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:22 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:23 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
2022/06/26 00:00:24 healthcheck failed for "10.5.0.2:6443": dial tcp 10.5.0.2:6443: connect: connection refused
.
.
.
Is this normal behavior, @smira, when the API is not reachable?
I think it's expected to happen in the beginning, before the cluster has bootstrapped, but after a few minutes the kubelets should be running and it should resolve. << my qemu cluster never gets to that point
I'm just out of ideas on how to find the problem. I think it's the CNI, but I see no logs and don't know how to debug it.
Yeah, I think so too. I moved on to playing with Talos on virtual private servers. It's distro-related, though; it works fine on other distros.
Yeah, I tested it on Ubuntu and it works fine. It just bothers me that I cannot get to the bottom of the problem. It would be nice not to have to use Ubuntu to develop Talos.
What you could try, instead of using the qemu provisioner, is to manually create qemu-kvm VMs using the Talos ISO.
I have identified one problem: deactivating firewalld solves the connection problem on my openSUSE system. Now on to finding a solution.
I tried that; it only allowed me to use docker, not qemu. Does qemu fully work for you now?
Yes, the installation finished without problems. I quickly tested running a hello-world example and installing the Kubernetes dashboard, and they both work 😀
I'm not an expert in RedHat-based distros, talosctl cluster create works just fine on Debian-based distros.
What we do is use CNI plugins to bring up the bridge on the host, assign IP to it, and bring up a pair of veth devices for each node. One end of the pair is connected to the bridge, another end is in separate network namespace, connected to the tap device which is connected directly to QEMU.
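For anyone following along, that topology can be inspected on the host with standard iproute2 tooling (a sketch; <namespace> is a placeholder, and whether the per-node namespaces appear in ip netns list depends on how they are mounted):

ip link show type bridge        # the talos<id> bridge created on the host
bridge link show                # the veth ends attached to that bridge
ip netns list                   # per-node network namespaces, if mounted under /run/netns
ip -n <namespace> link show     # the other veth end and the tap device inside a node's namespace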
Okay, I might investigate the CNI more. Is there a configuration for it, especially for the routing?
It should be pretty much default bridge CNI config: https://github.com/siderolabs/talos/blob/master/pkg/provision/providers/vm/network.go#L153-L193
Nothing fancy here, and the CNI plugin is supposed to set things up, handling the quirks of each OS.
It's not the CNI. But I had some problems with the caching registries, and the problem has the same source:
firewalld prevents talosctl from connecting to the VMs.
-> Workaround: stop firewalld
But the caching registries don't work when firewalld is stopped.
Creating the registry proxies with firewalld stopped:
» hack/start-registry-proxies.sh simon@edora
b5cab21e0024c71c4d69c269b9e3d5f0d843611b1542f8de27fcd51a73cd10b6
docker: Error response from daemon: driver failed programming external connectivity on endpoint registry-docker.io (e2f979cf543dd532e9211bcc28e548bd0fa44cd264d8d7f27c4bde41ca6592cb): (iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 5000 -j DNAT --to-destination 172.17.0.2:5000 ! -i docker0: iptables: No chain/target/match by that name.
(exit status 1)).
-> firewalld must be active for the registry proxies to work
Summary
Add firewalld rules allowing connections from talosctl to the VMs and vice versa. Ideally, talosctl cluster create would set them up itself.
For what it's worth... bridge CNI plugin injects iptables rules to do the NAT. It feels the issue is communication between the VMs and the host IPs.
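Those CNI-injected NAT rules can be checked on the host; the chains created by the standard plugins are typically prefixed with CNI- (a quick sketch):

sudo iptables -t nat -S POSTROUTING
sudo iptables -t nat -S | grep -i cni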
Workaround
(tested on openSUSE and Fedora)
Stop firewalld:
systemctl stop firewalld
Run talosctl cluster create without the caching registries:
sudo --preserve-env=HOME _out/talosctl-linux-amd64 cluster create --provisioner=qemu --cidr=172.20.0.0/24 --registry-mirror 127.0.0.1:5005=http://172.20.0.1:5005 --install-image=127.0.0.1:5005/siderolabs/installer:v1.1.0-alpha.2-61-g87e7de30c --masters 3 --workers 2 --with-bootloader=false
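An alternative to stopping firewalld entirely, based on the trusted-zone trick earlier in the thread, would be to trust the bridge interface or its subnet (a sketch only, not verified beyond the comments above; the interface name and CIDR are examples from this thread and depend on your cluster):

sudo firewall-cmd --permanent --zone=trusted --add-interface=talos3d3a9d82
sudo firewall-cmd --permanent --zone=trusted --add-source=172.20.0.0/24
sudo firewall-cmd --reload
sudo systemctl restart docker    # so docker re-creates its iptables chains after the firewall change

The docker restart matters for the caching registries: as noted above, the registry proxies fail to publish their ports once docker's iptables chains have been flushed.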
@smira I'm trying to understand the CreateNetwork function, especially where the bridge and the network are created. Do you know which line the configs are applied, I can only see the creation of the configs.
The CreateNetwork function creates the bridge before creating the nodes, as we need something to bind the nodes to before they are created. The CNI config is generated and applied closer to the end of the function via the CNI libraries.
I just wanted to add to this thread. I am running into this on an Ubuntu derivative as well.
❯ bat /etc/lsb-release
───────┬────────────────────────────────────────────
│ File: /etc/lsb-release
───────┼────────────────────────────────────────────
1 │ DISTRIB_ID=Pop
2 │ DISTRIB_RELEASE=22.04
3 │ DISTRIB_CODENAME=jammy
4 │ DISTRIB_DESCRIPTION="Pop!_OS 22.04 LTS"
Once I disabled ufw, I could get past this error.
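For ufw-based setups like this one, an alternative to disabling it completely might be to allow traffic on the talos bridge interface (an untested sketch; talos<id> is a placeholder, find the real name with ip link show type bridge):

sudo ufw allow in on talos<id>
sudo ufw reload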