Multi controller install fails on OL9
When attempting to install k0s via k0sctl using a multi-controller setup, the installation fails. This does not happen if only one node is a controller (or controller+worker) and the rest of the nodes are workers. I have tested both with node-local load balancing and without load balancing; the same issue arises in both cases.
System Information: os-release
NAME="Oracle Linux Server"
VERSION="9.3"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Oracle Linux Server 9.3"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:9:3:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://github.com/oracle/oracle-linux"
ORACLE_BUGZILLA_PRODUCT="Oracle Linux 9"
ORACLE_BUGZILLA_PRODUCT_VERSION=9.3
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=9.3
kernel: Linux fwd-oracle 5.14.0-362.13.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Dec 21 22:34:57 PST 2023 x86_64 x86_64 x86_64 GNU/Linux
k0sctl config:
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: k0s-cluster
spec:
  hosts:
  - ssh:
      address: 192.168.15.216
      user: user
      port: 22
      keyPath: /home/user/.ssh/id_ed25519
    hostname: mc-poc-m1
    role: controller+worker
    uploadBinary: true
    k0sBinaryPath: /usr/local/bin/k0s
    files:
    - src: /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
      dstDir: /var/lib/k0s/images/
      perm: 075
  - ssh:
      address: 192.168.14.186
      user: user
      port: 22
      keyPath: /home/user/.ssh/id_ed25519
    hostname: mc-poc-m2
    role: controller+worker
    uploadBinary: true
    k0sBinaryPath: /usr/local/bin/k0s
    files:
    - src: /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
      dstDir: /var/lib/k0s/images/
      perm: 075
  - ssh:
      address: 192.168.15.88
      user: user
      port: 22
      keyPath: /home/user/.ssh/id_ed25519
    hostname: mc-poc-m3
    role: controller+worker
    uploadBinary: true
    k0sBinaryPath: /usr/local/bin/k0s
    files:
    - src: /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
      dstDir: /var/lib/k0s/images/
      perm: 075
  - ssh:
      address: 192.168.14.252
      user: user
      port: 22
      keyPath: /home/user/.ssh/id_ed25519
    hostname: mc-poc-wq
    role: worker
    uploadBinary: true
    k0sBinaryPath: /usr/local/bin/k0s
    files:
    - src: /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
      dstDir: /var/lib/k0s/images/
      perm: 075
  - ssh:
      address: 192.168.15.131
      user: user
      port: 22
      keyPath: /home/user/.ssh/id_ed25519
    hostname: mc-poc-w2
    role: worker
    uploadBinary: true
    k0sBinaryPath: /usr/local/bin/k0s
    files:
    - src: /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
      dstDir: /var/lib/k0s/images/
      perm: 075
  k0s:
    version: v1.28.5+k0s.0
    dynamicConfig: false
    config:
      spec:
        network:
          calico:
            mode: vxlan
            overlay: always
            vxlanPort: 4789
            vxlanVNI: 4096
            mtu: 0
            wireguard: true
          clusterDomain: cluster.local
          dualStack: {}
          kubeProxy:
            mode: iptables
          podCIDR: 10.244.0.0/16
          provider: calico
          serviceCIDR: 10.96.0.0/12
          nodeLocalLoadBalancing:
            enabled: true
            type: EnvoyProxy
        telemetry:
          enabled: false
      status: {}
logs: k0sctl.log
Based on this issue: https://github.com/k0sproject/k0s/issues/3337#issuecomment-1912654654
This is probably unrelated, but:
- src: /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
  dstDir: /var/lib/k0s/images/
  perm: 075
That will make the permissions ---rwxr-x; I suppose it's not a problem when running as root/sudo.
In the logs, I see these:
time="29 Jan 24 17:35 UTC" level=debug msg="retrying, attempt 8 - last error: command failed: client exec: ssh session wait: Process exited with status 7"
time="29 Jan 24 17:35 UTC" level=debug msg="[ssh] 192.168.14.186:22: executing `curl -kso /dev/null --connect-timeout 20 -w \"%{http_code}\" \"https://localhost:6443/version\"`"
Based on that, it seems the second controller is having a tough time joining the cluster. I'd look into the status of k0s on that node to see if there are any hints about why. Log into that machine and check the logs:
journalctl -u k0scontroller ...
@kke would it be possible/make sense if k0sctl could do something like this automatically when it sees k0s is not getting up as expected?
I re-ran it and noticed that the script failed a bit faster while trying to acquire a lock. Additionally, there were no logs on the nodes, because the process never reached the step where the node is installed or set up.
https://k0sproject.io/licenses/eula
INFO ==> Running phase: Connect to hosts
INFO [ssh] 192.168.15.216:22: connected
INFO [ssh] 192.168.14.252:22: connected
INFO [ssh] 192.168.15.131:22: connected
INFO [ssh] 192.168.14.186:22: connected
INFO [ssh] 192.168.15.88:22: connected
INFO ==> Running phase: Detect host operating systems
INFO [ssh] 192.168.15.131:22: is running Oracle Linux Server 9.3
INFO [ssh] 192.168.15.216:22: is running Oracle Linux Server 9.3
INFO [ssh] 192.168.14.186:22: is running Oracle Linux Server 9.3
INFO [ssh] 192.168.15.88:22: is running Oracle Linux Server 9.3
INFO [ssh] 192.168.14.252:22: is running Oracle Linux Server 9.3
INFO ==> Running phase: Acquire exclusive host lock
INFO ==> Running phase: Prepare hosts
INFO ==> Running phase: Gather host facts
INFO [ssh] 192.168.14.186:22: using mc-poc-m2 from configuration as hostname
INFO [ssh] 192.168.15.216:22: using mc-poc-m1 from configuration as hostname
INFO [ssh] 192.168.14.252:22: using mc-poc-wq from configuration as hostname
INFO [ssh] 192.168.15.131:22: using mc-poc-w2 from configuration as hostname
INFO [ssh] 192.168.15.88:22: using mc-poc-m3 from configuration as hostname
INFO [ssh] 192.168.14.186:22: discovered eth0 as private interface
INFO [ssh] 192.168.15.216:22: discovered eth0 as private interface
INFO [ssh] 192.168.15.131:22: discovered eth0 as private interface
INFO [ssh] 192.168.14.252:22: discovered eth0 as private interface
INFO [ssh] 192.168.15.88:22: discovered eth0 as private interface
INFO ==> Running phase: Validate hosts
INFO ==> Running phase: Gather k0s facts
INFO [ssh] 192.168.15.216:22: found existing configuration
INFO [ssh] 192.168.14.186:22: found existing configuration
INFO [ssh] 192.168.15.88:22: found existing configuration
INFO ==> Running phase: Validate facts
INFO ==> Running phase: Upload files to hosts
INFO [ssh] 192.168.15.131:22: uploading /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
INFO [ssh] 192.168.15.216:22: uploading /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
INFO [ssh] 192.168.14.186:22: uploading /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
INFO [ssh] 192.168.15.88:22: uploading /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
INFO [ssh] 192.168.14.252:22: uploading /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
INFO [ssh] 192.168.15.131:22: file already exists and hasn't been changed, skipping upload
INFO [ssh] 192.168.14.186:22: file already exists and hasn't been changed, skipping upload
INFO [ssh] 192.168.15.216:22: file already exists and hasn't been changed, skipping upload
INFO [ssh] 192.168.14.252:22: file already exists and hasn't been changed, skipping upload
INFO [ssh] 192.168.15.88:22: file already exists and hasn't been changed, skipping upload
INFO [ssh] 192.168.15.216:22: validating configuration
INFO [ssh] 192.168.14.186:22: validating configuration
INFO [ssh] 192.168.15.88:22: validating configuration
INFO ==> Running phase: Initialize the k0s cluster
INFO [ssh] 192.168.15.216:22: installing k0s controller
INFO * Running clean-up for phase: Acquire exclusive host lock
INFO * Running clean-up for phase: Initialize the k0s cluster
INFO [ssh] 192.168.15.216:22: cleaning up
INFO ==> Apply failed
@kke would it be possible/make sense if k0sctl could do something like this automatically when it sees k0s is not getting up as expected?
Hmm, interesting idea, so it would try to dig up some diagnostics logs on failure 🤔 That could be handy.
INFO ==> Running phase: Initialize the k0s cluster
INFO [ssh] 192.168.15.216:22: installing k0s controller
INFO * Running clean-up for phase: Acquire exclusive host lock
INFO * Running clean-up for phase: Initialize the k0s cluster
INFO [ssh] 192.168.15.216:22: cleaning up
INFO ==> Apply failed
No error displayed? That's not nice.
The lock file is just for avoiding two instances of k0sctl operating at the same time; maybe it should be quieter about it. The actual problem is somewhere else.
New discovery! I copied the install command from the logs and ran it as a standalone command on the server (without escaping), and it causes a nil pointer panic.
log line:
time="31 Jan 24 19:16 UTC" level=debug msg="[ssh] 192.168.15.216:22: executing `sudo -s -- /usr/local/bin/k0s install controller --data-dir=/var/lib/k0s --enable-worker --config \"/etc/k0s/k0s.yaml\" --kubelet-extra-args=\"--hostname-override=mc-poc-m1\"`"
The current content of /etc/k0s/k0s.yaml is the following:
apiVersion: k0s.k0sproject.io/v1beta1
kind: ClusterConfig
spec:
  api:
    address: 192.168.15.216
    sans:
    - 192.168.15.216
    - 192.168.14.186
    - 192.168.15.88
    - 127.0.0.1
  controllerManager: {}
  extensions: null
  installConfig: null
  konnectivity:
    adminPort: 8133
    agentPort: 8132
  network:
    calico:
      mode: vxlan
      mtu: 0
      overlay: always
      vxlanPort: 4789
      vxlanVNI: 4096
      wireguard: true
    clusterDomain: cluster.local
    dualStack: {}
    kubeProxy:
      mode: iptables
    podCIDR: 10.244.0.0/16
    provider: calico
    serviceCIDR: 10.96.0.0/12
  podSecurityPolicy:
    defaultPolicy: 00-k0s-privileged
  scheduler: {}
  telemetry:
    enabled: false
status: {}
command without escaping:
sudo -s -- /usr/local/bin/k0s install controller --data-dir=/var/lib/k0s --enable-worker --config /etc/k0s/k0s.yaml --kubelet-extra-args="--hostname-override=mc-poc-m1"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2b2c179]
goroutine 1 [running]:
github.com/k0sproject/k0s/pkg/install.CreateControllerUsers(0x6b?, 0xc0003b7680)
/go/src/github.com/k0sproject/k0s/pkg/install/users.go:41 +0x39
github.com/k0sproject/k0s/cmd/install.(*command).setup(0xc001345ce0, {0x3935e03, 0xa}, {0xc00139eeb0, 0x5, 0x5}, 0xc000673e40)
/go/src/github.com/k0sproject/k0s/cmd/install/install.go:68 +0xca
github.com/k0sproject/k0s/cmd/install.installControllerCmd.func1(0xc00127d500, {0x3915737?, 0x5?, 0x5?})
/go/src/github.com/k0sproject/k0s/cmd/install/controller.go:62 +0x197
github.com/spf13/cobra.(*Command).execute(0xc00127d500, {0xc00139e0a0, 0x5, 0x5})
/run/k0s-build/go/mod/github.com/spf13/[email protected]/command.go:940 +0x862
github.com/spf13/cobra.(*Command).ExecuteC(0xc001268300)
/run/k0s-build/go/mod/github.com/spf13/[email protected]/command.go:1068 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
/run/k0s-build/go/mod/github.com/spf13/[email protected]/command.go:992
github.com/k0sproject/k0s/cmd.Execute()
/go/src/github.com/k0sproject/k0s/cmd/root.go:194 +0x1e
main.main()
/go/src/github.com/k0sproject/k0s/main.go:43 +0x225
I think this is great as we are not running blind anymore.
Great news! I was able to solve the issues. The panic above is due to a validation gap while processing the YAML configuration: installConfig: null causes the parsed users list to be nil. A validation step here could probably solve this issue by re-assigning the default users in case the value is nil.
After this I faced another issue with etcd, where I noticed that the systemd service was using the OS hostname and not the hostname from the configuration:
etcd --peer-trusted-ca-file=/var/lib/k0s/pki/etcd/ca.crt --peer-key-file=/var/lib/k0s/pki/etcd/peer.key --log-level=info --peer-client-cert-auth=true --enable-pprof=false --name=fwd-oracle
The issue here is that all nodes had the same hostname, and this was breaking etcd. It also explains why a single controller plus multiple workers worked with no issues. As a validation for k0sctl, I suggest a check that no two hosts use the same hostname, since that will break etcd.
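The kind of validation I mean is straightforward; a minimal sketch (the function name is my own, not k0sctl's actual code):

```go
package main

import "fmt"

// validateUniqueHostnames returns an error if two hosts share a hostname,
// since etcd peers must have distinct --name values.
func validateUniqueHostnames(hostnames []string) error {
	seen := map[string]bool{}
	for _, h := range hostnames {
		if seen[h] {
			return fmt.Errorf("duplicate hostname %q: etcd requires unique peer names", h)
		}
		seen[h] = true
	}
	return nil
}

func main() {
	// Three controllers all reporting the OS hostname "fwd-oracle" would be rejected.
	err := validateUniqueHostnames([]string{"fwd-oracle", "fwd-oracle", "fwd-oracle"})
	fmt.Println(err)
}
```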
I'm also setting the hostname value in the config, but I'm not sure whether it is used internally by k0s or whether it sets the hostname in the OS.
Please feel free to close this issue or keep it open to track the validations.
The panic above is due to a validation gap while processing the YAML configuration: installConfig: null causes the parsed users list to be nil. A validation step here could probably solve this issue by re-assigning the default users in case the value is nil.
That should already be happening here - I haven't figured out yet why it isn't.
I guess as a validation for k0sctl a check to verify that no multiple hosts with the same hostname are used as it will break etcd.
That should be validated already:
https://github.com/k0sproject/k0sctl/blob/main/phase/validate_hosts.go#L54-L60
I'm also using this value in the config but I'm not sure if the value is being used internally in k0s or it is setting the hostname in the OS.
That is only used as --kubelet-extra-args="--hostname-override=<hostname>" when installing k0s (and as the hostname when querying node status). It is not set on the OS.
K0s does not look at that when starting etcd but will always use os.Hostname().
It looks like you found two k0s bugs 🥇