k3s-ansible
Support HA mode with embedded DB
This enables initializing a cluster in HA mode with an embedded DB. https://rancher.com/docs/k3s/latest/en/installation/ha-embedded/
When multiple masters are specified in the master group, k3s-ansible will add the necessary flags during the initialization phase (i.e. --cluster-init and --server)
For the embedded HA mode to work, the k3s version must be >= v1.19.1.
Closes #32
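For reference, a minimal sketch of what an inventory with several masters could look like (hostnames and IPs are placeholders; the master/node group names follow the ones referenced throughout this thread, and the k3s_cluster parent group is assumed from the sample inventory). With more than one host in the master group, the first master is initialized with --cluster-init and the others join with --server https://<first-master>:6443:

[master]
master-0 ansible_host=10.0.0.10
master-1 ansible_host=10.0.0.11
master-2 ansible_host=10.0.0.12

[node]
worker-0 ansible_host=10.0.0.20

[k3s_cluster:children]
master
node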
Right now nodes register using a non-HA endpoint. Either this playbook needs to create such an endpoint, like when loadbalancer_apiserver_localhost
is used in kubespray, or it should ask the user to provide an external load balancer endpoint. https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ha-mode.md
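As an illustration of the second option, and assuming the apiserver_endpoint variable shown in the group_vars later in this thread stays around, pointing it at a VIP or external load balancer instead of the first master could look like this (the address is a placeholder):

apiserver_endpoint: "192.168.8.100"  # VIP or external load balancer fronting all masters on port 6443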
Is there anything in particular blocking this from being merged? I am finding that we need the HA functionality in our deployments. Let me know if there is anything I can do to help.
I think it's ready to merge but if you want to try it first I would be very interested in your feedback :)
I can give it a go. I invoke it as a dependency in an Ansible role, so it won't be tested exactly as is, but it could still be useful to see.
Okay 👌 Let me know how it goes :)
I tested the current version and it works fine when running the first time. Resetting the cluster with reset.yml and rerunning the playbook also works fine.
It does break when re-running the playbook after the cluster has been successfully set up. The "Verify that all nodes actually joined" task fails:
TASK [k3s/master : Verify that all nodes actually joined] ***********************************************
fatal: [master-0]: FAILED! => {"msg": "The conditional check 'nodes.rc == 0 and ((nodes.stdout | from_json)['items'] | json_query('[*].metadata.labels.\"node-role.kubernetes.io/master\"') | count) == (groups['master'] | length)' failed. The error was: Expecting value: line 1 column 1 (char 0)"}
fatal: [master-2]: FAILED! => {"msg": "The conditional check 'nodes.rc == 0 and ((nodes.stdout | from_json)['items'] | json_query('[*].metadata.labels.\"node-role.kubernetes.io/master\"') | count) == (groups['master'] | length)' failed. The error was: Expecting value: line 1 column 1 (char 0)"}
fatal: [master-1]: FAILED! => {"msg": "The conditional check 'nodes.rc == 0 and ((nodes.stdout | from_json)['items'] | json_query('[*].metadata.labels.\"node-role.kubernetes.io/master\"') | count) == (groups['master'] | length)' failed. The error was: Expecting value: line 1 column 1 (char 0)"}
I think the error is caused by the node(s) without the node-role.kubernetes.io/master label.
Any idea how to properly debug the until expression?
I tried pasting the output from the API into https://jmespath.org/ together with the JMESPath expression, but the website prints the correct result.
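One way to debug it in isolation is to dump exactly what the verify task sees and evaluate the same expression in a throwaway debug task (a rough sketch; the json_query filter needs the jmespath Python library on the control node):

- name: Dump the node list as the verify task sees it
  command: k3s kubectl get nodes -o json
  register: nodes
  changed_when: false

- name: Count masters found by the same JMESPath expression
  debug:
    msg: >-
      {{ (nodes.stdout | from_json)['items']
         | json_query('[*].metadata.labels."node-role.kubernetes.io/master"')
         | count }} of {{ groups['master'] | length }} expected masters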
Yes, the playbook tries to verify that all masters joined the cluster. I suspect that they are each creating a 1-node cluster, but I don't know why. I've tried inside Vagrant VMs but I don't quite understand where the problem is at the moment.
@itwars I added you as a reviewer. I think the change is ready to merge now; if you could take a look I would greatly appreciate it.
On 2 fresh Ubuntu machines (for testing purposes with the embedded DB) I hit the 20-retry limit:
fatal: [rancher-02]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:00.174167", "end": "2021-03-16 23:18:28.384937", "rc": 0, "start": "2021-03-16 23:18:28.210770", "stderr": "", "stderr_lines": [], "stdout": "rancher-02", "stdout_lines": ["rancher-02"]}
fatal: [rancher-01]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:00.171841", "end": "2021-03-16 23:18:28.568724", "rc": 0, "start": "2021-03-16 23:18:28.396883", "stderr": "", "stderr_lines": [], "stdout": "rancher-01", "stdout_lines": ["rancher-01"]}
Any log or test you want me to try?
To debug, you need to access the logs of the k3s-init service. Running journalctl -ef -u k3s-init
while the playbook is running can give info on why it is failing. Also, running the reset playbook after a failed run is a good idea so you can start fresh.
I will add a quick message explaining how to debug errors if the verify task fails.
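Something along these lines could surface that hint automatically, e.g. wrapping a simplified version of the verification in a block with a rescue that prints the journalctl command (just a sketch; task names and retry values are made up):

- name: Verify that all nodes actually joined
  block:
    - name: Wait for all masters to register
      command: k3s kubectl get nodes -o json
      register: nodes
      until: nodes.rc == 0
      retries: 20
      delay: 10
      changed_when: false
  rescue:
    - name: Explain how to debug the failure
      fail:
        msg: >-
          k3s-init failed to form the cluster. Inspect it with
          'journalctl -u k3s-init -ef' on the failing master, and run the
          reset playbook before retrying.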
I'm cross-posting the result here from https://github.com/k3s-io/k3s-ansible/issues/32. Sorry for my bad English, I'm trying my best! Hope everything is clear.
Mar 17 23:15:12 Rancher-01 k3s[9900]: I0317 23:15:12.325203 9900 reconciler.go:319] Volume detached for volume "helm-traefik-token-fccln" (UniqueName: "kubernetes.io/secret/14abc0b9-dd4c-4068-832f-2e0c14dfebf2-helm-traefik-token-fccln") on node "rancher-01" DevicePath ""
Mar 17 23:15:12 Rancher-01 k3s[9900]: I0317 23:15:12.325209 9900 reconciler.go:319] Volume detached for volume "values" (UniqueName: "kubernetes.io/configmap/14abc0b9-dd4c-4068-832f-2e0c14dfebf2-values") on node "rancher-01" DevicePath ""
Mar 17 23:15:13 Rancher-01 k3s[9900]: W0317 23:15:13.044943 9900 pod_container_deletor.go:79] Container "2cded66bbca92e6097862fe9aa41fbc37b2b9ca6714d7032c6cf809b48eea7b5" not found in pod's containers
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.740228 9900 remote_runtime.go:332] ContainerStatus "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.740251 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113": rpc error: code = NotFound desc = an error occurred when try to find container "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.740607 9900 remote_runtime.go:332] ContainerStatus "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.740624 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92": rpc error: code = NotFound desc = an error occurred when try to find container "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.740862 9900 remote_runtime.go:332] ContainerStatus "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.740875 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270": rpc error: code = NotFound desc = an error occurred when try to find container "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741108 9900 remote_runtime.go:332] ContainerStatus "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741121 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e": rpc error: code = NotFound desc = an error occurred when try to find container "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741329 9900 remote_runtime.go:332] ContainerStatus "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741350 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534": rpc error: code = NotFound desc = an error occurred when try to find container "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741590 9900 remote_runtime.go:332] ContainerStatus "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741610 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d": rpc error: code = NotFound desc = an error occurred when try to find container "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741889 9900 remote_runtime.go:332] ContainerStatus "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741904 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d": rpc error: code = NotFound desc = an error occurred when try to find container "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d": not found
Mar 17 23:18:05 Rancher-01 systemd[1]: Stopping /usr/local/bin/k3s server --cluster-init...
Mar 17 23:18:05 Rancher-01 k3s[9900]: I0317 23:18:05.436838 9900 network_policy_controller.go:157] Shutting down network policies full sync goroutine
Mar 17 23:18:05 Rancher-01 k3s[9900]: {"level":"warn","ts":"2021-03-17T23:18:05.444Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {/run/k3s/containerd/containerd.sock 0 }. Err :connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory". Reconnecting..."}
@mattthhdp Normally k3s-ansible should be able to entirely replace the https://get.k3s.io script.
Tell me if I'm wrong but to sum up:
- k3s-ansible does not work on your VMs (both with and without the multi master support changes).
- Running the https://get.k3s.io script works to provision a 1 node cluster.
- You have not tried running a multi master cluster with the https://get.k3s.io script yet.
If that's true then I think we should create another issue to keep track of your problem. Since k3s-ansible does not work even without the changes in this PR, we should fix your problem separately from this PR. Would you mind creating a new issue summarizing what you did, with the full logs?
For point 1: the playbook, without multi-master support, works perfectly (with 1 master and 2 workers). If I reset (with the provided script) and try with 3 masters and 0 workers, it also works, but as expected, when I SSH into one of the masters and run kubectl get nodes, I only see the node I'm connected to.
2: Exactly (same as with the Ansible playbook without multi-master support).
3: If I try the playbook with multi-master support (1 master and 2 workers as above) I get an error. worker-journal.txt master-status.txt worker-status.txt master-journal.txt
I also see that when I run sudo kubectl get nodes, I don't see the etcd role.
Notice the error in the worker journal:
time="2021-03-28T17:02:13Z" level=fatal msg="flag needs an argument: -token"
EDIT: I will try multi-master with the https://get.k3s.io script and report back ASAP.
with the command curl -sfL https://get.k3s.io | K3S_TOKEN="SECRET" sh -s - --cluster-init
on master-node-1
and
curl -sfL https://get.k3s.io | K3S_TOKEN="SECRET" sh -s - --server https://ha1:6443
on master-node-2
jaune@ha1:~$ sudo kubectl get nodes
NAME   STATUS   ROLES                       AGE     VERSION
ha1    Ready    control-plane,etcd,master   2m40s   v1.20.4+k3s1
ha2    Ready    control-plane,etcd,master   12s     v1.20.4+k3s1
Everything works as it should.
PS: I have tried both the version in the group_vars (I think 1.17.5) and 1.20.4, and I have tried with and without the -K flag.
PPS: If you want I can give you access to my test VM. I really appreciate the time you are spending trying to help me.
@St0rmingBr4in ... I feel like ... the error was that I hadn't put anything into the k3s_token variable ... I'm really sorry! Maybe we should check if the variable is empty? XD Thank you again.
something like
tasks:
  - fail: msg="The variable 'k3s_token' is not defined or empty"
    when: (k3s_token is not defined) or (k3s_token | length == 0)
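An equivalent check with the assert module would also work (sketch):

- name: Ensure k3s_token is set to a non-empty value
  assert:
    that:
      - k3s_token is defined
      - k3s_token | length > 0
    fail_msg: "The variable 'k3s_token' is not defined or empty"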
@mattthhdp So you confirm it's working when k3s_token is defined? I will update the PR accordingly
Exactly.
I did some additional testing today: a 3-master, 3-node cluster with various OSes (Debian 10, Ubuntu 20.04, CentOS 8). Everything looks fine, I was not able to break anything.
Fedora 33 failed, but that is not really a target platform according to the readme 😉
I just finished testing on debian buster in HA mode with one slave and in single master mode with one slave. It works fine. I would say this is ready to merge.
I have some trouble here with CentOS 7:
TASK [k3s/master : Init cluster inside the transient k3s-init service] *********************************************************************************************************************************************************************
fatal: [192.168.8.101]: FAILED! => changed=true
cmd:
- systemd-run
- -p
- RestartSec=2
- -p
- Restart=on-failure
- --unit=k3s-init
- k3s
- server
- --cluster-init
- --token
- diefu4hei1Quei0VeemiT8Egh
delta: '0:00:00.003807'
end: '2021-04-02 00:38:45.161266'
msg: non-zero return code
rc: 1
start: '2021-04-02 00:38:45.157459'
stderr: |-
Unknown assignment RestartSec=2.
Failed to create bus message: No such device or address
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>
fatal: [192.168.8.102]: FAILED! => changed=true
cmd:
- systemd-run
- -p
- RestartSec=2
- -p
- Restart=on-failure
- --unit=k3s-init
- k3s
- server
- --server
- https://192.168.8.101:6443
- --token
- diefu4hei1Quei0VeemiT8Egh
delta: '0:00:00.003674'
end: '2021-04-02 00:38:45.205537'
msg: non-zero return code
rc: 1
start: '2021-04-02 00:38:45.201863'
stderr: |-
Unknown assignment RestartSec=2.
Failed to create bus message: No such device or address
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>
fatal: [192.168.8.103]: FAILED! => changed=true
cmd:
- systemd-run
- -p
- RestartSec=2
- -p
- Restart=on-failure
- --unit=k3s-init
- k3s
- server
- --server
- https://192.168.8.101:6443
- --token
- diefu4hei1Quei0VeemiT8Egh
delta: '0:00:00.003881'
end: '2021-04-02 00:38:45.230181'
msg: non-zero return code
rc: 1
start: '2021-04-02 00:38:45.226300'
stderr: |-
Unknown assignment RestartSec=2.
Failed to create bus message: No such device or address
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>
It's systemd 219 (systemd-219-78.el7_9.3.x86_64).
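A quick way to confirm that on every host before pointing the playbook at it is a throwaway check like this (sketch):

- name: Report the systemd version on each host
  command: systemctl --version
  register: systemd_version
  changed_when: false

- name: Show the first line, e.g. "systemd 219"
  debug:
    msg: "{{ systemd_version.stdout_lines[0] }}"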
Oh yeah this super old version of systemd... I'll try to make it work with that.
Thanks. Let me know if I can help/test/whatever. Unfortunately, as this is the "newest" systemd in CentOS 7, it may still be widely used.
I just tested HA mode with three Raspberry Pi 4 8GB RAM running Raspberry Pi OS 64-bit. Three masters and no workers. Works great, thank you!
So it turns out it is not possible to make things work with CentOS 7 without changing a lot of stuff. https://github.com/systemd/systemd/issues/4402
Hi,
I am getting the following error while trying to set up HA:
Jun 15 01:46:49 ip-172-30-1-13 k3s[8118]: time="2021-06-15T01:46:49.141005206Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Jun 15 01:46:50 ip-172-30-1-13 k3s[8118]: {"level":"warn","ts":"2021-06-15T01:46:50.161Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
Jun 15 01:46:51 ip-172-30-1-13 k3s[8118]: {"level":"warn","ts":"2021-06-15T01:46:51.162Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
Jun 15 01:46:52 ip-172-30-1-13 k3s[8118]: {"level":"warn","ts":"2021-06-15T01:46:52.853Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
Jun 15 01:46:54 ip-172-30-1-13 k3s[8118]: {"level":"warn","ts":"2021-06-15T01:46:54.141Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
Jun 15 01:46:55 ip-172-30-1-13 k3s[8118]: {"level":"warn","ts":"2021-06-15T01:46:55.141Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
Jun 15 01:46:55 ip-172-30-1-13 k3s[8118]: {"level":"warn","ts":"2021-06-15T01:46:55.175Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
My setup:
- Ubuntu 18.04
- trying to set up 3 master nodes
- all the nodes are reachable from each other
Here is the group_vars file:
---
k3s_version: v1.19.5+k3s1
ansible_user: ubuntu
systemd_dir: /etc/systemd/system
apiserver_endpoint: "{{ hostvars[groups['master'][0]]['ansible_host'] | default(groups['master'][0]) }}"
k3s_token: "mysupersecuretoken"
extra_server_args: ""
extra_agent_args: ""


@St0rmingBr4in Do I need to add extra arguments for this, or set up a separate etcd server outside the cluster for it to work properly?
@St0rmingBr4in Are there any updates? Also, is the current state of the PR working, or is there still some functionality missing?
@TannerGabriel This PR is working on all platforms except CentOS, where the version of systemd is too old. To make it work for CentOS we would need to rewrite it in a different way.
@St0rmingBr4in Thanks. I will probably try it out myself this weekend. Is the CentOS error the reason why the PR is not getting merged, or are there some other holdups (things that still need to be done)?
@TannerGabriel Yes this is the main blocker.
According to this article, CentOS 7 can now run systemd 231. Has anyone tried this?