k3s-ansible
Support HA mode with embedded DB
This enables initializing a cluster in HA mode with an embedded DB. https://rancher.com/docs/k3s/latest/en/installation/ha-embedded/
When multiple masters are specified in the master group, k3s-ansible will add the necessary flags during the initialization phase (i.e. --cluster-init and --server)
For the embedded HA mode to work, the k3s version must be >= v1.19.1.
Closes #32
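For reference, a minimal sketch of what an inventory with several masters could look like (hostnames and IPs are placeholders; the master/node group names follow the ones referenced throughout this thread, and the k3s_cluster parent group is assumed from the sample inventory). With more than one host in the master group, the first master is initialized with --cluster-init and the others join with --server https://<first-master>:6443:

[master]
master-0 ansible_host=10.0.0.10
master-1 ansible_host=10.0.0.11
master-2 ansible_host=10.0.0.12

[node]
worker-0 ansible_host=10.0.0.20

[k3s_cluster:children]
master
node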
Right now nodes register using a non-HA endpoint. Either this playbook needs to create such an endpoint, like when loadbalancer_apiserver_localhost
is used in kubespray, or it should ask the user to provide an external load balancer endpoint. https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ha-mode.md
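As an illustration of the second option, and assuming the apiserver_endpoint variable shown in the group_vars later in this thread stays around, pointing it at a VIP or external load balancer instead of the first master could look like this (the address is a placeholder):

apiserver_endpoint: "192.168.8.100"  # VIP or external load balancer fronting all masters on port 6443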
Is there anything in particular blocking this from being merged? I am finding that we need the HA functionality in our deployments. Let me know if there is anything I can do to help.
I think it's ready to merge but if you want to try it first I would be very interested in your feedback :)
I can give it a go. I invoke it as a dependency in an Ansible role, so it won't be tested exactly as is, but it could still be useful to see.
Okay 👌 Let me know how it goes :)
I tested the current version and it works fine when running the first time. Resetting the cluster with reset.yml and rerunning the playbook also works fine.
It does break when re-running the playbook after the cluster has been successfully set up. The "Verify that all nodes actually joined" task fails:
TASK [k3s/master : Verify that all nodes actually joined] ***********************************************
fatal: [master-0]: FAILED! => {"msg": "The conditional check 'nodes.rc == 0 and ((nodes.stdout | from_json)['items'] | json_query('[*].metadata.labels.\"node-role.kubernetes.io/master\"') | count) == (groups['master'] | length)' failed. The error was: Expecting value: line 1 column 1 (char 0)"}
fatal: [master-2]: FAILED! => {"msg": "The conditional check 'nodes.rc == 0 and ((nodes.stdout | from_json)['items'] | json_query('[*].metadata.labels.\"node-role.kubernetes.io/master\"') | count) == (groups['master'] | length)' failed. The error was: Expecting value: line 1 column 1 (char 0)"}
fatal: [master-1]: FAILED! => {"msg": "The conditional check 'nodes.rc == 0 and ((nodes.stdout | from_json)['items'] | json_query('[*].metadata.labels.\"node-role.kubernetes.io/master\"') | count) == (groups['master'] | length)' failed. The error was: Expecting value: line 1 column 1 (char 0)"}
I think the error is caused by the node(s) without the node-role.kubernetes.io/master label.
Any idea how to properly debug the until expression?
I tried pasting the output from the API into https://jmespath.org/ together with the JMESPath expression, but the website prints the correct result.
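One way to debug it in isolation is to dump exactly what the verify task sees and evaluate the same expression in a throwaway debug task (a rough sketch; the json_query filter needs the jmespath Python library on the control node):

- name: Dump the node list as the verify task sees it
  command: k3s kubectl get nodes -o json
  register: nodes
  changed_when: false

- name: Count masters found by the same JMESPath expression
  debug:
    msg: >-
      {{ (nodes.stdout | from_json)['items']
         | json_query('[*].metadata.labels."node-role.kubernetes.io/master"')
         | count }} of {{ groups['master'] | length }} expected masters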
Yes, the playbook tries to verify that all masters joined the cluster. I suspect that they are each creating a 1-node cluster, but I don't know why. I've tried inside Vagrant VMs but I don't quite understand where the problem is at the moment.
@itwars I added you as a reviewer. I think the change is ready to merge now; if you could take a look I would greatly appreciate it.
On 2 fresh Ubuntu machines (for testing purposes with the embedded DB) I hit the 20-retry limit:
fatal: [rancher-02]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:00.174167", "end": "2021-03-16 23:18:28.384937", "rc": 0, "start": "2021-03-16 23:18:28.210770", "stderr": "", "stderr_lines": [], "stdout": "rancher-02", "stdout_lines": ["rancher-02"]}
fatal: [rancher-01]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:00.171841", "end": "2021-03-16 23:18:28.568724", "rc": 0, "start": "2021-03-16 23:18:28.396883", "stderr": "", "stderr_lines": [], "stdout": "rancher-01", "stdout_lines": ["rancher-01"]}
Any log or test you want me to try?
To debug, you need to access the logs of the k3s-init service. Running journalctl -ef -u k3s-init
while the playbook is running can give info on why it is failing. Also, running the reset playbook after a failed run is a good idea so you can start fresh.
I will add a quick message explaining how to debug errors if the verify task fails.
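Something along these lines could surface that hint automatically, e.g. wrapping a simplified version of the verification in a block with a rescue that prints the journalctl command (just a sketch; task names and retry values are made up):

- name: Verify that all nodes actually joined
  block:
    - name: Wait for all masters to register
      command: k3s kubectl get nodes -o json
      register: nodes
      until: nodes.rc == 0
      retries: 20
      delay: 10
      changed_when: false
  rescue:
    - name: Explain how to debug the failure
      fail:
        msg: >-
          k3s-init failed to form the cluster. Inspect it with
          'journalctl -u k3s-init -ef' on the failing master, and run the
          reset playbook before retrying.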
I'm cross-posting the result here from https://github.com/k3s-io/k3s-ansible/issues/32. Sorry for my bad English, I'm trying my best! Hope everything is clear.
Mar 17 23:15:12 Rancher-01 k3s[9900]: I0317 23:15:12.325203 9900 reconciler.go:319] Volume detached for volume "helm-traefik-token-fccln" (UniqueName: "kubernetes.io/secret/14abc0b9-dd4c-4068-832f-2e0c14dfebf2-helm-traefik-token-fccln") on node "rancher-01" DevicePath ""
Mar 17 23:15:12 Rancher-01 k3s[9900]: I0317 23:15:12.325209 9900 reconciler.go:319] Volume detached for volume "values" (UniqueName: "kubernetes.io/configmap/14abc0b9-dd4c-4068-832f-2e0c14dfebf2-values") on node "rancher-01" DevicePath ""
Mar 17 23:15:13 Rancher-01 k3s[9900]: W0317 23:15:13.044943 9900 pod_container_deletor.go:79] Container "2cded66bbca92e6097862fe9aa41fbc37b2b9ca6714d7032c6cf809b48eea7b5" not found in pod's containers
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.740228 9900 remote_runtime.go:332] ContainerStatus "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.740251 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113": rpc error: code = NotFound desc = an error occurred when try to find container "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.740607 9900 remote_runtime.go:332] ContainerStatus "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.740624 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92": rpc error: code = NotFound desc = an error occurred when try to find container "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.740862 9900 remote_runtime.go:332] ContainerStatus "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.740875 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270": rpc error: code = NotFound desc = an error occurred when try to find container "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741108 9900 remote_runtime.go:332] ContainerStatus "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741121 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e": rpc error: code = NotFound desc = an error occurred when try to find container "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741329 9900 remote_runtime.go:332] ContainerStatus "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741350 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534": rpc error: code = NotFound desc = an error occurred when try to find container "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741590 9900 remote_runtime.go:332] ContainerStatus "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741610 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d": rpc error: code = NotFound desc = an error occurred when try to find container "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741889 9900 remote_runtime.go:332] ContainerStatus "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741904 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d": rpc error: code = NotFound desc = an error occurred when try to find container "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d": not found
Mar 17 23:18:05 Rancher-01 systemd[1]: Stopping /usr/local/bin/k3s server --cluster-init...
Mar 17 23:18:05 Rancher-01 k3s[9900]: I0317 23:18:05.436838 9900 network_policy_controller.go:157] Shutting down network policies full sync goroutine
Mar 17 23:18:05 Rancher-01 k3s[9900]: {"level":"warn","ts":"2021-03-17T23:18:05.444Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {/run/k3s/containerd/containerd.sock 0 }. Err :connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory". Reconnecting..."}
@mattthhdp Normally k3s-ansible should be able to entirely replace the https://get.k3s.io script.
Tell me if I'm wrong but to sum up:
- k3s-ansible does not work on your VMs (both with and without the multi master support changes).
- Running the https://get.k3s.io script works to provision a 1 node cluster.
- You have not tried running a multi master cluster with the https://get.k3s.io script yet.
If that's true then I think we should create another issue to keep track of your problem. Since k3s-ansible does not work even without the changes in this PR, we should fix your problem separately from this PR. Would you mind creating a new issue summarizing what you did, with the full logs?
For point 1: the playbook, without multi-master support, works perfectly (with 1 master and 2 workers). If I reset (with the provided script) and try with 3 masters and 0 workers, it also works, but as expected, when I SSH into one of the masters and run kubectl get nodes, I only see the node I'm connected to.
2: Exactly (same as with the Ansible playbook without multi-master support).
3: If I try the playbook with multi-master support (1 master and 2 workers as above) I get an error. worker-journal.txt master-status.txt worker-status.txt master-journal.txt
I also see that when I run sudo kubectl get nodes, I don't see the etcd role.
Notice the error in the worker journal:
time="2021-03-28T17:02:13Z" level=fatal msg="flag needs an argument: -token"
EDIT: I will try multi-master with the https://get.k3s.io script and report back ASAP.
with the command curl -sfL https://get.k3s.io | K3S_TOKEN="SECRET" sh -s - --cluster-init
on master-node-1
and
curl -sfL https://get.k3s.io | K3S_TOKEN="SECRET" sh -s - --server https://ha1:6443
on master-node-2
jaune@ha1:~$ sudo kubectl get nodes
NAME   STATUS   ROLES                       AGE     VERSION
ha1    Ready    control-plane,etcd,master   2m40s   v1.20.4+k3s1
ha2    Ready    control-plane,etcd,master   12s     v1.20.4+k3s1
Everything works as it should.
PS: I have tried both the version in the group_vars (I think 1.17.5) and 1.20.4, and I have tried with and without the -K flag.
PPS: If you want I can give you access to my test VM. I really appreciate the time you are spending trying to help me.
@St0rmingBr4in ... I feel like ... the error was that I hadn't put anything into the k3s_token variable ... I'm really sorry! Maybe we should check if the variable is empty? XD Thank you again.
something like
tasks:
  - fail: msg="The variable 'k3s_token' is not defined or empty"
    when: (k3s_token is not defined) or (k3s_token | length == 0)
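An equivalent check with the assert module would also work (sketch):

- name: Ensure k3s_token is set to a non-empty value
  assert:
    that:
      - k3s_token is defined
      - k3s_token | length > 0
    fail_msg: "The variable 'k3s_token' is not defined or empty"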
@mattthhdp So you confirm it's working when k3s_token is defined? I will update the PR accordingly
Exactly.
I did some additional testing today: a 3-master, 3-node cluster with various OSes (Debian 10, Ubuntu 20.04, CentOS 8). Everything looks fine, I was not able to break anything.
Fedora 33 failed, but that is not really a target platform according to the readme 😉
I just finished testing on debian buster in HA mode with one slave and in single master mode with one slave. It works fine. I would say this is ready to merge.
I have some trouble here with CentOS 7:
TASK [k3s/master : Init cluster inside the transient k3s-init service] *********************************************************************************************************************************************************************
fatal: [192.168.8.101]: FAILED! => changed=true
cmd:
- systemd-run
- -p
- RestartSec=2
- -p
- Restart=on-failure
- --unit=k3s-init
- k3s
- server
- --cluster-init
- --token
- diefu4hei1Quei0VeemiT8Egh
delta: '0:00:00.003807'
end: '2021-04-02 00:38:45.161266'
msg: non-zero return code
rc: 1
start: '2021-04-02 00:38:45.157459'
stderr: |-
Unknown assignment RestartSec=2.
Failed to create bus message: No such device or address
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>
fatal: [192.168.8.102]: FAILED! => changed=true
cmd:
- systemd-run
- -p
- RestartSec=2
- -p
- Restart=on-failure
- --unit=k3s-init
- k3s
- server
- --server
- https://192.168.8.101:6443
- --token
- diefu4hei1Quei0VeemiT8Egh
delta: '0:00:00.003674'
end: '2021-04-02 00:38:45.205537'
msg: non-zero return code
rc: 1
start: '2021-04-02 00:38:45.201863'
stderr: |-
Unknown assignment RestartSec=2.
Failed to create bus message: No such device or address
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>
fatal: [192.168.8.103]: FAILED! => changed=true
cmd:
- systemd-run
- -p
- RestartSec=2
- -p
- Restart=on-failure
- --unit=k3s-init
- k3s
- server
- --server
- https://192.168.8.101:6443
- --token
- diefu4hei1Quei0VeemiT8Egh
delta: '0:00:00.003881'
end: '2021-04-02 00:38:45.230181'
msg: non-zero return code
rc: 1
start: '2021-04-02 00:38:45.226300'
stderr: |-
Unknown assignment RestartSec=2.
Failed to create bus message: No such device or address
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>
It's systemd 219 (systemd-219-78.el7_9.3.x86_64).
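A quick way to confirm that on every host before pointing the playbook at it is a throwaway check like this (sketch):

- name: Report the systemd version on each host
  command: systemctl --version
  register: systemd_version
  changed_when: false

- name: Show the first line, e.g. "systemd 219"
  debug:
    msg: "{{ systemd_version.stdout_lines[0] }}"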
Oh yeah this super old version of systemd... I'll try to make it work with that.
Thanks. Let me know if I can help/test/whatever. Unfortunately, as this is the "newest" systemd in CentOS 7, it may still be widely used.
I just tested HA mode with three Raspberry Pi 4 8GB RAM running Raspberry Pi OS 64-bit. Three masters and no workers. Works great, thank you!
So it turns out it is not possible to make things work with CentOS 7 without changing a lot of stuff. https://github.com/systemd/systemd/issues/4402
Hi,
I am getting the following error while trying to set up HA:
Jun 15 01:46:49 ip-172-30-1-13 k3s[8118]: time="2021-06-15T01:46:49.141005206Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Jun 15 01:46:50 ip-172-30-1-13 k3s[8118]: {"level":"warn","ts":"2021-06-15T01:46:50.161Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
Jun 15 01:46:51 ip-172-30-1-13 k3s[8118]: {"level":"warn","ts":"2021-06-15T01:46:51.162Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
Jun 15 01:46:52 ip-172-30-1-13 k3s[8118]: {"level":"warn","ts":"2021-06-15T01:46:52.853Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
Jun 15 01:46:54 ip-172-30-1-13 k3s[8118]: {"level":"warn","ts":"2021-06-15T01:46:54.141Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
Jun 15 01:46:55 ip-172-30-1-13 k3s[8118]: {"level":"warn","ts":"2021-06-15T01:46:55.141Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
Jun 15 01:46:55 ip-172-30-1-13 k3s[8118]: {"level":"warn","ts":"2021-06-15T01:46:55.175Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
My setup:
- Ubuntu 18.04
- trying to set up 3 master nodes
- all the nodes are reachable from each other
Here is the group_vars file:
---
k3s_version: v1.19.5+k3s1
ansible_user: ubuntu
systemd_dir: /etc/systemd/system
apiserver_endpoint: "{{ hostvars[groups['master'][0]]['ansible_host'] | default(groups['master'][0]) }}"
k3s_token: "mysupersecuretoken"
extra_server_args: ""
extra_agent_args: ""


@St0rmingBr4in Do I need to add extra arguments for this, or set up a separate etcd server outside the cluster for it to work properly?
@St0rmingBr4in Are there any updates? Also, is the current state of the PR working, or is there still some functionality missing?
@TannerGabriel This PR is working on all platforms except CentOS, where the version of systemd is too old. To make it work for CentOS we would need to rewrite it in a different way.
@St0rmingBr4in Thanks. I will probably try it out myself this weekend. Is the CentOS error the reason why the PR is not getting merged, or are there some other holdups (things that still need to be done)?
@TannerGabriel Yes this is the main blocker.
According to this article, CentOS 7 can now run systemd 231. Has anyone tried this?