
Empty `traefik/provider.yaml` on worker nodes

sbidoul opened this issue 1 year ago • 17 comments

Summary

Worker nodes sometimes lose access to the API server and become Not Ready, likely during and after restarts of control plane nodes.

The symptom is that traefik/provider.yaml is present but empty. Restoring traefik/provider.yaml by copying it from another worker node and running snap restart microk8s is sufficient to recover the worker node.
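For example, a minimal recovery sketch (worker-2 is a placeholder for a healthy worker node that still has a good copy of the file):

# run on the broken worker node; "worker-2" is a placeholder hostname
scp worker-2:/var/snap/microk8s/current/args/traefik/provider.yaml /tmp/provider.yaml
sudo cp /tmp/provider.yaml /var/snap/microk8s/current/args/traefik/provider.yaml
sudo snap restart microk8s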

Reproduction Steps

We can't reproduce reliably but the problem occurs regularly (it did in 1.25, and persists after upgrading the cluster to 1.27). It seems to happen when we restart dqlite nodes, or when they are upgraded.

Here is the worker node log when it starts failing:

Sep 08 05:00:01 odoo-k8s-test-worker-3 microk8s.daemon-kubelite[2506]: E0908 05:00:01.609453    2506 controller.go:193] "Failed to update lease" err="Put \"https://127.0.0.1:16443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/odoo-k8s-test-worker-3?timeout=10s\":>
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]: 2023/09/08 05:00:24 updating endpoints
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]: 2023/09/08 05:00:24 Config file changed on disk, will restart proxy
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]: Error: proxy failed: failed to load configuration: empty list of control plane endpoints
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]: Usage:
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]:    apiserver-proxy [flags]
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]: Flags:
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]:   -h, --help                        help for apiserver-proxy
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]:       --kubeconfig string           path to kubeconfig file to use for updating list of known control plane nodes (default "/var/snap/microk8s/5891/credentials/kubelet.config")
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]:       --refresh-interval duration   refresh interval (default 30s)
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]:       --traefik-config string       path to apiserver proxy config file (default "/var/snap/microk8s/5891/args/traefik/traefik.yaml")

Introspection Report

An introspection report from the failing worker node is available if needed.

sbidoul commented Sep 08 '23

I have exactly the same issue. Fresh installation (v1.28.1) with 18 nodes. 3 nodes had an empty provider.yaml; copying it from another node and restarting helped.

i6-xx commented Sep 28 '23

Same for me with a fresh cluster (v1.28.1) with 3 master and 2 worker nodes. After joining the cluster, one of the workers had an empty /var/snap/microk8s/current/args/traefik/provider.yaml file. After copying its content from the other worker node and restarting the service with snap restart microk8s it started working.

costigator commented Oct 07 '23

I only have one worker node. ...traefik/provider.yaml is empty. I'm not entirely sure how this file is supposed to look.

Ah I see it, you provided one. Located in:

/var/snap/microk8s/current/args/traefik/provider-template.yaml

tcp:
  routers:
    Router-1:
      rule: "HostSNI(`*`)"
      service: "kube-apiserver"
      tls:
        passthrough: true
  services:
    kube-apiserver:
      loadBalancer:
        servers:
# APISERVERS
#      - address: "10.130.0.2:16443"
#      - address: "10.130.0.3:16443"
#      - address: "10.130.0.4:16443"

Workaround: just copy this over into the provider.yaml file in the same directory, then uncomment the servers, updating the addresses with the local IPs of the control-plane nodes on your network.
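A minimal sketch of that workaround on the worker node, using the placeholder addresses from the template (replace them with your control-plane IPs):

# write provider.yaml next to the template; the 10.130.0.x addresses are
# placeholders taken from the commented-out example above
sudo tee /var/snap/microk8s/current/args/traefik/provider.yaml >/dev/null <<'EOF'
tcp:
  routers:
    Router-1:
      rule: "HostSNI(`*`)"
      service: "kube-apiserver"
      tls:
        passthrough: true
  services:
    kube-apiserver:
      loadBalancer:
        servers:
        - address: "10.130.0.2:16443"
        - address: "10.130.0.3:16443"
        - address: "10.130.0.4:16443"
EOF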

then run

sudo snap stop microk8s
sudo snap start microk8s 

which should apply the config.

Worked for me :P

Per Documentation: https://microk8s.io/docs/configuring-services

snap.microk8s.daemon-traefik and snap.microk8s.daemon-apiserver-proxy
The traefik and apiserver-proxy daemons are used on worker nodes as a proxy to all API server control plane endpoints. The traefik daemon was replaced by the apiserver-proxy in 1.25+ releases.

The most significant configuration option for both daemons is the API server endpoints found in ${SNAP_DATA}/args/traefik/provider.yaml. For the apiserver-proxy daemon (1.25 onwards) the refresh frequency of the available control plane endpoints can be set in ${SNAP_DATA}/args/apiserver-proxy via the --refresh-interval parameter.
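For example, a sketch (it assumes the args file accepts one flag per line and the default snap paths):

# lower the endpoint refresh interval from the 30s default (flag format assumed)
echo "--refresh-interval=15s" | sudo tee -a /var/snap/microk8s/current/args/apiserver-proxy
sudo systemctl restart snap.microk8s.daemon-apiserver-proxy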

Mbd06b commented Oct 19 '23

Similar issue: the workers were suddenly reported as Not Ready when I restarted them, on v1.28.1. It turns out the provider.yaml was empty, but a bit late for me as I had already re-joined one of the worker nodes with a fresh microk8s snap package.

This seems like a serious issue for production environments.

pampie commented Oct 22 '23

Same situation during an upgrade from 1.27.6 to 1.28.2, on a worker node, after the upgrade of the dqlite nodes.

adrienpeiffer commented Oct 26 '23

Same here with a fresh installation of microk8s 1.28.3/stable.

It reappears randomly and only occurs on some worker nodes, while other worker nodes work fine.

I'll keep observing.

xinstein commented Nov 23 '23

I have the same situation, and updating provider.yaml with the API server endpoints did not fix the issue. The Kubernetes versions I used were 1.28.3 and 1.29.0.

I have a six node HA cluster.

microk8s status
microk8s is running
high-availability: yes
  datastore master nodes: 10.40.101.83:19001 10.40.101.185:19001 10.40.101.186:19001
  datastore standby nodes: 10.40.101.85:19001 10.40.101.128:19001 10.40.101.129:19001

When any of the datastore nodes is shut down, the other nodes move to NotReady status. This happens occasionally.

xaa@ha-02:~$ date
Fri 2 Feb 12:28:23 UTC 2024
xaa@ha-02:~$ kubectl get nodes -o wide
NAME    STATUS   ROLES    AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
ha-05   Ready    <none>   66m   v1.29.0   10.40.101.128   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
ha-01   Ready    <none>   83m   v1.29.0   10.40.101.83    <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
ha-06   Ready    <none>   61m   v1.29.0   10.40.101.129   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
ha-04   Ready    <none>   71m   v1.29.0   10.40.101.186   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
ha-03   Ready    <none>   75m   v1.29.0   10.40.101.185   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
ha-02   Ready    <none>   78m   v1.29.0   10.40.101.85    <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
xaa@ha-02:~$

Every 2.0s: kubectl get nodes                                    ha-03: Fri Feb 2 12:30:04 2024

NAME    STATUS     ROLES    AGE   VERSION
ha-03   Ready      <none>   76m   v1.29.0
ha-02   Ready      <none>   79m   v1.29.0
ha-06   NotReady   <none>   62m   v1.29.0
ha-01   NotReady   <none>   85m   v1.29.0
ha-04   NotReady   <none>   73m   v1.29.0
ha-05   NotReady   <none>   68m   v1.29.0

It takes 15 to 20 minutes to recover automatically, and the applications are not accessible during this window.

Restarting the dqlite service on the nodes that show "NotReady" can bring them back to the "Ready" state quickly:

systemctl stop snap.microk8s.daemon-k8s-dqlite; sleep 2; systemctl start snap.microk8s.daemon-k8s-dqlite

Any fix for this issue..?

DileepAP commented Feb 05 '24

I just want to bump this and keep this issue alive. I've had to manually reapply that provider.yaml on one worker node over a dozen times now over the last 7-8 months.

If anyone has any idea of the root cause of this, such as whether something else is causing it, that would be great to know.

Mbd06b commented Apr 23 '24

FWIW, this seems to happen occasionally on worker nodes when dqlite nodes reboot.

I found this code which manipulates provider.yaml, but it seems to be used only on join?

sbidoul commented Apr 23 '24

Regarding the configuration here: I KNOW it needs to be present on the worker node in order for the node to connect and show "Ready".

Does this provider.yaml need to be applied in the args on all the master nodes for it to persist to the worker node during resets? Or would this disrupt the functioning of the nodes on the control plane? Should I try it? Thoughts?

Mbd06b commented Apr 23 '24

Ok I think I fixed this. [Update 5-9-24] UGH no, I didn't fix it; it's as broken as ever. As a workaround, I've built a cronjob that checks that file every 5 minutes and, if the provider.yaml file has disappeared, puts a fresh copy in place and restarts microk8s on the worker node.
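A sketch of such a watchdog (not the actual script; the backup location and script path are assumptions of this example):

#!/bin/bash
# /usr/local/sbin/check-provider.sh (hypothetical path)
# Restore provider.yaml from a known-good copy if it is missing or empty.
PROVIDER=/var/snap/microk8s/current/args/traefik/provider.yaml
BACKUP=/root/provider.yaml.good   # assumed location of a known-good copy
if [ ! -s "$PROVIDER" ]; then
    cp "$BACKUP" "$PROVIDER"
    snap restart microk8s
fi

# /etc/cron.d/provider-watchdog (hypothetical): run the check every 5 minutes
*/5 * * * * root /usr/local/sbin/check-provider.sh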

The issue was that I had inconsistent traefik provider.yaml args on my control nodes (i.e., nodes on the control plane).

/var/snap/microk8s/current/args/traefik/provider.yaml

So I went through each of my control nodes (4) and applied the SAME provider.yaml on each control node, and then restarted microk8s from the master node.
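Roughly like this (the hostnames and the source file path are placeholders, not taken from this thread):

# push one known-good provider.yaml to every control node
# ("control-1" ... "control-4" are placeholder hostnames)
for host in control-1 control-2 control-3 control-4; do
  scp provider.yaml "$host:/tmp/provider.yaml"
  ssh "$host" 'sudo mv /tmp/provider.yaml /var/snap/microk8s/current/args/traefik/provider.yaml'
done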

Then I did a "sudo snap stop microk8s && sudo snap start microk8s" on the affected worker node.
The worker node got the provider.yaml as I defined it on the control nodes.

I'm hoping that this is going to be persistent. If I'm back here in two weeks I'll let you know if it worked.

Mbd06b commented May 08 '24

Hi, still got the same problem on v1.29.7: 14 servers, 6 of them control planes.

Ochita commented Aug 14 '24