[k3s-upgrade] k3s service failed to start after upgrade
Environmental Info:
K3s Version:
k3s version v1.23.4+k3s1 (43b1cb48)
go version go1.17.5
Node(s) CPU architecture, OS, and Version:
Linux 5.4.0-1056-raspi #63-Ubuntu aarch64 GNU/Linux
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal
Describe the bug: I tried to upgrade the k3s version of my cluster (master and worker nodes) by following this: k3s-upgrade
Steps To Reproduce:
```shell
kubectl apply -f https://raw.githubusercontent.com/rancher/system-upgrade-controller/master/manifests/system-upgrade-controller.yaml
# master nodes
kubectl label node <node-name> k3s-master-upgrade=true
# worker nodes
kubectl label node <node-name> k3s-worker-upgrade=true
# apply upgrade plans
kubectl apply -f agent.yml
kubectl apply -f server.yml
```
my plans:

server.yml

```yaml
# Server plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - key: k3s-master-upgrade
        operator: In
        values:
          - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.23.4+k3s1
```
agent.yml

```yaml
# Agent plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - key: k3s-worker-upgrade
        operator: In
        values:
          - "true"
  prepare:
    args:
      - prepare
      - server-plan
    image: rancher/k3s-upgrade
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.23.4+k3s1
```
Expected behavior:
All nodes upgrade successfully to k3s version 1.23.4+k3s1.

Actual behavior:
On the master node, the upgrade updated the k3s binary on the machine, but the k3s service failed to start.
Additional context / logs:
```
Mar 28 09:25:54 huey sh[3502]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Mar 28 09:25:54 huey sh[3508]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Mar 28 09:25:55 huey k3s[799]: time="2022-03-28T09:25:55Z" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": EOF"
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=info msg="Starting k3s v1.23.4+k3s1 (43b1cb48)"
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=info msg="Configuring sqlite3 database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=info msg="Configuring database table schema and indexes, this may take a moment..."
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=info msg="Database tables and indexes are up to date"
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=info msg="Kine available at unix://kine.sock"
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=fatal msg="starting kubernetes: preparing server: failed to normalize token; must be in format K10<CA-HASH>::<USERNAME>:<PASSWORD> or <PASS>
Mar 28 09:25:56 huey systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=fatal msg="starting kubernetes: preparing server: failed to normalize token; must be in format K10<CA-HASH>::<USERNAME>:<PASSWORD> or <PASS>
```
It looks like the --token value in your config file or systemd unit is in an invalid format. How have you specified it?
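For reference, the error message distinguishes two token shapes: the "secure" format with an embedded CA hash, and a plain short token. A rough, illustrative sanity check follows; this is not the exact validation k3s performs, and the token value is a made-up sample.

```shell
# Illustrative check of the two token shapes named in the error message.
# This is NOT k3s's actual validation logic; $token is a made-up sample.
token='K10deadbeef::server:abc123'
if printf '%s' "$token" | grep -Eq '^K10[0-9a-f]+::[^:]+:[^:]+$'; then
  echo "secure-format token (K10<CA-HASH>::<USERNAME>:<PASSWORD>)"
else
  echo "short token or malformed value"
fi
```

An empty or corrupted token file would fail a check like this, which matches the "failed to normalize token" fatal error in the logs above.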
I haven't changed the config file. Not sure if it got modified by the update process? I had to completely uninstall k3s and reinstall from scratch.
@brandond
Is there any way to figure out what the token should be, in case it got removed from the k3s/server/token file?
No, if you were not manually configuring the token, and all nodes with a copy of the token file have been lost, there is no way to recover the value with only a copy of the datastore.
Is it also stored in etcd (or sqlite by default on k3s)?
The bootstrap data (cluster CA certificates and such) are stored in the datastore, encrypted with the token as the key generation passphrase. The token value cannot be extracted from the datastore; that would render the encryption meaningless.
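Conceptually, the token acts as the passphrase from which the encryption key is derived. The sketch below illustrates that idea with openssl; it is not k3s's actual implementation, just a stand-in demonstrating why the datastore alone is useless without the token.

```shell
# Illustration only: passphrase-derived encryption as a stand-in for how
# bootstrap data is protected with the token. NOT k3s's real code path.
work=$(mktemp -d)
token='abc123'                                  # stand-in token
echo 'cluster CA material' > "$work/plain"
openssl enc -aes-256-cbc -pbkdf2 -pass "pass:$token" \
  -in "$work/plain" -out "$work/cipher"
# Without the right token, decryption fails:
openssl enc -d -aes-256-cbc -pbkdf2 -pass 'pass:wrong' \
  -in "$work/cipher" -out /dev/null 2>/dev/null \
  || echo 'decryption failed without the correct token'
# With the token, the plaintext comes back:
openssl enc -d -aes-256-cbc -pbkdf2 -pass "pass:$token" -in "$work/cipher"
```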
I deleted the k3s/server/token file from the filesystem and restarted the k3s systemd service. In my case, k3s was able to restore the contents of that file.
If you delete that file but the token is not specified elsewhere (in the config or on the CLI), then a new one will be generated on startup. This is most likely fine on single-server clusters, but it will cause problems when using etcd or an external SQL datastore.
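Given that, backing up the token file before an upgrade is cheap insurance. A minimal sketch follows; the real file lives at /var/lib/rancher/k3s/server/token, but a temp directory (and a made-up token value) stands in for it here so the snippet is self-contained.

```shell
# Sketch: back up the server token before upgrading, so it can be restored
# if the file is lost or emptied. A temp dir and made-up token stand in
# for the real /var/lib/rancher/k3s/server/token.
server_dir=$(mktemp -d)
echo 'K10deadbeef::server:abc123' > "$server_dir/token"   # stand-in token
cp "$server_dir/token" "$server_dir/token.backup"
cmp -s "$server_dir/token" "$server_dir/token.backup" && echo 'backup verified'
```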
I am indeed running a single-server cluster. Thanks for your explanation!
What about multi-node clusters? I ran into this issue while trying to upgrade an agent node from 1.22.6+k3s1 to the latest. Can I just grab the token from another node and force inject it during the upgrade? The weirdest part is that it's communicating with the cluster just fine.
@bramnet this issue has wandered a bit; I may need to lock it so that folks can open their own issues describing their individual problems. What is the exact message you're getting?
I was just trying to reproduce it again, and suddenly it's saying the node is up to date… not sure what happened here. All I remember is that it was very similar to what ac5tin had in the second-to-last line of their logs: level=fatal msg="starting kubernetes: preparing server: failed to normalize token; must be in format K10<CA-HASH>::<USERNAME>:<PASSWORD> or <PASS>. What's also weird is that Rancher isn't reflecting that they're up to date… I'll have to look into that.
I'm having the same issue on a single-node cluster. I noticed that /var/lib/rancher/k3s/server/token was recently written and is now empty.
Same here, using single-master mode on version v1.25.3+k3s1. I resolved this by deleting the empty file /var/lib/rancher/k3s/server/token.
I'm not aware of any paths in the k3s code that would cause it to write an empty token file. If anyone else runs into this, and can confirm that they are not using any automation or scripting to manage the content of that file, please open a new issue with steps that can help us reproduce this.