
[k3s-upgrade] k3s service failed to start after upgrade

Open ac5tin opened this issue 2 years ago • 12 comments

Environmental Info: K3s Version:

k3s version v1.23.4+k3s1 (43b1cb48)
go version go1.17.5

Node(s) CPU architecture, OS, and Version:

5.4.0-1056-raspi #63-Ubuntu
aarch64 aarch64 aarch64 GNU/Linux

Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal

Describe the bug: I tried to upgrade the k3s version of my cluster (master node and worker nodes) by following this: k3s-upgrade

Steps To Reproduce:

kubectl apply -f https://raw.githubusercontent.com/rancher/system-upgrade-controller/master/manifests/system-upgrade-controller.yaml

# master nodes
kubectl label node <node-name> k3s-master-upgrade=true
# worker nodes
kubectl label node <node-name> k3s-worker-upgrade=true

# apply upgrade plan
kubectl apply -f agent.yml
kubectl apply -f server.yml

my plans: server.yml

# Server plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: k3s-master-upgrade
      operator: In
      values:
      - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.23.4+k3s1

agent.yml

# Agent plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: k3s-worker-upgrade
      operator: In
      values:
      - "true"
  prepare:
    args:
    - prepare
    - server-plan
    image: rancher/k3s-upgrade
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.23.4+k3s1
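
To confirm the controller picked up the plans and scheduled upgrade jobs, a quick check like this can help (a sketch, assuming the default system-upgrade namespace):

# check that both plans exist and that upgrade jobs/pods were created
kubectl -n system-upgrade get plans
kubectl -n system-upgrade get jobs,pods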

Expected behavior: All nodes to upgrade successfully to k3s version 1.23.4+k3s1

Actual behavior: On the master node, the k3s binary was updated on the machine, but the k3s service failed to start.

Additional context / logs:

Mar 28 09:25:54 huey sh[3502]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Mar 28 09:25:54 huey sh[3508]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Mar 28 09:25:55 huey k3s[799]: time="2022-03-28T09:25:55Z" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": EOF"
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=info msg="Starting k3s v1.23.4+k3s1 (43b1cb48)"
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=info msg="Configuring sqlite3 database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=info msg="Configuring database table schema and indexes, this may take a moment..."
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=info msg="Database tables and indexes are up to date"
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=info msg="Kine available at unix://kine.sock"
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=fatal msg="starting kubernetes: preparing server: failed to normalize token; must be in format K10<CA-HASH>::<USERNAME>:<PASSWORD> or <PASS>
Mar 28 09:25:56 huey systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE

ac5tin avatar Mar 28 '22 09:03 ac5tin

Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=fatal msg="starting kubernetes: preparing server: failed to normalize token; must be in format K10<CA-HASH>::<USERNAME>:<PASSWORD> or <PASS>

It looks like the --token value in your config file or systemd unit is in an invalid format. How have you specified it?
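
For reference, these are the usual places to look (a sketch, assuming a default install; paths may differ on your setup):

# token passed explicitly via config file or systemd unit
sudo cat /etc/rancher/k3s/config.yaml   # may contain a "token:" entry
sudo systemctl cat k3s                  # look for a --token flag or K3S_TOKEN env var
# token generated automatically on first start
sudo cat /var/lib/rancher/k3s/server/token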

brandond avatar Mar 28 '22 17:03 brandond

I haven't changed the config file. Not sure if it got modified by the update process? I had to completely uninstall k3s and reinstall from scratch.

ac5tin avatar Apr 07 '22 08:04 ac5tin

@brandond Is there any way to figure out what the token should be in case it got removed from the k3s/server/token file?

vvanouytsel avatar Aug 17 '22 14:08 vvanouytsel

No, if you were not manually configuring the token, and all nodes with a copy of the token file have been lost, there is no way to recover the value with only a copy of the datastore.

brandond avatar Aug 17 '22 16:08 brandond

No, if you were not manually configuring the token, and all nodes with a copy of the token file have been lost, there is no way to recover the value with only a copy of the datastore.

Is it also stored in etcd (or sqlite by default on k3s)?

vvanouytsel avatar Aug 18 '22 12:08 vvanouytsel

The bootstrap data (cluster CA certificates and such) are stored in the datastore, encrypted with the token as the key generation passphrase. The token value cannot be extracted from the datastore; that would render the encryption meaningless.
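
On a default single-server install you can list (but not decrypt) those bootstrap entries straight from the sqlite datastore; a sketch, assuming the default database path and kine's table layout:

# list the encrypted bootstrap keys stored by kine; the values remain encrypted
sudo sqlite3 /var/lib/rancher/k3s/server/db/state.db \
  "select name from kine where name like '/bootstrap/%';"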

brandond avatar Aug 19 '22 17:08 brandond

I deleted the k3s/server/token file from the filesystem and restarted the k3s systemd service. In my case k3s was able to restore the contents of that file.

vvanouytsel avatar Aug 22 '22 08:08 vvanouytsel

If you delete that file but the token is not specified elsewhere (in the config or on the CLI), then a new one will be generated on startup. This is most likely fine on single-server clusters, but it will cause problems when using etcd or an external SQL datastore.
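
If you want the value to be stable and shared across servers, one approach is to pin the token explicitly before restarting; a sketch with a placeholder token value and default paths:

# pin the token so every server uses the same value ("example-cluster-token" is a placeholder)
sudo tee -a /etc/rancher/k3s/config.yaml <<'EOF'
token: "example-cluster-token"
EOF
sudo systemctl restart k3s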

brandond avatar Aug 22 '22 18:08 brandond

I am indeed running a single-server cluster. Thanks for your explanation!

vvanouytsel avatar Aug 22 '22 18:08 vvanouytsel

What about multi-node clusters? I ran into this issue while trying to upgrade an agent node from 1.22.6+k3s1 to the latest. Can I just grab the token from another node and force inject it during the upgrade? The weirdest part is that it's communicating with the cluster just fine.
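
For context, this is roughly what I had in mind (a sketch, assuming default install paths; the token value comes from the server):

# on a server node: read the join token
sudo cat /var/lib/rancher/k3s/server/node-token

# on the agent: supply that value, then restart
echo 'token: "<value from the server>"' | sudo tee -a /etc/rancher/k3s/config.yaml
sudo systemctl restart k3s-agent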

bramnet avatar Aug 23 '22 01:08 bramnet

@bramnet this issue has wandered a bit; I may need to lock it so that folks can open their own issues describing their individual problems. What is the exact message you're getting?

brandond avatar Aug 23 '22 03:08 brandond

I was just trying again to reproduce it, and suddenly it's saying the node is up to date… not sure what happened here. All I remember is that it was very similar to what ac5tin had in the second-to-last line of their logs: level=fatal msg="starting kubernetes: preparing server: failed to normalize token; must be in format K10<CA-HASH>::<USERNAME>:<PASSWORD> or <PASS>. What's also weird is that Rancher isn't reflecting that they're up to date… I'll have to look into that.

bramnet avatar Aug 23 '22 04:08 bramnet

I'm having the same issue on a single node cluster. I noticed that /var/lib/rancher/k3s/server/token has recently been written and is now empty.

RaphaelKimmig avatar Nov 02 '22 07:11 RaphaelKimmig

Same here using single-master mode, version v1.25.3+k3s1. I resolved this by deleting the empty file /var/lib/rancher/k3s/server/token.
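
Concretely, that boiled down to something like this (assuming a default single-server install, where a new token is regenerated on restart):

# remove the empty token file and let k3s regenerate it on restart
sudo rm /var/lib/rancher/k3s/server/token
sudo systemctl restart k3s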

ryan4yin avatar Nov 10 '22 18:11 ryan4yin

I'm not aware of any paths in the k3s code that would cause it to write an empty token file. If anyone else runs into this, and can confirm that they are not using any automation or scripting to manage the content of that file, please open a new issue with steps that can help us reproduce this.

brandond avatar Jun 13 '23 17:06 brandond