
ServiceAccounts for several different apps intermittently rejected with "invalid bearer token" error by kube-apiserver

Open · v1nsai opened this issue 1 year ago • 6 comments

Bug Report

Description

I have observed several apps intermittently getting 401 errors from kube-apiserver. What confuses me most is that the problem is intermittent; from an invalid token I would expect either 100% failure or no failures at all.

For example, after installing homepage with default settings outlined by the docs and enabling metrics-server as per the Talos docs, the logs for homepage frequently show Bearer token issues like the following:

[2024-08-11T15:38:03.165Z] [31merror[39m: <kubernetes-widget> Error getting metrics, ensure you have metrics-server installed: s {"response":{"statusCode":401,"body":"{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"Unauthorized\",\"reason\":\"Unauthorized\",\"code\":401}\n","headers":{"audit-id":"1ebd468d-bcaf-4932-82ae-2984e514b3f3","cache-control":"no-cache, private","content-type":"application/json","date":"Sun, 11 Aug 2024 15:38:03 GMT","content-length":"129","connection":"close"},"request":{"uri":{"protocol":"https:","slashes":true,"auth":null,"host":"10.96.0.1:443","port":"443","hostname":"10.96.0.1","hash":null,"search":null,"query":null,"pathname":"/apis/metrics.k8s.io/v1beta1/nodes","path":"/apis/metrics.k8s.io/v1beta1/nodes","href":"https://10.96.0.1:443/apis/metrics.k8s.io/v1beta1/nodes"},"method":"GET","headers":{"Authorization":"Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IjI0cVZxYkI1Rlo0YVVSaGVvbWlQaFJTUmVfTVdBX1Z2LV9WdTFkTFAtcTgifQ.eyJhdWQiOlsiaHR0cHM6Ly8xOTIuMTY4LjEuMTcwOjY0NDMiXSwiZXhwIjoxNzU0OTI2NjQxLCJpYXQiOjE3MjMzOTA2NDEsImlzcyI6Imh0dHBzOi8vMTkyLjE2OC4xLjE3MDo2NDQzIiwianRpIjoiOTA3YmMyNmEtZTZiZi00OTMzLWEwMWYtNTE1M2FiNjhkNzZlIiwia3ViZXJuZXRlcy5pbyI6eyJuYW1lc3BhY2UiOiJob21lcGFnZSIsIm5vZGUiOnsibmFtZSI6ImJpZ3JpZyIsInVpZCI6IjFiYmY2NzVlLWNlODEtNDc2Ny1iOGI1LTlkOGM4Y2IwMWJhMyJ9LCJwb2QiOnsibmFtZSI6ImhvbWVwYWdlLTdiYjY3NmNiNWQtN3d0aGMiLCJ1aWQiOiJmNzI5ODQ0OS0yYzg5LTQ2ZTMtYmMxNy1kMGU3YmMxYzMyNjQifSwic2VydmljZWFjY291bnQiOnsibmFtZSI6ImhvbWVwYWdlIiwidWlkIjoiZGRlMzExMzAtMmFiNi00MTE0LTgwYjgtMjVkOThmYzgxMmQ1In0sIndhcm5hZnRlciI6MTcyMzM5NDI0OH0sIm5iZiI6MTcyMzM5MDY0MSwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50OmhvbWVwYWdlOmhvbWVwYWdlIn0.nUVIw2zItBYjOkmQr2S7ImGPNFjauzSZ4NDNFuNo9Z5i59x2f2T-260FoswFW5nsjWQl89NllTfqF0m6PhXD0iCJgnI5MwZWxZUnh6Sv_OD1k-axHB490jb7Zba6249EaB1dHhMoaCAaDt2Q2XTrR83ePLx8wknSISl_ntryvRZ4IgONWFLIeONNlw5g1fVBabff5eDdMRQGYszF3yTL645Sp3gtM3rOAdRbLeowFqwYk0-PSXE4wG5uFRY2CMejRBfXBOyyACQoTjMBEqbeKXWnQAenpL6CjAg3Qfh7psiBS0t8BQNC103ptyQhsn0LwNFL2z1bK-An_P_iYkc-kOcaX7EOdkdsEL8rLxM2gQBcNsqN45Uj1xvrnHmk1FfiuDcfAGVCs5YBE6HZgFn2h68i8Ih_RODzgAYfhQwcdTMEn8pInieBYeh_daUSkWi15eylbarkwRlm38V5NY7re-RX7MlHiXcT20iqi6vOHKiGnbpW9XQtSjxKDzwHgK5J-nZJzp0YHer6BTFh6UVdIIMJU9NB6LOQka1zyHqi__K7POW68u4Uzfzl1xTpyVa0g9gtMb5wfJFjcF37uAaLqEWOlUVtqeIBIcg8wv1H9xS5OKLxryhvBbmq428QDgLVWngzM-aediTbC4BpgsNaO4wQQ2umlFJbQNUf2HHCuR0"}}},"body":{"apiVersion":"v1","code":401,"kind":"Status","message":"Unauthorized","metadata":{},"reason":"Unauthorized","status":"Failure"},"statusCode":401,"name":"HttpError"}
[2024-08-11T15:38:09.256Z] error: <kubernetes-widget> Error getting ingresses: 401 {
  kind: 'Status',
  apiVersion: 'v1',
  metadata: {},
  status: 'Failure',
  message: 'Unauthorized',
  reason: 'Unauthorized',
  code: 401
} IncomingMessage {
  _readableState: [ReadableState],
  _events: [Object: null prototype],
  _eventsCount: 4,
  _maxListeners: undefined,
  socket: [TLSSocket],
  httpVersionMajor: 1,
  httpVersionMinor: 1,
  httpVersion: '1.1',
  complete: true,
  rawHeaders: [Array],
  rawTrailers: [],
  joinDuplicateHeaders: undefined,
  aborted: false,
  upgrade: false,
  url: '',
  method: null,
  statusCode: 401,
  statusMessage: 'Unauthorized',
  client: [TLSSocket],
  _consuming: false,
  _dumped: false,
  req: [ClientRequest],
  request: [Request],
  toJSON: [Function: responseToJSON],
  caseless: [Caseless],
  body: [Object],
  [Symbol(kCapture)]: false,
  [Symbol(kHeaders)]: [Object],
  [Symbol(kHeadersCount)]: 12,
  [Symbol(kTrailers)]: null,
  [Symbol(kTrailersCount)]: 0
}

I have confirmed that the JWT has not expired, and the token works at least some of the time, so it appears to be valid and should not be getting 401s.
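For anyone reproducing, here's a rough sketch for minting a token for the same service account and decoding its payload to double-check the exp/iss/aud claims (the homepage namespace and service account names are the ones visible in the decoded token above; exact commands may vary):

# Sketch: mint a fresh token for the same service account and inspect its claims.
# Namespace/SA "homepage" are taken from the decoded token above.
TOKEN=$(kubectl -n homepage create token homepage)
echo "$TOKEN" | cut -d. -f2 | python3 -c '
import base64, json, sys
seg = sys.stdin.read().strip()
seg += "=" * (-len(seg) % 4)  # restore base64url padding stripped from JWT segments
print(json.dumps(json.loads(base64.urlsafe_b64decode(seg)), indent=2))
'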

I observed this behavior on Talos 1.7.6 and downgraded to 1.7.5 to try to solve it, but I'm still seeing the issue.

Logs

The following is repeated about every 20 seconds in the kube-apiserver logs:

E0811 15:49:19.574078       1 authentication.go:73] "Unable to authenticate the request" err="invalid bearer token"

Environment

  • Talos version: [talosctl version --nodes <problematic nodes>]

    Client:
        Tag:         v1.7.5
        SHA:         47731624
        Built:
        Go version:  go1.22.4
        OS/Arch:     darwin/arm64
    Server:
        NODE:        192.168.1.162
        Tag:         v1.7.6
        SHA:         ae67123a
        Built:
        Go version:  go1.22.5
        OS/Arch:     linux/amd64
        Enabled:     RBAC
        NODE:        192.168.1.170
        Tag:         v1.7.5
        SHA:         47731624
        Built:
        Go version:  go1.22.4
        OS/Arch:     linux/amd64
        Enabled:     RBAC
        NODE:        192.168.1.155
        Tag:         v1.7.5
        SHA:         47731624
        Built:
        Go version:  go1.22.4
        OS/Arch:     linux/amd64
        Enabled:     RBAC

  • Kubernetes version: [kubectl version --short]

    Client Version: v1.30.1
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.30.1

  • Platform:

v1nsai commented Aug 11 '24 16:08

One thing to check is that the time is in sync across the control plane machines, and that they all have matching PKI (they should, if you generated the config once and applied it to all of them).

Also, the kube-apiserver logs say why a request was rejected; the reason is sometimes hard to find, but it is there.
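For example, time sync can be compared across the control plane nodes with something like the following (a sketch using the node IPs from this issue; large clock skews should be easy to spot in the output):

# Sketch: compare the reported time across all three control plane nodes
talosctl time \
  --nodes 192.168.1.155,192.168.1.162,192.168.1.170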

smira commented Aug 12 '24 18:08

hey @smira thanks for taking a look at this, I'm tearing my hair out trying to figure out what's happening.

I wiped my cluster and had no issues with a single node, but hit the same issue as soon as I added a second node. Both nodes are using nearly identical machine config patches on top of the defaults:

machine:
  install:
    disk: /dev/sdc
    wipe: true
  network:
    hostname: nodename
    interfaces:
      - interface: eno1
        dhcp: true
        vip: 
          ip: 192.168.1.133
  kubelet:
    extraMounts:
      - destination: /var/mnt/disk0
        type: bind
        source: /var/mnt/disk0
        options:
          - bind
          - rshared
          - rw
  disks:
    - device: /dev/sda
      partitions:
        - mountpoint: /var/mnt/disk0
  certSANs:
    - 192.168.1.155 # node1
    - 192.168.1.162 # node2
    - 192.168.1.170 # node3
    - 192.168.1.133 # vip address
cluster:
  apiServer:
    certSANs:
    - 192.168.1.155 # node1
    - 192.168.1.162 # node2
    - 192.168.1.170 # node3
    - 192.168.1.133 # vip address
  allowSchedulingOnControlPlanes: true

Time syncing is working fine; I even tried adding some NTP servers physically close to me, but the errors persisted. I also checked the logs for controller-manager, etcd, and apid; all look fine, no errors. The PKI all matches across the machine configs. The only place I'm seeing any errors is in the kube-apiserver-nodename logs and in the logs of whatever app is using the ServiceAccount.

I ran Ubuntu and microk8s on this same hardware for several months without this issue, so a hardware problem is unlikely, but I suppose I should check again to see if anything has recently gone wrong.

ChatGPT and I are both stumped.

v1nsai commented Aug 13 '24 15:08

I don't think this is related to Talos in any way, as Talos doesn't do anything in this path; it's purely a Kubernetes thing at this point.

It's hard for me to guess; you could probably increase kube-apiserver log verbosity via extraArgs and capture more logs?

But I guess you need to find out why the token got rejected; after that, things will be easier.
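As a sketch, something along these lines (the verbosity level is arbitrary; the patch would go through the usual talosctl apply-config --config-patch or talosctl edit machineconfig flow):

cluster:
  apiServer:
    extraArgs:
      v: "6" # passed to kube-apiserver as --v=6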

smira commented Aug 13 '24 15:08

Running talosctl health in general also makes sense to ensure you don't have etcd split-brain or something similar.
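For instance (a sketch using the node IPs from this thread; if the etcd member list differs between nodes, that points at a split-brain):

# Overall cluster health, then each node's view of the etcd membership
talosctl health \
  --nodes 192.168.1.170
talosctl etcd members \
  --nodes 192.168.1.155,192.168.1.162,192.168.1.170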

smira commented Aug 13 '24 15:08

talosctl health comes back OK on all nodes too. I kicked up the apiserver log verbosity but I don't see anything that looks related happening before or after the "invalid bearer token" errors like I pasted in the OP. Here's a gist with the log output if anyone wants to make sure I'm not blind.

I spun up Ubuntu + k3s on all my nodes and am having zero issues with auth or time syncing, so it's not a hardware issue either.

I guess I'll close this issue. I start seeing problems with near-default Talos installations as soon as a second node is added, which means it MUST be something on my end, or else everyone with a cluster of two or more nodes would be seeing it 🙃

v1nsai commented Aug 14 '24 15:08

~~Removing the extra certSANs from the config I posted above has fixed this issue.~~

~~BUT WHY? This was the last culprit I suspected in my config, why would adding extra IP addresses to the cert hurt anything? My guess is that including a certSANs object causes it to overwrite addresses that would otherwise be added by the default certSANs object?~~

~~Will dig into the code more but this doesn't seem like expected behavior of the certSANs objects~~

edit: false alarm, it started happening again.

To be totally unambiguous, here's exactly what I'm doing:

talosctl gen config \
  --with-secrets path/to/pre-generated/secrets.yaml \
  --output-types controlplane \
  --output /tmp/node1.yaml \
  --force \
  talos-homelab https://192.168.1.170:6443
talosctl apply-config \
  --insecure \
  --nodes 192.168.1.170 \
  --file /tmp/node1.yaml \
  --config-patch @path/to/install-patch.yaml

And here's install-patch.yaml:

machine:
  install:
    disk: /dev/nvme0n1
    wipe: true
  network:
    hostname: node1
cluster:
  allowSchedulingOnControlPlanes: true
  apiServer:
    extraArgs:
      v: "10" # kube-apiserver verbosity all the way up

Then I bootstrap the cluster and fetch a kubeconfig:

talosctl bootstrap \
  --nodes 192.168.1.170 \
  --endpoints 192.168.1.170
talosctl kubeconfig \
  --nodes 192.168.1.170 \
  --endpoints 192.168.1.170 

After repeating these steps for the second node (minus the bootstrapping) and installing an app that uses a ServiceAccount (gethomepage in my case), I immediately start getting "invalid bearer token" errors. Even with apiserver verbosity turned all the way up, I get nothing other than "invalid bearer token" in the logs.

v1nsai commented Aug 30 '24 17:08

I've encountered a similar issue while I was setting up my k8s cluster with Talos. I have a multi-master HA config, and the problem was related to the cluster configuration, specifically the controlPlane.endpoint setting. The solution was to set controlPlane.endpoint to the VIP address instead of a specific node's IP (I couldn't find any related docs about this). Here's my corrected configuration:

cluster:
  id: <id>
  secret: <secret>

  controlPlane:
    endpoint: https://192.168.1.60:6443 # <- here

c1sc0 commented Sep 05 '24 20:09

Yes, you should never use different cluster endpoints across control plane nodes (that doesn't make much sense).
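In other words, every node's config should be generated against the same cluster endpoint. A sketch reusing the gen config command from earlier in this thread, with the VIP 192.168.1.133 as the single shared endpoint:

talosctl gen config \
  --with-secrets path/to/pre-generated/secrets.yaml \
  --output-types controlplane \
  --output /tmp/controlplane.yaml \
  --force \
  talos-homelab https://192.168.1.133:6443

Per-node differences (hostname, install disk, VIP interface) then go into the --config-patch rather than into the endpoint.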

smira commented Sep 06 '24 10:09

I'm getting this same problem with one of my apps calling the API server: the certificate that was issued looks valid, but I'm still getting "invalid bearer token" on the API server. I'm using the VIP as the control plane endpoint in all the node configurations, so I'm honestly a bit stumped as to why this is giving me 401s.

I've recently nuked my cluster and re-bootstrapped Talos (1.5.x) and Kubernetes (1.28) to the latest versions; this has suddenly become a problem I hadn't had before.

sillock1 commented Sep 06 '24 12:09

Oh hell, that's exactly what I did. It looks like I misread this part of the docs, which I now see specifically refers to the talosconfig.

Will confirm later today

v1nsai commented Sep 06 '24 13:09

This was definitely the problem. On one of my nodes I had missed one of the cluster endpoints; it was still pointing at my old HAProxy load balancer rather than the VIP, which was causing intermittent 401s whenever a request hit that particular node.
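For anyone checking for the same thing, the endpoint each node actually has applied can be read back and compared, along these lines (a sketch; it assumes a Talos version that exposes the applied machine config as a resource via talosctl get machineconfig, and it reuses the node IPs from earlier in the thread as placeholders):

# Assumption: the machineconfig resource is available on your Talos version.
# Every node should report the same cluster endpoint (the VIP).
for node in 192.168.1.155 192.168.1.162 192.168.1.170; do
  talosctl get machineconfig --nodes "$node" -o yaml | grep -A1 'controlPlane:'
done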

sillock1 commented Sep 07 '24 13:09

Yep, I can confirm this fixed it for me as well. Thanks!

v1nsai commented Sep 07 '24 15:09