
auth keys can cause client to get into registration loop

Open willnorris opened this issue 2 years ago • 7 comments

We've had two customers report issues (go/hs/11584 and go/hs/16655) where a client has gotten into a registration loop using auth keys. Opening this to have a public issue to track.

My simplest reproduction steps:

  • register a node with an auth key: tailscale up --authkey=tskey-auth-XXX (you could probably also use an authpath url flow)
  • delete node from admin console
  • try to register the node again with a new auth key: tailscale up --authkey=tskey-auth-YYY

What we see in the client is an error that the new auth key is invalid:

backend error: invalid key: API key kKyP1T4CNTRL not valid

The client logs show the repeated registration attempts:

control: WaitLoginURL: register request: Post "https://controlplane.tailscale.com/machine/register": context canceled
control: authRoutine: state:url-visit-required; wantLoggedIn=true
control: authRoutine: quit
control: mapRoutine: context done.
control: mapRoutine: state:url-visit-required
control: mapRoutine: quit
control: authRoutine: state:new; goal=nil paused=false
Backend: logs: be:2071d1d011f19dd0e2f2c69196996edff28e6bcbee56b30ba40e21eaab720912 fe:
control: Client.Shutdown done.
control: mapRoutine: state:new
control: client.Login(false, 0)
control: authRoutine: context done.
control: authRoutine: state:new; wantLoggedIn=true
control: direct.TryLogin(token=false, flags=0)
control: doLogin(regen=false, hasUrl=false)
control: control server key from https://controlplane.tailscale.com: ts2021=[fSeS+], legacy=[nlFWp]
control: RegisterReq: onode= node=[VkpF4] fup=false nks=false
control: creating new noise client
control: sockstats: trace "ControlClientAuto" was overwritten by another
control: RegisterReq: got response; nodeKeyExpired=true, machineAuthorized=false; authURL=false
control: server reports new node key [VkpF4] has expired
control: doLogin(regen=true, hasUrl=false)
control: Generating a new nodekey.
control: RegisterReq: onode=[VkpF4] node=[FV2Zi] fup=false nks=false
control: RegisterReq: got response; nodeKeyExpired=false, machineAuthorized=false; authURL=false
control: TryLogin: invalid key: API key kKyP1T4CNTRL not valid
control: sendStatus: authRoutine-report: state:authenticating
Received error: invalid key: API key kKyP1T4CNTRL not valid
control: authRoutine: backoff: 10 msec
control: authRoutine: state:authenticating; wantLoggedIn=true
control: direct.TryLogin(token=false, flags=0)
control: doLogin(regen=false, hasUrl=false)
control: RegisterReq: onode= node=[VkpF4] fup=false nks=false
control: RegisterReq: got response; nodeKeyExpired=false, machineAuthorized=false; authURL=false
control: TryLogin: invalid key: API key kKyP1T4CNTRL not valid
control: sendStatus: authRoutine-report: state:authenticating
Received error: invalid key: API key kKyP1T4CNTRL not valid

In the audit logs, we can see that the first use of the auth key was partially successful, but clearly the client isn't updating its state.

[Screenshot of the audit logs, taken Aug 22, 2023 at 5:15 PM]

My notes from when this was originally reported a couple of months ago:

Okay, Maisem and I now have a somewhat better idea of what's going on, though we still don't have an obvious long-term fix. It will likely require a client change, so it isn't something we can simply fix on control and redeploy. The client gets into a weird state by switching back and forth between OAuth-derived keys and user-owned keys. It also seems to get into a bad state when the device is deleted in the admin console and tailscale up is then rerun with an OAuth-derived key.

In both of these cases, the current short-term workaround is to either a) delete the state directory (/var/lib/tailscale on Linux) and start clean, or b) run tailscale login instead of tailscale up. The latter creates a new profile on the device with clean state, which avoids the problems above.
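For the record, a rough sketch of both workarounds on a typical Linux host. The systemd service name and state path are assumptions for a stock Linux install; adjust for your platform, and tskey-auth-YYY is a placeholder:

```shell
# Option a) wipe local state and start clean.
# Stop tailscaled first so it doesn't rewrite the state files on shutdown.
sudo systemctl stop tailscaled
sudo rm -rf /var/lib/tailscale
sudo systemctl start tailscaled
sudo tailscale up --authkey=tskey-auth-YYY

# Option b) create a fresh login profile with clean state instead.
sudo tailscale login --authkey=tskey-auth-YYY
```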

cc @maisem

willnorris avatar Aug 23 '23 00:08 willnorris

My original notes indicated that this was related to OAuth-derived auth keys, but my testing today, and the reproduction steps above, show that it can be triggered with regular user-owned auth keys as well.

willnorris avatar Aug 23 '23 00:08 willnorris

Stumbled into this today; only deleting /var/lib/tailscale helped.

lizdeika avatar Oct 11 '23 16:10 lizdeika

Ran into this today, details here

What's different for me is that there is no /var/lib/tailscale persistence, as it's a container (k8s operator).

myoung34 avatar Jan 17 '24 16:01 myoung34

Running into similar problems using the artis3n.tailscale Ansible Galaxy package. Understood it's not an officially supported package, but the symptoms are identical to this.

The role worked the first time (to register the node), but all subsequent attempts to register the node using the API keys are failing. As per your Reddit post, @myoung34, tokens are being issued and then immediately revoked.

zakisaad avatar Jan 18 '24 08:01 zakisaad

Note for self: https://github.com/tailscale/corp/issues/6606 documents an issue where NodeKeyData.Revoked is getting set incorrectly. Since we're seeing log messages of server reports new node key [...] has expired, it's possible this is related.

willnorris avatar Jan 18 '24 16:01 willnorris

Sharing a couple of additional tickets related to this as well. Note that in both of these cases, the repro steps involve passing the OAuth client secret directly as the auth-key value when invoking tailscale up, leading to the same API key abc123CNTRL not valid error.

I have suggested using the get-authkey utility as a workaround in the meantime.
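A sketch of what that workaround looks like; the env var names and flags here are from my reading of the get-authkey README, so double-check against the tool itself before relying on them:

```shell
# get-authkey exchanges OAuth client credentials for a regular auth key,
# avoiding passing the OAuth secret straight to `tailscale up`.
export TS_API_CLIENT_ID="<oauth-client-id>"
export TS_API_CLIENT_SECRET="<oauth-client-secret>"
authkey="$(go run tailscale.com/cmd/get-authkey -tags tag:ci -preauth)"
tailscale up --authkey="${authkey}"
```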

garrett-ts avatar Feb 15 '24 14:02 garrett-ts

Commenting here for visibility: I can't get this to work with an auth key either. Currently I'm unable to use OAuth in any configuration for authorizing machines, which is a necessary workflow for me. I'm also able to replicate the issue using tsnet with a stateless (no /var/lib/tailscale) setup. In my case it appears as though OAuth is entirely broken.

```shell
echo "generating tailscale oauth client token"
export TS_ACCESS_TOKEN=$(curl -s -d "client_id=${OAUTH_CLIENT}" -d "client_secret=${OAUTH_SECRET}" "https://api.tailscale.com/api/v2/oauth/token" | jq -r .access_token)

echo "generating device authorization key"
# note: the JSON body must be double-quoted so ${tags} actually expands
export TS_AUTH_KEY=$(curl -s -X POST -H "Authorization: Bearer ${TS_ACCESS_TOKEN}" -d "{\"capabilities\":{\"devices\":{\"create\":{\"ephemeral\":false,\"preauthorized\":true,\"tags\":${tags}}}},\"expirySeconds\":30}" "https://api.tailscale.com/api/v2/tailnet/${tailscale_domain}/keys" | jq -r .key)
```

charles-d-burton avatar Feb 15 '24 15:02 charles-d-burton

Troubleshooting a related issue in a personal project over the weekend helped me figure out how to get past this bug. If I add "reusable": true to the API request, it works again. If I had to hazard a guess (I haven't had time to dig into the code), the generated key is getting used somewhere before it's passed to the tailscale up command. While this isn't optimal, I set the timeout pretty low on the key, so for my use case it's fine. I do think there's still a bug in here that needs to be addressed.
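For reference, a sketch of the key-creation payload with the reusable flag added; the tag name and expiry here are placeholders, and the curl call (shown commented out) is the same as in my earlier comment:

```shell
# Build the key-creation body with "reusable": true added to the
# create capabilities, then POST it to the keys endpoint as before.
BODY='{
  "capabilities": {"devices": {"create": {
    "reusable": true,
    "ephemeral": false,
    "preauthorized": true,
    "tags": ["tag:example"]
  }}},
  "expirySeconds": 300
}'
echo "$BODY"
# curl -s -X POST -H "Authorization: Bearer ${TS_ACCESS_TOKEN}" \
#   -d "$BODY" "https://api.tailscale.com/api/v2/tailnet/${tailscale_domain}/keys" | jq -r .key
```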

charles-d-burton avatar Feb 20 '24 20:02 charles-d-burton

This turned out to be a bug in our control server when registering devices using an auth key (not a client-side bug as previously thought). The fix has been deployed, and I've verified that re-registration as described in the original issue now works as expected.

If anyone continues to see this or similar, please let me know.

willnorris avatar Feb 22 '24 18:02 willnorris

Still experiencing this on Windows.

Neiva07 avatar Feb 22 '24 23:02 Neiva07

@Neiva07 can you say what steps you're following and what you're seeing?

willnorris avatar Feb 23 '24 00:02 willnorris