auth keys can cause client to get into registration loop
We've had two customers report issues (go/hs/11584 and go/hs/16655) where a client has gotten into a registration loop using auth keys. Opening this as a public issue to track it.
My simplest reproduction steps:
- register a node with an auth key: tailscale up --authkey=tskey-auth-XXX (you could probably also use an authpath url flow)
- delete the node from the admin console
- try to register the node again with a new auth key: tailscale up --authkey=tskey-auth-YYY
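For anyone who wants to script this, here are the same steps as a rough shell sketch. The key values are placeholders; the delete step assumes an API access token in TS_API_TOKEN and the public device-delete endpoint (deleting the node in the admin console UI works just as well):

```sh
# 1. register with the first auth key (tskey-auth-XXX is a placeholder)
tailscale up --authkey=tskey-auth-XXX

# 2. delete the node: either in the admin console, or (assumption) via the
#    public device-delete endpoint, using the node's stable ID from status
DEVICE_ID=$(tailscale status --json | jq -r .Self.ID)
curl -s -X DELETE -H "Authorization: Bearer ${TS_API_TOKEN}" \
  "https://api.tailscale.com/api/v2/device/${DEVICE_ID}"

# 3. re-register with a fresh key: this is where the loop starts
tailscale up --authkey=tskey-auth-YYY
```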
What we see in the client is an error that the new auth key is invalid:
backend error: invalid key: API key kKyP1T4CNTRL not valid
The client logs show the repeated registration attempts:
control: WaitLoginURL: register request: Post "https://controlplane.tailscale.com/machine/register": context canceled
control: authRoutine: state:url-visit-required; wantLoggedIn=true
control: authRoutine: quit
control: mapRoutine: context done.
control: mapRoutine: state:url-visit-required
control: mapRoutine: quit
control: authRoutine: state:new; goal=nil paused=false
Backend: logs: be:2071d1d011f19dd0e2f2c69196996edff28e6bcbee56b30ba40e21eaab720912 fe:
control: Client.Shutdown done.
control: mapRoutine: state:new
control: client.Login(false, 0)
control: authRoutine: context done.
control: authRoutine: state:new; wantLoggedIn=true
control: direct.TryLogin(token=false, flags=0)
control: doLogin(regen=false, hasUrl=false)
control: control server key from https://controlplane.tailscale.com: ts2021=[fSeS+], legacy=[nlFWp]
control: RegisterReq: onode= node=[VkpF4] fup=false nks=false
control: creating new noise client
control: sockstats: trace "ControlClientAuto" was overwritten by another
control: RegisterReq: got response; nodeKeyExpired=true, machineAuthorized=false; authURL=false
control: server reports new node key [VkpF4] has expired
control: doLogin(regen=true, hasUrl=false)
control: Generating a new nodekey.
control: RegisterReq: onode=[VkpF4] node=[FV2Zi] fup=false nks=false
control: RegisterReq: got response; nodeKeyExpired=false, machineAuthorized=false; authURL=false
control: TryLogin: invalid key: API key kKyP1T4CNTRL not valid
control: sendStatus: authRoutine-report: state:authenticating
Received error: invalid key: API key kKyP1T4CNTRL not valid
control: authRoutine: backoff: 10 msec
control: authRoutine: state:authenticating; wantLoggedIn=true
control: direct.TryLogin(token=false, flags=0)
control: doLogin(regen=false, hasUrl=false)
control: RegisterReq: onode= node=[VkpF4] fup=false nks=false
control: RegisterReq: got response; nodeKeyExpired=false, machineAuthorized=false; authURL=false
control: TryLogin: invalid key: API key kKyP1T4CNTRL not valid
control: sendStatus: authRoutine-report: state:authenticating
Received error: invalid key: API key kKyP1T4CNTRL not valid
In the audit logs, we can see that the first use of the auth key was partially successful, but clearly the client isn't updating its state.
My notes from when this was originally reported a couple of months ago:
Okay, Maisem and I now have a somewhat better idea of what's going on, though we still don't have an obvious long-term fix. It will likely require a client change, so it's not something we can simply fix on control and redeploy. The client gets into a weird state by switching back and forth between OAuth-derived keys and user-owned keys. It also seems to get into a bad state when the device is deleted in the admin console and tailscale up is then rerun with an OAuth-derived key.
In both of these cases, the current short-term workaround is to either a) delete the state directory (/var/lib/tailscale on Linux) and start clean, or b) run tailscale login instead of tailscale up. This creates a new profile on the device with clean state, which avoids the problems above.
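In practice, the two workarounds look roughly like this on a Linux machine (a sketch only: the state directory path differs per platform, and stopping tailscaled before wiping state is my own precaution, not a documented requirement):

```sh
# workaround (a): wipe local state and start clean
sudo systemctl stop tailscaled    # precaution: stop the daemon first
sudo rm -rf /var/lib/tailscale    # Linux state dir; differs on other OSes
sudo systemctl start tailscaled
tailscale up --authkey=tskey-auth-YYY

# workaround (b): create a fresh profile instead of reusing the broken one
tailscale login --authkey=tskey-auth-YYY
```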
cc @maisem
My original notes indicated that this was related to OAuth-derived auth keys, but my testing today, and the reproduction steps above, show that it can be triggered with regular user-owned auth keys as well.
Stumbled into this today; only deleting /var/lib/tailscale helped.
Ran into this today, details here.
What's different for me is that there's no /var/lib/tailscale persistence, as it's a container (k8s operator).
Running into similar problems using the artis3n.tailscale Ansible Galaxy package. I understand it's not an officially supported package, but the symptoms are identical to this issue.
The role worked the first time (registering the node), but all subsequent attempts to register the node using API keys fail. As per your Reddit post @myoung34, tokens are being issued and then immediately revoked.
Note to self: https://github.com/tailscale/corp/issues/6606 documents an issue where NodeKeyData.Revoked is getting set incorrectly. Since we're seeing "server reports new node key [...] has expired" log messages, it's possible this is related.
Sharing a couple of additional tickets related to this as well. Note that in both of these cases, the repro steps are to pass the OAuth client secret directly as the auth-key value when invoking tailscale up, leading to the same "API key abc123CNTRL not valid" error.
I have suggested using the get-authkey utility as a workaround in the meantime.
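For reference, get-authkey usage looks roughly like this; I believe it reads the OAuth client credentials from TS_API_CLIENT_ID and TS_API_CLIENT_SECRET and prints a freshly minted auth key, but check the tool's own docs for the exact flags, as this is from memory (the tag name is a placeholder):

```sh
# assumption: OAuth client credentials exported in the environment, which is
# where I believe get-authkey looks for them
export TS_API_CLIENT_ID="<oauth-client-id>"
export TS_API_CLIENT_SECRET="<oauth-client-secret>"

# mint a tagged, pre-authorized auth key, then hand it to tailscale up
authkey=$(go run tailscale.com/cmd/get-authkey@main -tags tag:server -preauth)
tailscale up --authkey="${authkey}"
```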
Commenting here for visibility: I can't get this to work with an auth key either. Currently I'm unable to use OAuth in any configuration for authorizing machines, which is a necessary workflow for me. I'm also able to replicate the issue using tsnet with a stateless (no /var/lib/tailscale) setup. In my case it appears as though OAuth is entirely broken.
echo "generating tailscale oauth client token"
export TS_ACCESS_TOKEN=$(curl -s -d "client_id=${OAUTH_CLIENT}" -d "client_secret=${OAUTH_SECRET}" "https://api.tailscale.com/api/v2/oauth/token" | jq -r .access_token)
echo "generating device authorization key"
export TS_AUTH_KEY=$(curl -s -X POST -H "Authorization: Bearer ${TS_ACCESS_TOKEN}" -d '{"capabilities": {"devices":{"create":{"ephemeral":false,"preauthorized":true,"tags":${tags}}}},"expirySeconds": 30}' "https://api.tailscale.com/api/v2/tailnet/${tailscale_domain}/keys" |jq -r .key)```
Troubleshooting a related issue in a personal project over the weekend helped me figure out how to get past this bug. If I add the "reusable": true flag to the API request, it works again. If I had to hazard a guess (I haven't had time to dig into the code), the generated key is getting used somewhere before it's passed to the tailscale up command. While this isn't optimal, I set the timeout pretty low on the key, so for my use case it's fine. I do think there's still a bug in here that needs to be addressed.
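Concretely, the change is one extra field in the key-creation request body. A sketch reusing the variables from the snippet above (the reusable field name comes from the public keys API; the low expirySeconds is just my own choice):

```sh
# same variables as the earlier snippet; "reusable": true is the only
# functional change to the request body
export TS_AUTH_KEY=$(curl -s -X POST -H "Authorization: Bearer ${TS_ACCESS_TOKEN}" \
  -d "{\"capabilities\": {\"devices\": {\"create\": {\"reusable\": true, \"ephemeral\": false, \"preauthorized\": true, \"tags\": ${tags}}}}, \"expirySeconds\": 30}" \
  "https://api.tailscale.com/api/v2/tailnet/${tailscale_domain}/keys" | jq -r .key)
```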
This turned out to be a bug in our control server when registering devices using an auth key (not a client-side bug, as previously thought). The fix has been deployed, and I've verified that re-registration as described in the original issue now works as expected.
If anyone continues to see this or similar, please let me know.
Still experiencing this on Windows.
@Neiva07 can you say what steps you're following and what you're seeing?