teleport add self-repair for malformed instance certs and explicitly disallow future mix-and-match of join tokens

Open fspmarshall opened this issue 9 months ago • 0 comments

NOTE: this still needs more test coverage, which will be incoming.

Problem Summary

Historically, it has been possible to set up an agent with multiple separate system roles by mixing and matching permissions from different tokens over time. For example, an ssh agent could initially be set up using a token that only grants the Node system role. Later, that token could be swapped out for one that only grants the Kube system role and the associated service could be activated. The new Kube service would be added, and from the point of view of the teleport cluster it looked as if a new kube-only agent had joined that happened to have the same server ID as a pre-existing ssh agent. This was an artifact of how certificate acquisition worked rather than an intentional feature. Individual per-service certificates don't really care if they originate from the same join token or not, but the primary "instance cert" encodes all system roles granted by the initial join token, and becomes out of sync with the agent's current set of system roles when mix-and-match occurs.

On the more benign end of the spectrum, this desync between an agent's active services and its instance role set would just cause problems with instance heartbeats not showing some services correctly (e.g. https://github.com/gravitational/teleport/issues/38977). For services that make heavier use of the instance cert, such as the ssh service, the service would be thrown into an error loop and be unable to function due to permission errors.
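The desync described above can be modeled as a simple set difference. This is a minimal Python sketch for illustration only, not Teleport's implementation; the `dangling_roles` helper and the plain-string role names are assumptions.

```python
def dangling_roles(instance_cert_roles, active_service_roles):
    """Return roles the agent runs services for that its instance cert lacks."""
    return set(active_service_roles) - set(instance_cert_roles)


# An agent that joined with a Node-only token and later activated a Kube
# service using a different token: the instance cert still only encodes Node.
instance_cert_roles = {"Node"}
active_service_roles = {"Node", "Kube"}

missing = dangling_roles(instance_cert_roles, active_service_roles)
print(sorted(missing))  # prints ['Kube']
```

A non-empty result corresponds to the out-of-sync state that leads to incorrect heartbeats or permission errors.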

Discussion of Changes

There are two alternative ways to tackle this problem:

1. Preserve the ability to mix-and-match, and allow teleport agents to "merge" permissions granted by multiple separate tokens to get a new instance cert that encodes all system roles regardless of which token granted which.
2. Preserve the one-to-one mapping of instance cert to join token and disallow mix-and-match of tokens going forward.

After some deliberation, I've opted to go with the latter option and to preserve the one-to-one mapping. The main reason for doing so is that I believe this strategy is more compatible with a number of future features that are currently in discussion, most notably statically enforcing labels/limits/scopes via join tokens. So long as join tokens only grant system roles, mix-and-match isn't really a big deal as long as teleport updates its certs correctly, but mix-and-match makes more granular agent controls onerous to implement. By preserving the one-to-one mapping between join tokens and agent identities, we ensure that all future agent access-control systems are easy to reason about and are not forced to come up with their own mix-and-match model in order to remain sane.

In order to preserve backwards compatibility with existing agents that have already been configured with permissions from a mixed set of tokens, a temporary merge strategy is added in this PR. For system roles added before v16 that are not present on the instance cert, agents will be able to prove that they hold credentials for those system roles via an assertion and then reissue their instance certs to include the dangling system role. This will ensure that existing agents self-repair rather than entering a broken state.
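The temporary merge strategy above can be sketched roughly as follows. This is an illustrative Python model, not the code in this PR: `RoleAssertion`, its fields, and `repaired_role_set` are hypothetical names, and the sketch simply assumes the cluster can tell whether a role was added before v16.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleAssertion:
    """Hypothetical record of a system role the agent claims to hold."""
    role: str
    has_valid_credentials: bool  # agent proved it holds a cert for this role
    added_before_v16: bool       # assumption: this can be determined somehow


def repaired_role_set(instance_cert_roles, assertions):
    """Merge provable pre-v16 dangling roles into the instance role set."""
    repaired = set(instance_cert_roles)
    for assertion in assertions:
        if assertion.role in repaired:
            continue  # already encoded on the instance cert
        if assertion.has_valid_credentials and assertion.added_before_v16:
            repaired.add(assertion.role)
    return repaired
```

If the returned set differs from the roles on the current instance cert, the agent would reissue its instance cert with the merged set.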

All new attempts to add additional system roles via mix-and-match will be rejected and the agent will refuse to start. From v16 onwards, the recommended strategy for adding system roles that are not authorized by an agent's initial join token will be to do a full reset of that agent, deleting its state directory and giving it a new join token that authorizes all desired system roles.

Two notable cases that require special handling are changes to auth server service sets, and "hosted plugins" that double-duty as teleport services and have their own system roles/certificates (some do and some don't).

For auth servers, it isn't necessarily reasonable to require resetting local agent state when changing the system roles of the agent, since some auth servers may be using local storage for cluster state and/or audit log (though this isn't recommended for production deployments). In order to sidestep this issue, any agent running an auth service is treated as a special case and will always regenerate its instance cert, even when the system role was added post-v16.
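Putting the rules so far together, the overall decision can be summarized in a small sketch. This is a hedged Python model of the behavior described in this PR, not the actual logic; `Action`, `decide`, and the `provable_legacy_roles` parameter (roles added before v16 that the agent can prove it holds) are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "nothing to do"
    REISSUE = "reissue instance cert"
    REJECT = "refuse to start"


def decide(instance_cert_roles, requested_roles, provable_legacy_roles, runs_auth):
    """Decide how to handle roles that are missing from the instance cert."""
    extra = set(requested_roles) - set(instance_cert_roles)
    if not extra:
        return Action.NONE
    if runs_auth:
        # Auth servers are special-cased and always regenerate the cert.
        return Action.REISSUE
    if extra <= set(provable_legacy_roles):
        # Pre-v16 mix-and-match: self-repair instead of breaking.
        return Action.REISSUE
    # New mix-and-match attempt: reject and ask for an agent reset.
    return Action.REJECT
```

For example, `decide({"Node"}, {"Node", "Kube"}, set(), False)` yields `Action.REJECT`, while the same call with `runs_auth=True` yields `Action.REISSUE`.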

For hosted plugins that double-duty as teleport services, the current behavior treats the allocation of the plugin certificate as equivalent to a new system role being added to the agent during normal runtime. This trips up the instance cert reissue logic, causing the instance cert to be spuriously regenerated. To address this issue, hosted plugin roles are now tracked separately from static service roles and are ignored by the new instance cert logic in this PR.
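A rough sketch of the separate tracking, again as an illustrative Python model rather than the real data structures: `AgentRoles`, its field names, and the `ExamplePlugin` role name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRoles:
    """Hypothetical split between static service roles and plugin roles."""
    static_service_roles: set = field(default_factory=set)
    hosted_plugin_roles: set = field(default_factory=set)

    def needs_instance_reissue(self, instance_cert_roles):
        # Only static service roles are compared against the instance cert;
        # hosted plugin roles are deliberately ignored.
        return not self.static_service_roles <= set(instance_cert_roles)


agent = AgentRoles(static_service_roles={"Node"})
agent.hosted_plugin_roles.add("ExamplePlugin")  # allocated at runtime
print(agent.needs_instance_reissue({"Node"}))  # prints False
```

Because the plugin role never enters the comparison, allocating a plugin certificate at runtime no longer triggers a spurious reissue.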

Summary of User-Facing Changes

- When upgrading, teleport agents that had previously been granted multiple system roles from different tokens will perform a certificate reissue and reload shortly after startup (just as if the agent's address/hostname had changed since the last restart).
- Starting in v16, agents will reject mix-and-match of permissions from multiple join tokens and direct users to reset the agent state if mix-and-match was intentional.

Fixes: https://github.com/gravitational/teleport/issues/38977

changelog: fixed an issue where mix-and-match of join tokens could interfere with some services appearing correctly in heartbeats. mix-and-match of join tokens will be explicitly rejected in v16 onwards.

May 13 '24 15:05 fspmarshall
