Improve ACL errors context to make figuring workable ACLs out simpler
ACLs in Consul are a mess to navigate. Any given agent may be making requests with half a dozen different tokens (acl.tokens.default, acl.tokens.agent, service.token, consul connect envoy ...), and it's not clear which token is being used when a permission error occurs. The error also doesn't tell me which permission is missing.
The guides and documentation are not organized well enough around ACLs to even know which permissions are going to be needed for which purposes, so ACLs are already a game of whack-a-mole. Error messages that give more context would be a huge step forward; at least then I'd know where the moles are so I can whack them.
Feature Description
In general, every "permission denied" error should show the Accessor ID and the "slot" (acl.tokens.default, acl.tokens.agent, etc.) that the token came from, to hint to the user which type of request uses which token "slot".
@chrisjohnson: this is a mock-up of a potential direction we could take to make this better. Because it hasn't been vetted by others yet (engineering, design), it may need to change. However, I wanted to share it in this early form to get feedback from you (and anyone else viewing this):
- How does this compare to what you were hoping for?
- What feels unnecessary or unhelpful?
- What feels like it's missing or could be improved?
- What's your preference between the options?
This just shows potential CLI changes. There would be corresponding HTTP API changes. And we'd also want to reflect this info on the GUI. Do you have a sense of how you'd prefer to explore such information, and why? (HTTP API, CLI, or GUI?)
Workflow for Resolving a Token
Task: A user sees an ACL permission denied error indicating that the ACL token used for the operation lacks the appropriate access to the requested resource. The user knows that they have namespace default policies in place. The user wants to troubleshoot - to understand why access was denied and change things so access will be approved next time.
Current Workflow
Question: Is this an accurate reflection of the process today? What should I correct?
1. See generic “Permission Denied” error message
2. Somehow infer which token was used
3. Somehow infer which operation failed; consult the docs for which permission is missing
4. Inspect the token, get back the policies, node identities, service identities, and roles
   - Inspect all the policies, get back rules
   - Inspect all the roles, get back policies
     - Inspect all the policies, get back rules
5. Notice that the token output doesn’t show the namespace defaults. Run consul namespace read ns, get back the policies and roles
   - Inspect all the policies, get back rules
   - Inspect all the roles, get back policies
     - Inspect all the policies, get back rules
6. Possibly check the default policy (deny or allow)
7. Manually review all the information from steps 4-6 to try to understand why permission was denied
8. If you can’t understand why permission was denied, return to step 2 or 3 because you might have been wrong
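The manual walk described in this list amounts to a graph traversal: token to policies, token to roles to policies, then the same again for the namespace defaults. A minimal sketch in Python, assuming hypothetical in-memory data (the dictionaries `POLICIES`, `ROLES`, `TOKENS`, and `NAMESPACES` and every name in them are invented for illustration; in practice each lookup is a separate inspect/read call):

```python
# Hypothetical stand-ins for what each inspect step returns.
POLICIES = {
    "kv-list": 'key_prefix "" { policy = "list" }',
    "kv-admin-deny": 'key_prefix "admin/" { policy = "deny" }',
    "ns-default": 'service_prefix "" { policy = "read" }',
}
ROLES = {"ops": {"policies": ["kv-admin-deny"]}}
TOKENS = {"tok-1": {"policies": ["kv-list"], "roles": ["ops"], "namespace": "testing-ns"}}
NAMESPACES = {"testing-ns": {"policies": ["ns-default"], "roles": []}}


def collect_rules(holder, source):
    """Return (source, policy name, rules) for policies linked directly or via roles."""
    found = [(source, name, POLICIES[name]) for name in holder["policies"]]
    for role_name in holder["roles"]:
        for name in ROLES[role_name]["policies"]:
            found.append((f"{source} via role {role_name}", name, POLICIES[name]))
    return found


def gather(token_id):
    """Collect every rule that can affect the token, including namespace defaults."""
    token = TOKENS[token_id]
    rules = collect_rules(token, "token")  # inspect the token
    rules += collect_rules(NAMESPACES[token["namespace"]], "namespace default")
    return rules


for source, name, text in gather("tok-1"):
    print(f"{source}: {name}: {text}")
```

Even with one role and one namespace, three separate sources have to be read and compared by hand, which is the pain point the proposed workflow tries to remove.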
Proposed Workflow
1. See a more detailed “Permission Denied” error message describing (1) what permission was lacking on (2) which resource and (3) how to get more information on why this is the case for the provided token.
2. Use a CLI command to understand the compiled permissions of that token and how they apply to the resource in step 1.
3. Based on step 2, modify policies as needed to obtain the necessary permissions, or use a different token (which can be checked using the CLI command in step 2).
Error Message Improvement (Permission Denied)
Current Message
Just says "Permission denied":
2021-07-15T17:03:28.642-0400 [ERROR] agent.proxycfg: Failed to handle update from watch: service_id=testing-ns/
myservice-sidecar-proxy id=leaf error="error filling agent cache: rpc error making call: rpc error making call:
Permission denied"
Proposed Message
Provides additional information, including:
- From where the token was specified
- What permission is lacking (read) for which resource (service) and which label (myservice-sidecar-proxy)
- How to get more info (run consul acl access explain ...)
2021-07-15T17:03:28.642-0400 [ERROR] agent.proxycfg: Failed to handle update from watch: service_id=testing-ns/
myservice-sidecar-proxy id=leaf error="error filling agent cache: rpc error making call: rpc error making call:
Permission denied: ACL token from agent config entry 'acl.tokens.default' lacks permission 'service:read' on service
'myservice-sidecar-proxy'; for more info, run: consul acl access explain
-token=2b58e043-178d-8f43-fb74-4ef511f3c0ac -resource=service -label='myservice-sidecar-proxy'"
Question: thoughts on this revised message? Any concerns about it? Info you think is important but missing?
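To make the shape of the proposed message concrete, here is a minimal sketch in Python of how it could be assembled from structured fields. The `DeniedRequest` type and its field names are hypothetical, and `consul acl access explain` is the command proposed in this mock-up, not an existing one:

```python
from dataclasses import dataclass


@dataclass
class DeniedRequest:
    """Hypothetical record of a denied request; field names are illustrative."""
    accessor_id: str  # AccessorID of the token that was used
    slot: str         # where the token came from, e.g. "acl.tokens.default"
    resource: str     # resource type, e.g. "service"
    access: str       # access level that was required, e.g. "read"
    label: str        # name of the resource


def format_denied(req: DeniedRequest) -> str:
    """Render the proposed message: slot, missing permission, and follow-up command."""
    return (
        f"Permission denied: ACL token from agent config entry '{req.slot}' "
        f"lacks permission '{req.resource}:{req.access}' on {req.resource} "
        f"'{req.label}'; for more info, run: consul acl access explain "
        f"-token={req.accessor_id} -resource={req.resource} -label='{req.label}'"
    )


message = format_denied(DeniedRequest(
    accessor_id="2b58e043-178d-8f43-fb74-4ef511f3c0ac",
    slot="acl.tokens.default",
    resource="service",
    access="read",
    label="myservice-sidecar-proxy",
))
print(message)
```

Keeping the fields separate until the last moment would also let the HTTP API return them as structured data rather than only as a string.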
Understanding an Authorization Enforcement Decision
The current process is to use the "Current Workflow" steps 2-8. The new process would add utilities for "Proposed Workflow" steps 2-3. (Long-term, this information would be easiest to present in a GUI, but it would likely start with a CLI command.)
Explain access for a given token, resource, and label
The proposed error message above tells you which command to run to explain a "Permission denied":
for more info, run: consul acl authorizer explain -token=2b58e043-178d-8f43-fb74-4ef511f3c0ac -resource=service -label='myservice-sidecar-proxy'"
The command would provide the following details about the decision:
- resolved access level
- which policy source specified that access level
- why other policy sources were overridden, and what access level they otherwise would have specified
Questions:
- Is anything important missing? Or any of the above unnecessary?
- Thoughts on the command name? (consul acl token access explain)
$ consul acl token access explain -id=<token> -resource=key -label=admin/secret
Access Level: deny
Enforcer:
Type: role - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - role name
-> policy - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - policy name
Enforcement Layer: Token (1/3)
Rule:
namespace_prefix "" > key_prefix "admin/" {
policy = "deny"
}
Overridden 1:
Override Reason: enforcer's "deny" takes precedence over "read"
Type: policy - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - policy name
Enforcement Layer: Token (1/3)
Rule:
namespace_prefix "" > key_prefix "admin/" {
policy = "read"
}
Overridden 2:
Override Reason: enforcer's rule has a longer prefix match
Type: policy - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - policy name
Enforcement Layer: Token (1/3)
Rule:
namespace_prefix "" > key_prefix "" {
policy = "list"
}
Overridden 3:
Override Reason: enforcer's match occurs at a higher layer (Token - 1/3)
Type: policy - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - policy name
Enforcement Layer: Namespace Default (2/3)
Rule:
namespace_prefix "" > key_prefix "" {
policy = "deny"
}
Overridden 4:
Override Reason: enforcer's match occurs at a higher layer (Token - 1/3)
Type: Default Policy
Enforcement Layer: Default Policy (3/3)
Rule:
namespace_prefix "" > key_prefix "" {
policy = "deny"
}
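The override reasons in the mock-up above imply an ordering: layer first, then longest prefix, then access level. A minimal sketch of that ordering in Python, assuming key_prefix rules only; the `Rule` type, the source names, and the ranking of "write", "list", and "read" below "deny" are assumptions made for illustration, not Consul's actual resolver:

```python
from dataclasses import dataclass

LAYERS = ["Token", "Namespace Default", "Default Policy"]
# Assumed tie-break for rules with the same layer and prefix: "deny" ranks highest.
PRECEDENCE = {"deny": 3, "write": 2, "list": 1, "read": 0}


@dataclass(frozen=True)
class Rule:
    layer: int   # index into LAYERS; lower layers are consulted first
    prefix: str  # key_prefix the rule applies to
    access: str  # "deny", "write", "list", or "read"
    source: str  # hypothetical policy or role name


def sort_key(rule):
    # Earlier layer, then longer prefix, then stronger access level.
    return (rule.layer, -len(rule.prefix), -PRECEDENCE[rule.access])


def explain(rules, label):
    """Return the enforcing rule and a list of (overridden rule, reason)."""
    matches = sorted((r for r in rules if label.startswith(r.prefix)), key=sort_key)
    if not matches:
        raise LookupError(f"no rule matches {label!r}")
    enforcer, overridden = matches[0], []
    for rule in matches[1:]:
        if rule.layer != enforcer.layer:
            reason = f"enforcer's match occurs at a higher layer ({LAYERS[enforcer.layer]})"
        elif len(rule.prefix) != len(enforcer.prefix):
            reason = "enforcer's rule has a longer prefix match"
        else:
            reason = f'enforcer\'s "{enforcer.access}" takes precedence over "{rule.access}"'
        overridden.append((rule, reason))
    return enforcer, overridden


RULES = [
    Rule(0, "", "list", "policy-b"),
    Rule(0, "admin/", "read", "policy-a"),
    Rule(0, "admin/", "deny", "role-policy"),
    Rule(1, "", "deny", "namespace-policy"),
    Rule(2, "", "deny", "default-policy"),
]

enforcer, overridden = explain(RULES, "admin/secret")
print("Access Level:", enforcer.access, "from", enforcer.source)
for rule, reason in overridden:
    print(f"Overridden: {rule.source} ({rule.access}): {reason}")
```

With the five rules from the mock-up, the sketch picks the token-layer "deny" on "admin/" as the enforcer and reports the other four as overridden, in the same order and with the same reasons.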
Read the resolved access for a given token
The command described above is primarily intended to explain an enforcement decision (such as in response to an error message). This command is instead focused on explaining the compiled ruleset for a token and how to modify it (based on the policy source).
The usage shown below is the full output. It could also allow filtering by a resource (exact match or prefix) or namespace (exact match or prefix) to make it easier to view only what you need.
Questions:
- Thoughts on the value (or not) of this command? Anything you'd recommend changing?
- Thoughts on the command name? (consul acl token access read)
$ consul acl token access read -namespace=testing-rx3 -id=2b58e043-178d-8f43-fb74-4ef511f3c0ac
Token:
AccessorID: 2b58e043-178d-8f43-fb74-4ef511f3c0ac
SecretID: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Namespace: testing-rx3
Description: service.token -- mitrx3n1.cmmint.net
Local: false
Create Time: 2021-10-07 10:35:40.125000128 -0400 -0400
Rules:
Resource “key”:
Layer 0: Token
namespace_prefix “”:
key_prefix “”: list (role - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - role name)
key_prefix “admin/”: deny (policy - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - policy name)
key “admin/test”: read (policy - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - policy name)
key “app/”: write (policy - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - policy name)
Layer 1: Namespace Defaults
Layer 2: Default Policy
namespace_prefix “”:
key_prefix “”: deny
Resource “node”:
Layer 0: Token
... enumerate the resolved rules as above for Resource "key" ...
... enumerate the other types of resources ...
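The grouping used in this output (resource, then layer, then individual rules) and the optional resource filter mentioned earlier can be sketched as follows. This is a minimal Python illustration over a hypothetical flat rule list; the tuple layout and the sample rules are assumptions:

```python
from collections import defaultdict

LAYERS = ["Token", "Namespace Defaults", "Default Policy"]

# Hypothetical flat rules: (resource, layer index, matcher, access, source).
RULES = [
    ("key", 0, 'key_prefix ""', "list", "role name"),
    ("key", 0, 'key_prefix "admin/"', "deny", "policy name"),
    ("key", 2, 'key_prefix ""', "deny", ""),
    ("node", 0, 'node_prefix ""', "read", "policy name"),
]


def compile_view(rules, resource=None):
    """Group rules by resource and layer, optionally keeping one resource type."""
    view = defaultdict(lambda: defaultdict(list))
    for res, layer, matcher, access, source in rules:
        if resource is not None and res != resource:
            continue
        view[res][layer].append((matcher, access, source))
    return view


def render(view):
    """Render the grouped rules, listing every layer even when it is empty."""
    lines = []
    for res in sorted(view):
        lines.append(f'Resource "{res}":')
        for index, name in enumerate(LAYERS):
            lines.append(f"  Layer {index}: {name}")
            for matcher, access, source in view[res][index]:
                suffix = f" ({source})" if source else ""
                lines.append(f"    {matcher}: {access}{suffix}")
    return "\n".join(lines)


print(render(compile_view(RULES, resource="key")))
```

Listing empty layers explicitly, as the mock-up does for "Namespace Defaults", makes it visible that a layer was checked and contributed nothing.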
Potential call signature improvement...
Original proposal:
$ consul acl token access explain -id=<token> -resource=key -label=admin/secret
Alternative:
$ consul acl token access explain -id=<token> -key=admin/secret
$ consul acl token access explain -id=<token> -service=myservice-sidecar-proxy
... -<resource type>=<resource label>
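One consequence of the alternative signature is that the resource type is carried by the flag name, so the parser has to reject calls that name more than one resource. A minimal Python sketch of that parsing, assuming a hypothetical subset of resource types and a hypothetical `parse_explain_args` helper:

```python
# Assumed subset of resource types, for illustration only.
RESOURCE_TYPES = {"key", "service", "node"}


def parse_explain_args(argv):
    """Parse ['-id=<token>', '-<resource type>=<label>'] into (id, resource, label)."""
    token_id, target = None, None
    for arg in argv:
        name, sep, value = arg.lstrip("-").partition("=")
        if not arg.startswith("-") or not sep:
            raise ValueError(f"expected -name=value, got {arg!r}")
        if name == "id":
            token_id = value
        elif name in RESOURCE_TYPES:
            if target is not None:
                raise ValueError("only one resource flag may be given")
            target = (name, value)
        else:
            raise ValueError(f"unknown flag -{name}")
    if token_id is None or target is None:
        raise ValueError("both -id and one resource flag are required")
    return token_id, target[0], target[1]


print(parse_explain_args(["-id=abc", "-key=admin/secret"]))
```

The original `-resource`/`-label` pair avoids the "more than one resource" check but needs both flags to be validated together, so the trade-off is mostly about ergonomics.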
@chrisjohnson : Consul 1.12 will include more verbose ACL error messages!
Instead of just Permission denied, they will be something like:
Permission denied: token with AccessorID '8a2d52a0-6b41-7077-8374-09d4fafa2d30' lacks permission 'service:read' on "foobar" in partition "foo", namespace "bar"
I'll leave this issue open because there are further improvements that could be made in the future (stating the "slot" that a token comes from, the "explain" functionality).
Relevant PRs: #12308, #12470, #12550, #12567, #12597, #12620
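For anyone scripting around the more verbose message quoted above, here is a minimal Python sketch that pulls the fields out of it. The regular expression is written against that single quoted example only; the real format is described as "something like" this, so treat the pattern as an assumption that may need adjusting:

```python
import re

# Pattern modeled on the single example message quoted in this thread.
DENIED_RE = re.compile(
    r"Permission denied: token with AccessorID '(?P<accessor_id>[^']+)' "
    r"lacks permission '(?P<resource>[^':]+):(?P<access>[^']+)' "
    r'on "(?P<name>[^"]+)"'
    r'(?: in partition "(?P<partition>[^"]+)", namespace "(?P<namespace>[^"]+)")?'
)


def parse_denied(message):
    """Return the named fields as a dict, or None if the message does not match."""
    match = DENIED_RE.search(message)
    return match.groupdict() if match else None


sample = (
    "Permission denied: token with AccessorID "
    "'8a2d52a0-6b41-7077-8374-09d4fafa2d30' lacks permission 'service:read' "
    'on "foobar" in partition "foo", namespace "bar"'
)
print(parse_denied(sample))
```

The accessor ID extracted this way can then be used to look up the token and begin the policy walk, instead of guessing which token was used.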
Did the ACL expanders/explainers ever go anywhere? While the better errors help with some scenarios, they do not help in cases where an empty response is returned instead of a 403, for example when listing the catalog services.