Add ACL Authentication and Authorization Metrics to Nomad
Proposal
Hello,
While configuring alerts for my Nomad cluster, I noticed that there are no existing metrics to track ACL-related events. Specifically, I am interested in monitoring denied RPC requests due to insufficient permissions.
It would be very useful to have built-in metrics that provide insight into authentication and authorization failures, including failed RPC requests due to ACL violations. This would enable more proactive alerting without relying on log scanning, which currently requires setting the log level to debug to capture failed authentication attempts. That is a less-than-ideal solution because of the increased verbosity.
Use-cases
The primary use case is to enhance security monitoring and infrastructure management by making it easier to detect and respond to failed authentication or authorization attempts. This could help identify potential security breaches or misconfigurations in a more efficient manner than parsing debug logs.
Just leaving a note here that we'll want to consider how much this feature request overlaps with the Nomad Enterprise audit feature.
Thanks for suggesting this, @econsult-devops! My colleague @tgross has some expertise on that system, so I'd like to see what they think of this. In the meantime, if you'd like to submit a PR I'd be happy to review it for you.
I'm going to break my thoughts into two parts for authentication vs authorization. I'd recommend any implementation break the work into two separate changesets as well.
authentication
We currently have RPC rate metrics that are post-authentication, and those metrics can include labels for the token or role. There are a few potential places to hook metrics for authentication issues:
1. Authenticate (and similar methods in auth.go) would let us centralize metrics where authentication failures happen. By definition, these would not be able to have an authenticated identity in the labels but could give you a high level view of the rate of failure, and there'd be only one place to add the metric.
2. Post-forwarding in the RPC handlers, where we measure the RPC rates. For example, here in NodePool.List. We could add a new label for authentication failures and simply reuse the existing rate metrics.
3. Post-forwarding in the RPC handlers, after we measure the RPC rates. For example, here in NodePool.List. That would be a new metric.
If we were to implement this, I'm somewhat inclined to go with option (2) because it will be granular enough to be useful, while adding cardinality to existing metrics rather than creating brand-new metric series. Option (1) would be much easier to implement, but I question how actionable an unlabeled count of authentication errors is.
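To make option (2) concrete, here is a minimal sketch of reusing an existing per-RPC rate metric and adding a label when authentication fails. Everything in it is an assumption for illustration: the Registry type, the metric name rpc.request, the auth_failure label, and the measureRPC helper are hypothetical and are not Nomad's actual metrics code.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// errAuth is a hypothetical sentinel for a failed authentication.
var errAuth = errors.New("authentication failed")

// Label is a hypothetical key/value pair attached to a metric series.
type Label struct{ Name, Value string }

// Registry is a toy in-memory counter store standing in for a real
// metrics sink. It is not Nomad's implementation.
type Registry struct{ counts map[string]int }

// NewRegistry returns an empty Registry.
func NewRegistry() *Registry { return &Registry{counts: map[string]int{}} }

// seriesKey builds a stable key from the metric name and its sorted labels,
// so the order in which labels are passed does not matter.
func seriesKey(name string, labels []Label) string {
	parts := make([]string, 0, len(labels))
	for _, l := range labels {
		parts = append(parts, l.Name+"="+l.Value)
	}
	sort.Strings(parts)
	return name + "{" + strings.Join(parts, ",") + "}"
}

// Incr adds one to the series identified by name and labels.
func (r *Registry) Incr(name string, labels ...Label) {
	r.counts[seriesKey(name, labels)]++
}

// Get returns the current count for the series identified by name and labels.
func (r *Registry) Get(name string, labels ...Label) int {
	return r.counts[seriesKey(name, labels)]
}

// measureRPC mimics option (2): the existing per-RPC rate metric gains one
// extra label when authentication failed, instead of a new metric name.
func measureRPC(r *Registry, rpc string, authErr error) {
	labels := []Label{{"rpc", rpc}}
	if authErr != nil {
		labels = append(labels, Label{"auth_failure", "true"})
	}
	r.Incr("rpc.request", labels...)
}

func main() {
	r := NewRegistry()
	measureRPC(r, "NodePool.List", nil)
	measureRPC(r, "NodePool.List", errAuth)
	measureRPC(r, "NodePool.GetNodePool", errAuth)

	failed := r.Get("rpc.request",
		Label{"rpc", "NodePool.List"}, Label{"auth_failure", "true"})
	fmt.Println("NodePool.List auth failures:", failed)
}
```

The trade-off shows up in the series keys: each RPC gets at most one additional series for failures, whereas option (1) would collapse every failure into a single series with no RPC name attached.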
That being said, arguably we already have option (2) unless you're running with an anonymous policy that's wide-open. With tight ACLs, any rate metric with the label token:anonymous is an authentication failure.
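As a rough illustration of that point, the sketch below sums the rate samples that carry the anonymous token label. The RateSample type and its fields are hypothetical; how you actually pull labeled samples out of your metrics backend will differ.

```go
package main

import "fmt"

// RateSample is a hypothetical snapshot of one labeled RPC rate series.
type RateSample struct {
	RPC   string // RPC method name, e.g. "NodePool.List"
	Token string // value of the token label, e.g. "anonymous"
	Count int    // requests observed in the sampling window
}

// anonymousRequests sums the requests labeled with the anonymous token.
// Assumption: ACLs are tight, so anonymous requests approximate
// authentication failures, as described above.
func anonymousRequests(samples []RateSample) int {
	total := 0
	for _, s := range samples {
		if s.Token == "anonymous" {
			total += s.Count
		}
	}
	return total
}

func main() {
	samples := []RateSample{
		{RPC: "NodePool.List", Token: "anonymous", Count: 3},
		{RPC: "NodePool.List", Token: "ci-deployer", Count: 40},
		{RPC: "NodePool.GetNodePool", Token: "anonymous", Count: 2},
	}
	fmt.Println("likely authentication failures:", anonymousRequests(samples))
}
```

This only holds when the anonymous policy grants nothing useful; with a permissive anonymous policy the same count would include legitimate traffic.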
authorization
Authorization failures are a good bit more complicated, as illustrated by the difference between NodePool.List and NodePool.GetNodePool. In "List" RPCs, authorization is used to filter results (e.g. node_pool_endpoint.go#L73-L81) and we never return errors, whereas in "Read/Write" RPCs, authorization failures return an error (e.g. node_pool_endpoint.go#L134-L136).
Adding authorization failure metrics on the List RPCs doesn't make a lot of sense because every List RPC request is going to filter out a lot of content based on that ACL filter (outside of trivial deployments). But on top of that, some of the more complicated non-List RPCs use filters as part of an initial query and then use yes/no authorization for the final operation.
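The difference between the two styles can be sketched as follows. This is a simplified model, not the real endpoint code: the ACL and Endpoint types, the AllowPool method, and the denied counter are all hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
)

// errPermissionDenied is a hypothetical sentinel for an authorization failure.
var errPermissionDenied = errors.New("permission denied")

// ACL is a hypothetical stand-in for a resolved ACL object.
type ACL struct{ allowedPools map[string]bool }

// AllowPool reports whether the ACL grants access to the named pool.
func (a ACL) AllowPool(name string) bool { return a.allowedPools[name] }

// Endpoint is a toy RPC endpoint with a counter for authorization denials.
type Endpoint struct {
	pools  []string
	denied int
}

// List uses the ACL as a filter. It never returns an error, so there is no
// natural "failure" event to count here.
func (e *Endpoint) List(acl ACL) []string {
	out := []string{}
	for _, p := range e.pools {
		if acl.AllowPool(p) {
			out = append(out, p)
		}
	}
	return out
}

// Get uses the ACL as a yes/no check. A denial returns an error, which is
// the point where an authorization failure metric could be emitted.
func (e *Endpoint) Get(acl ACL, name string) (string, error) {
	if !acl.AllowPool(name) {
		e.denied++
		return "", errPermissionDenied
	}
	for _, p := range e.pools {
		if p == name {
			return p, nil
		}
	}
	return "", nil // not found is not an authorization failure
}

func main() {
	e := &Endpoint{pools: []string{"default", "gpu"}}
	acl := ACL{allowedPools: map[string]bool{"default": true}}

	fmt.Println("visible pools:", e.List(acl))
	_, err := e.Get(acl, "gpu")
	fmt.Println("get gpu:", err, "denials counted:", e.denied)
}
```

Counting every filtered-out item in List would mostly measure how restrictive the ACL is, not how often someone was refused, which is why only the Get path increments the counter in this sketch.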
So to implement authorization metrics, you've got to assess each of the roughly 150 RPC methods and determine which of their authorization calls would get metrics. If we were to implement this part at all, I would definitely want to make sure authorization failures are measured separately from authentication failures.