Client Agents Improperly Use AllowStale=true with ACL.ResolveToken
Nomad version
v1.4.3 and earlier
Issue
Client agents resolve ACL tokens via the ACL.ResolveToken RPC but improperly always set AllowStale=true.
If the Client is connected to a Server that is in the process of restoring a snapshot, either locally or from a remote Server, this could lead to a 404 Not Found being returned for an otherwise valid token.
In the case of a slow or malfunctioning Server responding to the RPC, an arbitrarily old ACL token may be returned even if a later Raft log entry deletes it.
Caching negative results
The ACL token cache also caches negative results. While the cache is of fixed size to prevent DoS via OOM, filling the cache with negative results would fully negate caching, because valid tokens would be evicted to make room for them.
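To illustrate the eviction problem, here is a minimal sketch in Go. It assumes the cache behaves like a fixed-size LRU; `lruCache`, `entry`, and `floodWithInvalid` are hypothetical names for this example and are not Nomad's actual implementation.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one cached lookup result. valid=false models a cached
// negative result ("token not found").
type entry struct {
	secretID string
	valid    bool
}

// lruCache is a tiny fixed-size LRU cache standing in for the Client's
// token cache (hypothetical, for illustration only). size must be >= 1.
type lruCache struct {
	size  int
	order *list.List               // front = most recently used
	items map[string]*list.Element // secretID -> element in order
}

func newLRUCache(size int) *lruCache {
	return &lruCache{
		size:  size,
		order: list.New(),
		items: make(map[string]*list.Element),
	}
}

// put inserts or refreshes a result, evicting the least recently used
// entry when the cache is full.
func (c *lruCache) put(secretID string, valid bool) {
	if el, ok := c.items[secretID]; ok {
		el.Value = entry{secretID: secretID, valid: valid}
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.size {
		oldest := c.order.Back()
		delete(c.items, oldest.Value.(entry).secretID)
		c.order.Remove(oldest)
	}
	c.items[secretID] = c.order.PushFront(entry{secretID: secretID, valid: valid})
}

// contains reports whether any result for secretID is cached.
func (c *lruCache) contains(secretID string) bool {
	_, ok := c.items[secretID]
	return ok
}

// floodWithInvalid caches n negative results for made-up secret IDs,
// as would happen if every failed lookup were cached.
func floodWithInvalid(c *lruCache, n int) {
	for i := 0; i < n; i++ {
		c.put(fmt.Sprintf("bogus-%d", i), false)
	}
}
```

With a cache of size 4 holding one valid token, calling `floodWithInvalid(c, 4)` pushes the valid token out, so the next request for it has to go back to a Server.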
Fix
- ACL.ResolveToken should only be called with AllowStale=false
- ACL.ResolveToken should allow batches of tokens to be looked up in a single request.
- Cache misses should be batched on the Client for a short period of time (100ms?) or until the batch reaches a size limit (100 tokens?)
- ACL.ResolveToken calls should be serialized to effectively rate limit them (if the numbers above are used, roughly 1 request of up to 100 tokens per 100ms, or about 1 token lookup per millisecond)
- Negative results should not be cached
While this does not prevent all DoS vectors, it does limit the amount of work caused by negative lookups. It also means a token is available immediately upon being written. Previously, Client token resolution could fail to find a newly created token and then cache it as invalid for 30 seconds.
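The batching, serialization, and no-negative-caching points from the Fix list could be sketched roughly as follows. This is only an illustration under assumptions: `Resolver`, `BatchLookup`, `Token`, and the two constants are hypothetical names, and `BatchLookup` stands in for a batched ACL.ResolveToken RPC issued with AllowStale=false, which does not exist in this form today.

```go
package main

import (
	"sync"
	"time"
)

// Tentative numbers taken from the list above.
const (
	defaultMaxWait = 100 * time.Millisecond
	defaultMaxSize = 100
)

// Token is a minimal stand-in for an ACL token.
type Token struct{ SecretID string }

// BatchLookup stands in for a batched ACL.ResolveToken RPC sent with
// AllowStale=false. It returns only the tokens that exist.
type BatchLookup func(secretIDs []string) map[string]*Token

type pending struct {
	secretID string
	reply    chan *Token
}

// Resolver batches cache misses and caches positive results only.
type Resolver struct {
	mu      sync.Mutex
	cache   map[string]*Token
	queue   chan pending
	lookup  BatchLookup
	maxWait time.Duration
	maxSize int
}

// NewResolver starts the batching goroutine. The sketch never stops it.
func NewResolver(lookup BatchLookup, maxWait time.Duration, maxSize int) *Resolver {
	r := &Resolver{
		cache:   make(map[string]*Token),
		queue:   make(chan pending),
		lookup:  lookup,
		maxWait: maxWait,
		maxSize: maxSize,
	}
	go r.run()
	return r
}

// Resolve returns the token for secretID, or nil if it was not found.
func (r *Resolver) Resolve(secretID string) *Token {
	r.mu.Lock()
	tok, ok := r.cache[secretID]
	r.mu.Unlock()
	if ok {
		return tok
	}
	p := pending{secretID: secretID, reply: make(chan *Token, 1)}
	r.queue <- p
	return <-p.reply
}

// run is the only goroutine that calls lookup, so requests are
// serialized: at most one is in flight at any time.
func (r *Resolver) run() {
	for first := range r.queue {
		batch := []pending{first}
		timer := time.NewTimer(r.maxWait)
	collect:
		for len(batch) < r.maxSize {
			select {
			case p := <-r.queue:
				batch = append(batch, p)
			case <-timer.C:
				break collect
			}
		}
		timer.Stop()
		r.flush(batch)
	}
}

// flush resolves one batch and answers every waiting caller.
func (r *Resolver) flush(batch []pending) {
	ids := make([]string, 0, len(batch))
	for _, p := range batch {
		ids = append(ids, p.secretID)
	}
	found := r.lookup(ids)

	r.mu.Lock()
	for id, tok := range found {
		r.cache[id] = tok // misses are deliberately not cached
	}
	r.mu.Unlock()

	for _, p := range batch {
		p.reply <- found[p.secretID] // nil when the token was not found
	}
}
```

A caller would construct it with `NewResolver(lookup, defaultMaxWait, defaultMaxSize)`. Repeated lookups of an unknown secret ID reach `lookup` every time, while a known token is served from the cache after its first resolution. A real implementation would also need cache expiry and a way to shut the goroutine down.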
The default for caching tokens could probably be larger for tokens with an expiration set. I don't think we would want to cache for the entire expiration period, as tokens may also be manually revoked. However, since we would now query the leader and would not cache negative results, a default of more than 30s might be appropriate.
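One way to picture this TTL idea is the sketch below. The function `cacheTTL` and the `extendedTTL` value of 5 minutes are assumptions made up for this example. Only the 30 second figure comes from the text above, and applying the larger value solely to tokens with an expiration is one possible reading of the suggestion.

```go
package main

import "time"

const (
	// defaultTTL is the 30 second cache period mentioned above.
	defaultTTL = 30 * time.Second
	// extendedTTL is a hypothetical larger default for tokens that
	// have an expiration set. The value is illustrative only.
	extendedTTL = 5 * time.Minute
)

// cacheTTL picks how long to cache a resolved token. remaining is the
// time left until the token expires and is ignored when hasExpiry is
// false.
func cacheTTL(hasExpiry bool, remaining time.Duration) time.Duration {
	if !hasExpiry {
		return defaultTTL
	}
	if remaining <= 0 {
		return 0 // already expired: do not cache
	}
	if remaining < extendedTTL {
		return remaining // never cache past the expiration
	}
	// Cap well below the full expiration period, since the token may
	// still be revoked manually.
	return extendedTTL
}
```

Under these assumptions, a token expiring in 24 hours would be cached for 5 minutes, one expiring in 1 minute for 1 minute, and a token without an expiration for the existing 30 seconds.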