Invalid Vault token (403) in a Nomad client after recycling Nomad servers
Nomad version
v1.5.15+ent
Operating system and Environment details
Ubuntu 22.04 - AWS EC2 instances
Issue
It looks like we've hit a bug where a Nomad client starts receiving 403s from Vault while we're in the middle of recycling the Nomad servers (3-node cluster: we spin up 3 new servers, then slowly shut the old ones down one by one). This has happened twice recently in our production systems.
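When this happens, Vault appears to reject the workload's token itself rather than a single secret path. A minimal, illustrative probe built on the official Vault Go client can confirm this from the affected client (VAULT_ADDR and VAULT_TOKEN are read from the environment; this is a sketch, not our production tooling):

```go
package main

import (
	"log"
	"time"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	// DefaultConfig picks up VAULT_ADDR; NewClient picks up VAULT_TOKEN.
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatalf("vault client: %v", err)
	}

	for {
		// A token self-lookup starts failing with "permission denied" (403)
		// as soon as Vault stops accepting the token, matching the agent logs.
		if _, err := client.Auth().Token().LookupSelf(); err != nil {
			log.Printf("lookup-self failed: %v", err)
		} else {
			log.Printf("token still valid")
		}
		time.Sleep(5 * time.Second)
	}
}
```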
Reproduction steps
- Cluster with 3 servers and multiple Nomad clients with Vault integration enabled. NOTE: only 1 client is affected (out of a pool of 84 clients)
- Add 3 extra server nodes to the cluster (we watch server membership during the roll; see the sketch after these steps)
- Immediately after the new nodes join (and one of them is automatically promoted via autopilot), the client misses a heartbeat due to an RPC timeout. Client logs:
[Oct 09, 2024 at 9:43:56.188 pm]
client.rpc: error performing RPC to server: error="rpc error: failed to get conn: dial tcp 10.181.3.215:4647: i/o timeout" rpc=Node.UpdateStatus server=10.181.3.215:4647
client.rpc: error performing RPC to server: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=10.181.3.215:4647
client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=10.181.3.215:4647
client: error heartbeating. retrying: error="failed to update status: rpc error: failed to get conn: dial tcp 10.181.3.215:4647: i/o timeout" period=1.617448288s
client.consul: discovered following servers: servers=[10.181.3.134:4647, 10.181.3.215:4647, 10.181.2.84:4647, 10.181.1.241:4647, 10.181.1.177:4647, 10.181.2.12:4647]
client: missed heartbeat: req_latency=21.613428ms heartbeat_ttl=16.683772489s since_last_heartbeat=26.713400803s
agent: (view) vault.read(secret/data/service/xx/yy): vault.read(secret/data/service/xx/yy): Error making API request.
URL: GET https://vault.service.consul:8200/v1/secret/data/service/xx/yy
Code: 403. Errors:
* permission denied (retry attempt 1 after "250ms")
server logs:
[Oct 09, 2024 at 9:43:47.731 pm] nomad.autopilot: Promoting server: id=a0498eba-bc93-76d2-be12-5477c3db9dfe address=10.181.3.215:4647 name=nomad-server-10-181-3-215.global
[Oct 09, 2024 at 9:43:56.227 pm] nomad.heartbeat: node TTL expired: node_id=a05735cd-8fa4-28bf-99cf-d160f6f73922
- I don't think the "Promoting server" message indicates a leader election, since the rest of the logs show another node acquiring leadership later in the recycling process (~5 minutes later)
- After that, the client is rejected by Vault with 403s on every request for 8+ minutes (so even after the re-election has happened)
- New servers finish registering in Consul
- After the 3 old servers have left the cluster, the client no longer receives 403s from Vault
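For reference, the membership-watching sketch mentioned above: a small loop against the official Nomad Go API client that reports the same serf membership as `nomad server members` on the CLI (NOMAD_ADDR is read from the environment; illustrative only):

```go
package main

import (
	"log"
	"time"

	nomad "github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig honors NOMAD_ADDR and related environment variables.
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatalf("nomad client: %v", err)
	}

	for {
		// Agent().Members() returns the current serf membership,
		// so we can see old servers leaving and new ones joining.
		members, err := client.Agent().Members()
		if err != nil {
			log.Printf("members: %v", err)
		} else {
			for _, m := range members.Members {
				log.Printf("%s %s:%d status=%s", m.Name, m.Addr, m.Port, m.Status)
			}
		}
		time.Sleep(10 * time.Second)
	}
}
```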
Expected Result
The client should continue to operate normally while the Nomad servers are being rolled.
Actual Result
The client misses heartbeats and receives 403s from Vault until the old servers leave the cluster.