
Invalid Vault token (403) in a Nomad client after recycling Nomad servers

adrianlop opened this issue 4 months ago

Nomad version

v1.5.15+ent

Operating system and Environment details

Ubuntu 22.04 - AWS EC2 instances

Issue

It looks like we've hit a bug where a Nomad client starts receiving 403s from Vault while we're in the middle of recycling the Nomad servers (3-node cluster: we spin up 3 new servers, then slowly shut the old ones down one by one). This has happened twice recently in our production systems.

Reproduction steps

  • Cluster with 3 servers and multiple Nomad clients with the Vault integration enabled. NOTE: only 1 client is affected (out of a pool of 84 clients)
  • Add 3 extra server nodes to the cluster
  • Immediately after the new nodes join (and one of them is promoted automatically via Autopilot), the client misses a heartbeat with a timeout. Client logs:
[Oct 09, 2024 at 9:43:56.188 pm]

client.rpc: error performing RPC to server: error="rpc error: failed to get conn: dial tcp 10.181.3.215:4647: i/o timeout" rpc=Node.UpdateStatus server=10.181.3.215:4647

client.rpc: error performing RPC to server: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=10.181.3.215:4647
client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=10.181.3.215:4647

client: error heartbeating. retrying: error="failed to update status: rpc error: failed to get conn: dial tcp 10.181.3.215:4647: i/o timeout" period=1.617448288s

client.consul: discovered following servers: servers=[10.181.3.134:4647, 10.181.3.215:4647, 10.181.2.84:4647, 10.181.1.241:4647, 10.181.1.177:4647, 10.181.2.12:4647]

client: missed heartbeat: req_latency=21.613428ms heartbeat_ttl=16.683772489s since_last_heartbeat=26.713400803s

agent: (view) vault.read(secret/data/service/xx/yy): vault.read(secret/data/service/xx/yy): Error making API request.

URL: GET https://vault.service.consul:8200/v1/secret/data/service/xx/yy
Code: 403. Errors:

* permission denied (retry attempt 1 after "250ms")

servers:

[Oct 09, 2024 at 9:43:47.731 pm] nomad.autopilot: Promoting server: id=a0498eba-bc93-76d2-be12-5477c3db9dfe address=10.181.3.215:4647 name=nomad-server-10-181-3-215.global

[Oct 09, 2024 at 9:43:56.227 pm] nomad.heartbeat: node TTL expired: node_id=a05735cd-8fa4-28bf-99cf-d160f6f73922
  • The "Promoting server" message doesn't appear to mean a leader election, since the rest of the logs show a different node acquiring leadership later in the recycling process (~5 min later)

  • After that, Vault rejects all of the client's requests with 403s for 8+ minutes (so, even after the re-election has happened)

  • New servers finish registering in Consul

  • After the 3 old servers have left the cluster, the client stops receiving 403s from Vault.
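The missed-heartbeat log line above is just the age of the client's last successful heartbeat exceeding the TTL the servers handed it (since_last_heartbeat=26.7s > heartbeat_ttl=16.68s). A minimal sketch of that check, using the values from the log (the function name is ours, not Nomad's):

```python
def heartbeat_missed(since_last_heartbeat_s: float, heartbeat_ttl_s: float) -> bool:
    """A node's TTL is considered expired once the time since its last
    successful heartbeat exceeds the TTL the servers assigned to it."""
    return since_last_heartbeat_s > heartbeat_ttl_s

# Values from the client log above:
print(heartbeat_missed(26.713400803, 16.683772489))  # True -> "node TTL expired"
```

Once the TTL expires, the servers mark the node down, which is presumably what kicks off the Vault-side token trouble described below.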

Expected Result

The client should continue to operate normally while the Nomad servers are being rolled.

Actual Result

The client is interrupted and receives 403s from Vault.
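The `permission denied (retry attempt 1 after "250ms")` line in the client logs shows the template runner retrying the Vault read with a delay. A generic sketch of that kind of doubling-backoff retry loop around a flaky read (all names and parameters here are illustrative, not Nomad's actual retry implementation), simulating two 403s before success:

```python
import time

def with_retries(fn, base_delay=0.25, max_attempts=5):
    """Call fn, retrying on PermissionError with an exponentially doubling
    delay (base_delay, 2*base_delay, ...); re-raise after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except PermissionError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * (2 ** (attempt - 1)))

calls = {"n": 0}
def flaky_read():
    """Simulated Vault read that returns 403 twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise PermissionError("403 permission denied")
    return "secret-value"

print(with_retries(flaky_read))  # secret-value
```

In our case the retries never helped: the token itself was being rejected for 8+ minutes, so every attempt hit the same 403 until the old servers left the cluster.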

adrianlop · Oct 18 '24 12:10