nomad
System job not restarting after client failure.
Nomad version
1.4.2
Operating system and Environment details
Ubuntu arm64
Issue
If a client goes down, the system job's allocation on that client is not restarted after the client comes back up; it has to be restarted manually.
Also, the job status on the client shows 2 failed. It should show 1 failed and 1 running because, as you can see below, one allocation is running.
Reproduction steps
Run a system job. Take the client (or the whole cluster?) down. Bring the nodes back up. Check whether the system job has an allocation on every node.
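The final check can be scripted rather than done by eye. The following is only a rough sketch: it assumes the node list and the job's allocation list have already been fetched and parsed from the Nomad HTTP API (for example from `/v1/nodes` and `/v1/job/<job>/allocations`), and that the job has no constraints excluding any node. The function name and the sample data are hypothetical and not taken from the affected cluster.

```python
def nodes_missing_system_alloc(nodes, allocs):
    """Return IDs of ready, eligible nodes with no running allocation of the job.

    nodes  -- list of node stubs (dicts with ID, Status, SchedulingEligibility)
    allocs -- list of allocation stubs for one job (dicts with NodeID, ClientStatus)
    """
    # Nodes that currently host at least one running allocation of the job.
    running_on = {a["NodeID"] for a in allocs if a.get("ClientStatus") == "running"}
    return sorted(
        n["ID"]
        for n in nodes
        if n.get("Status") == "ready"
        and n.get("SchedulingEligibility") == "eligible"
        and n["ID"] not in running_on
    )


# Hypothetical sample data: two clients are back up, one is still down.
nodes = [
    {"ID": "node-a", "Status": "ready", "SchedulingEligibility": "eligible"},
    {"ID": "node-b", "Status": "ready", "SchedulingEligibility": "eligible"},
    {"ID": "node-c", "Status": "down", "SchedulingEligibility": "eligible"},
]
allocs = [
    {"NodeID": "node-a", "ClientStatus": "failed"},   # old allocation, later replaced
    {"NodeID": "node-a", "ClientStatus": "running"},
    {"NodeID": "node-b", "ClientStatus": "failed"},   # never replaced after restart
]

missing = nodes_missing_system_alloc(nodes, allocs)
print(missing)
```

An empty result would correspond to the expected behaviour; any node ID listed is a ready client on which the system job was not placed again.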
Expected Result
All allocations present.
Actual Result
Not all allocations present.
Job file (if appropriate)
Same as https://github.com/hashicorp/nomad/issues/14932
Nomad Server logs (if appropriate)
The logs show the allocation was killed due to:
Template failed: nomad.var.get(nomad/jobs/caddy/caddy/[email protected]): Unexpected response code: 500 (rpc error: failed to get conn: rpc error: lead thread didn't get connection)