System job not restarting after client failure.

Open blmhemu opened this issue 3 years ago • 0 comments

Nomad version

1.4.2

Operating system and Environment details

Ubuntu arm64

Issue

If a client goes down, the system job's allocation on that client is not restarted once the client is back up, unless it is restarted manually.

[Screenshot 2022-10-28 at 1 51 09 PM]

Also, the job status shown on the client reports 2 failed. It should be 1 failed and 1 passed because, as the screenshot shows, one allocation is running.

Reproduction steps

1. Run a system job.
2. Take the client (or the whole cluster?) down.
3. Bring the nodes back up.
4. Check whether the system job has all of its allocations.
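The last step can be checked by script rather than by eye. This is only a rough sketch: it assumes the node and allocation lists have already been fetched as JSON (for example from `nomad node status -json` and `nomad job allocs -json <job>`) and that the entries carry the `ID`, `Status`, `SchedulingEligibility`, `NodeID` and `ClientStatus` fields. The function name and the sample data are hypothetical, and a job with constraints may legitimately skip some nodes.

```python
def nodes_missing_system_alloc(nodes, allocs):
    """Return IDs of ready, eligible nodes that have no running allocation.

    `nodes` and `allocs` are lists of dicts; the field names used here are
    assumptions about the JSON shape, not taken from the issue.
    """
    # Nodes that currently host a running allocation of the job.
    running = {a["NodeID"] for a in allocs if a.get("ClientStatus") == "running"}
    return sorted(
        n["ID"]
        for n in nodes
        if n.get("Status") == "ready"
        and n.get("SchedulingEligibility", "eligible") == "eligible"
        and n["ID"] not in running
    )


# Hypothetical sample data: two clients are back up, one is still down.
sample_nodes = [
    {"ID": "node-a", "Status": "ready", "SchedulingEligibility": "eligible"},
    {"ID": "node-b", "Status": "ready", "SchedulingEligibility": "eligible"},
    {"ID": "node-c", "Status": "down", "SchedulingEligibility": "eligible"},
]
sample_allocs = [
    {"NodeID": "node-a", "ClientStatus": "running"},
    {"NodeID": "node-b", "ClientStatus": "failed"},
]

print(nodes_missing_system_alloc(sample_nodes, sample_allocs))
```

An empty list would correspond to the expected result; any node ID in the output is a client that is up but has no running allocation of the system job.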

Expected Result

All allocations present.

Actual Result

Not all allocations present.

Job file (if appropriate)

Same as https://github.com/hashicorp/nomad/issues/14932

Nomad Server logs (if appropriate)

The logs show that the alloc was killed due to:

Template failed: nomad.var.get(nomad/jobs/caddy/caddy/[email protected]): Unexpected response code: 500 (rpc error: failed to get conn: rpc error: lead thread didn't get connection)

Nomad Client logs (if appropriate)

blmhemu, Oct 28 '22 08:10