allow force rescheduling of a failed alloc with no `reschedule.attempts` remaining
#### Nomad version

Verified the same behavior in 1.9.5, 1.10.5, and 1.11.1.

```
Nomad v1.11.1
BuildDate 2025-12-09T20:10:56Z
Revision 5b76eb0535615e32faf4daee479f7155ea16ec0d
```
#### Operating system and Environment details

Fedora 40
#### Issue

The job definition has the following blocks:

```hcl
reschedule {
  attempts  = 0
  unlimited = false
}

restart {
  attempts = 0
  mode     = "fail"
}
```
This is done because in our use case we cannot allow the application to restart automatically upon failure; it should remain down until the failure can be investigated. However, this also has the consequence that submitting the same job spec to Nomad does not start it. The log we see on the Nomad side is:

```
2025-12-11T04:55:38.388Z [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=40c0c62d-f2dd-69b9-742a-3a7d8609bdfd task=prestart reason="Policy allows no restarts"
```

And the app remains in the failed state. To start it again we first have to purge the job from Nomad and then submit it again.
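For reference, this is the recovery sequence we are currently forced to use (a sketch against a live cluster; the filename `test-bug.nomad.hcl` is just an example):

```shell
# Purge the failed job entirely; -purge removes the job from Nomad's
# state instead of just marking it stopped.
nomad job stop -purge test-bug

# Re-submit the identical jobspec; only after the purge does the
# scheduler place a fresh allocation.
nomad job run test-bug.nomad.hcl
```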
#### Reproduction steps
```hcl
job "test-bug" {
  region = "crypto"
  type   = "service"

  group "grp" {
    reschedule {
      attempts  = 0
      unlimited = false
    }

    restart {
      attempts = 0
      mode     = "fail"
    }

    task "main" {
      driver         = "raw_exec"
      kill_timeout   = "180s"
      kill_signal    = "SIGCONT" # Just for the sake of having it defined, not used in repro
      shutdown_delay = "60s"

      config {
        command = "sleep"
        args    = ["infinity"]
      }
    }
  }
}
```
- Submit the job with the above blocks that prevent automated restarts.
- Send it to the failed state with `kill -9 <PID>` on the host it is running on.
- Try to start it again by submitting the same job spec (via the API).
#### Expected Result

The job starts when the same spec is submitted, because this is a user action and therefore intentional.
#### Actual Result

The job is not started, and the logs show the error "Policy allows no restarts".
Hi @akamensky! Because you have reschedules disabled as well, submitting the exact same jobspec is intentionally a no-op.
There should be a workaround here, which is to stop the job without purging it (which increments the job version number) and then revert it to the failed version via the Job Revert API. Reverting the job copies it to a new version, but because the old allocation that matches the same task definition is still sitting around, the scheduler isn't detecting this as an update.
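In CLI terms that workaround looks roughly like this (a sketch against a live cluster; the job name and version number come from the reproduction above and would differ in practice):

```shell
# Stop the job without purging it; this bumps the job version.
nomad job stop test-bug

# List the job's versions to find the failed one.
nomad job history test-bug

# Revert to the failed version, which creates another new version.
nomad job revert test-bug 0
```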
If you GC the allocation, you should be able to get a new one with that workaround, but that is obviously only better than `nomad job stop -purge` if you have multiple allocations for the job. And we don't give you fine-grained control like "please just GC this one allocation". Another common way to work around this is to have a `meta` value in the jobspec that you can increment to bump the version, but that's not very nice either.
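The `meta` bump mentioned above would look something like this (a sketch; the key name `redeploy` is arbitrary):

```hcl
group "grp" {
  # Incrementing this value changes the job definition, which forces
  # Nomad to register a new job version and place a new allocation.
  meta {
    redeploy = "2"
  }
}
```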
I know there have been requests roughly similar to this in the past, to the tune of a "no really, please reschedule" option that would strip the reschedule tracker off an allocation and allow it to be rescheduled, but I think folks have worked around it with the options above, so it hasn't previously been prioritized.
This is working as intended but I agree it isn't a very good UX. I'm going to re-title it as an enhancement and mark it for roadmapping.
#### Reproduction
If I run this jobspec and kill the task, I see the following events from the original failed allocation:
```
Recent Events:
Time                       Type            Description
2025-12-15T16:16:47-05:00  Not Restarting  Policy allows no restarts
2025-12-15T16:16:47-05:00  Terminated      Exit Code: 137, Signal: 9
2025-12-15T16:16:29-05:00  Started         Task started by client
2025-12-15T16:16:29-05:00  Task Setup      Building Task Directory
2025-12-15T16:16:29-05:00  Received        Task received by client
```
If I submit the same jobspec, the eval won't return any placements, so I should see something like this:
```
$ nomad eval list
ID        Priority  Triggered By        Job ID   Namespace  Node ID  Status    Placement Failures
9253fa0a  50        job-register        example  default    <none>   complete  false
d2220b60  50        deployment-watcher  example  default    <none>   complete  false
e12f0202  50        job-register        example  default    <none>   complete  false
```
Here e12f0202 is the original job registration, and d2220b60 is the deployment wrapping up. For 9253fa0a, we see the following:
```
$ nomad eval status 9253fa0a
ID                 = 9253fa0a
Create Time        = 12s ago
Modify Time        = 12s ago
Status             = complete
Status Description = complete
Type               = service
TriggeredBy        = job-register
Job ID             = example
Namespace          = default
Priority           = 50
Placement Failures = false
Previous Eval      = <none>
Next Eval          = <none>
Blocked Eval       = <none>

Plan Annotations
Task Group  Ignore  Place  Stop  InPlace  Destructive
group       1       0      0     0        0
```
Commands like `nomad job restart` work on live allocations by sending the Restart Allocation or Stop Allocation APIs, but neither of those helps you once an allocation is terminal.
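For reference, the CLI equivalents of those APIs look like this (a sketch; the alloc ID is the failed one from the log above, and both commands only work while the allocation is still live):

```shell
# Restart the tasks in a running allocation (Restart Allocation API).
nomad alloc restart 40c0c62d

# Stop and reschedule a running allocation (Stop Allocation API).
nomad alloc stop 40c0c62d
```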
Even something like the following won't overcome the fact that the job is terminal:

```
$ echo '{"JobID": "example", "EvalOptions": {"ForceReschedule": true}}' | nomad operator api -X POST "/v1/job/example/evaluate"
```
@tgross Thanks for the detailed explanation. I would like to clarify the reasoning for rescheduling being disabled in the example jobs. The brief version is that in our use case we must not allow the application to restart automatically under any circumstances. If it experiences an issue, it should remain failed until we have investigated and want to bring it back up. Both the `reschedule` and `restart` configs would restart the application if it crashed (with some differences between them, but the gist is that both try to automatically recover the job). Therefore in all our jobs we have to disable both.
That being said, I agree with your statement that it is not very good UX. I think the problem can also be looked at from a different angle: there should be a way to tell Nomad not to attempt to automatically recover jobs while still being able to control the jobs proactively. There is a difference between unattended and active operations.
Jobs can fail due to external factors: a misbehaving dependency, bad data, etc. In our setup this is the most common occurrence, and in such cases the job itself requires no changes. Restarting it automatically can be a problem because it can hide the issue (for example, if the application runs for some time before failing, monitoring will see the job as alive), and it may also send bad data downstream when brought back up. Hence automatic recovery is not a good solution in those cases. Meanwhile, once the failure is investigated and resolved, we need to be able to start the job again in the same form as before. As it stands, however, we lose control of the job once it goes into the failed state in Nomad. So while to you, knowing all the internals of Nomad, this may look like an enhancement, to users of the product it is a bug.
I do have a workaround in mind, but having to resort to workarounds to achieve fairly logical system behaviour is not great.