Batch jobs using Vault fail with error "unable to find token for workload"
Nomad version
Nomad v1.10.1
BuildDate 2025-05-13T07:40:43Z
Revision 3431f13e8036b4716aac0e3b8c5854ddca212e5c
Operating system and Environment details
Debian GNU/Linux 13 (trixie) x86_64, 6.9.9-amd64
Nomad installed through apt: nomad/bookworm,now 1.10.1-1 amd64
Issue
Unable to run any periodic/batch jobs that use Vault; they fail with the following error:
[INFO] client.alloc_runner.task_runner: Task event: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro type=Received msg="Task received by client" failed=false
[INFO] client.alloc_runner.task_runner: Task event: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro type="Task Setup" msg="Building Task Directory" failed=false
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=5s
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=10s
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=20s
From what I can tell, the error comes from here; perhaps it is related to the mentioned race condition? https://github.com/hashicorp/nomad/blob/348177d118a129cdc9196b6f8cb0181caac8b41b/client/widmgr/widmgr.go#L147-L149
This is happening with all the periodic jobs I have, and it seems to happen regardless of what the job specification looks like, as long as it contains a vault section. The issue appeared without me consciously changing anything in Nomad, Vault, or the job specifications; the jobs just started failing on their own.
Reproduction steps
- Submit the job to Nomad
- Force launch a periodic job
- The job should be stuck in either the pending or recovering state (depending on how long it is left alone), with the error above in the Nomad logs
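The steps above can be sketched with the Nomad CLI. This is only a minimal sketch: it assumes the minimal job shown under "Job file" is saved as periodic-repro.nomad.hcl (a hypothetical filename), that the CLI is already pointed at the cluster (address, region, and any ACL token configured as needed), and it wraps the commands in a hypothetical helper named repro. The namespace and job name are taken from the job file.

```shell
# Hypothetical helper wrapping the reproduction steps.
# Assumption: the minimal job is saved as periodic-repro.nomad.hcl.
repro() {
  job_file="${1:-periodic-repro.nomad.hcl}"

  # Step 1: submit the periodic job (namespace and region come from the job file).
  nomad job run "$job_file" || return 1

  # Step 2: force an immediate launch instead of waiting for the cron schedule.
  nomad job periodic force -namespace=monitor periodic-repro || return 1

  # Step 3: inspect the job and the child job created by the forced launch.
  nomad job status -namespace=monitor periodic-repro
}
```

After calling repro, the client logs can be searched for "failed to derive Vault token" to check whether the forced launch hit the error.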
Expected Result
Job starts and runs without issue.
Actual Result
The started job (and any future instances of the periodic job) is stuck in the pending state indefinitely, with the mentioned error in the Nomad logs.
Job file (if appropriate)
Reproducible with this minimal job:
job "periodic-repro" {
  region      = "dk"
  datacenters = ["dk1"]
  type        = "batch"
  namespace   = "monitor"

  periodic {
    cron = "0 0 * * *"
  }

  group "periodic-repro" {
    count = 1

    task "periodic-repro" {
      driver = "docker"

      config {
        image        = "busybox:latest"
        network_mode = "host"
        command      = "sh"
        args         = ["-c", "echo 'Hello World' && sleep 120"]
      }

      resources {
        cpu    = 100
        memory = 256
      }

      vault {
        policies    = ["monitor-nrgi"]
        change_mode = "restart"
      }
    }
  }
}
Nomad Server logs (if appropriate)
These are the logs from my Nomad server, which is also the client the job runs on:
[INFO] client.alloc_runner.task_runner: Task event: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro type=Received msg="Task received by client" failed=false
[INFO] client.alloc_runner.task_runner: Task event: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro type="Task Setup" msg="Building Task Directory" failed=false
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=5s
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=10s
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=20s
I just tried running the job as a normal job instead of a batch job and it gets the same error, so the issue does not seem to be specific to batch jobs.
There is nothing in the Vault logs when the job starts either.
Heya @Zarickan, thanks for the report! That error does only seem to come from that one spot. We'll get it prioritized and take a look!
I also had this issue after upgrading to v1.10.2. Downgrading to 1.9.3 solved the issue.
Followed @SamuelM333 - had the same issue after upgrading to v1.10.2. Downgrading to 1.9.3 solved the issue.
I had the same issue.
Can confirm this issue is still present on v1.10.5. Sadly, this makes the Vault integration unusable under 1.10.
I'm running into this with a job that contains two tasks, each with their own vault roles & policies. I have several other batch jobs with vault integration that are working just fine, but they all have a singular group & task.