Batch jobs using vault fail with error "unable to find token for workload"

Open Zarickan opened this issue 9 months ago • 7 comments

Nomad version

Nomad v1.10.1
BuildDate 2025-05-13T07:40:43Z
Revision 3431f13e8036b4716aac0e3b8c5854ddca212e5c

Operating system and Environment details

Debian GNU/Linux 13 (trixie) x86_64, 6.9.9-amd64. Nomad installed through apt: nomad/bookworm,now 1.10.1-1 amd64

Issue

I am unable to run any periodic/batch jobs that use Vault. They fail with the following error:

[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro type=Received msg="Task received by client" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro type="Task Setup" msg="Building Task Directory" failed=false
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=5s
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=10s
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=20s

From what I can tell, the error comes from here; perhaps it is related to the race condition mentioned there? https://github.com/hashicorp/nomad/blob/348177d118a129cdc9196b6f8cb0181caac8b41b/client/widmgr/widmgr.go#L147-L149

This is happening with all the periodic jobs I have, and it seems to happen regardless of what the job specification looks like, as long as it contains a vault section. The issue appeared without me consciously changing anything in Nomad, Vault, or the job specification; the jobs just started failing on their own.

Reproduction steps

  1. Submit the job to Nomad
  2. Force launch the periodic job
  3. The job should be stuck in either the pending or the recovering state (depending on how long it is left alone), with the error above in the Nomad logs

Expected Result

Job starts and runs without issue.

Actual Result

The started job (and any future instance of the periodic job) is stuck in the pending state forever, with the error above in the Nomad logs.

Job file (if appropriate)

Reproducible with this minimal job:

job "periodic-repro" {
  region      = "dk"
  datacenters = ["dk1"]
  type        = "batch"
  namespace   = "monitor"

  periodic {
    cron = "0 0 * * *"
  }

  group "periodic-repro" {
    count = 1

    task "periodic-repro" {
      driver = "docker"

      config {
        image        = "busybox:latest"
        network_mode = "host"
        command      = "sh"
        args         = ["-c", "echo 'Hello World' && sleep 120"]
      }

      resources {
        cpu    = 100
        memory = 256
      }

      vault {
        policies    = ["monitor-nrgi"]
        change_mode = "restart"
      }
    }
  }
}
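
For comparison, the identity named in the error, `vault_default`, can also be declared explicitly on the task with an `identity` block. The fragment below is a minimal sketch of the task from the job above with that block added, assuming Vault is configured to accept Nomad workload identities. The `aud` and `ttl` values and the role name `nomad-workloads` are illustrative assumptions, not values from this cluster. The sketch uses `role` in place of `policies` on the assumption that the role grants the needed policy. It has not been confirmed that this avoids the error.

```hcl
task "periodic-repro" {
  driver = "docker"

  # config and resources blocks unchanged from the job above

  # Explicit workload identity for the default Vault cluster.
  # The name matches the identity in the error message; aud and ttl
  # are assumed values and must agree with what Vault's JWT auth expects.
  identity {
    name = "vault_default"
    aud  = ["vault.io"]
    ttl  = "1h"
  }

  vault {
    # Hypothetical role name; replace with a role that exists on the
    # Vault auth method used for Nomad workloads.
    role        = "nomad-workloads"
    change_mode = "restart"
  }
}
```

If the error persists with the identity declared explicitly, that would suggest the problem lies in how the client stores or looks up the signed identity, not in the job specification.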

Nomad Server logs (if appropriate)

These are the logs from my Nomad server, which is also the client the job runs on:

[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro type=Received msg="Task received by client" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro type="Task Setup" msg="Building Task Directory" failed=false
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=5s
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=10s
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=20s

Zarickan avatar Jun 01 '25 09:06 Zarickan

I just tried running the job as a normal job instead of a batch job, and it gets the same error, so the issue does not seem to be specific to batch jobs.

There is nothing in the Vault logs when the job starts either.

Zarickan avatar Jun 01 '25 09:06 Zarickan

Heya @Zarickan, thanks for the report! That error does only seem to come from that one spot. We'll get it prioritized and take a look!

gulducat avatar Jun 03 '25 19:06 gulducat

I also had this issue after upgrading to v1.10.2. Downgrading to 1.9.3 solved the issue.

SamuelM333 avatar Jul 10 '25 17:07 SamuelM333

Followed @SamuelM333 - had the same issue after upgrading to v1.10.2. Downgrading to 1.9.3 solved the issue.

sumansnorkell avatar Jul 12 '25 00:07 sumansnorkell

I had the same issue.

sumansaurabh avatar Jul 24 '25 06:07 sumansaurabh

Can confirm this issue is still present on v1.10.5. Sadly, this makes the Vault integration unusable under 1.10.

SamuelM333 avatar Sep 28 '25 20:09 SamuelM333

I'm running into this with a job that contains two tasks, each with its own Vault roles & policies. I have several other batch jobs with Vault integration that are working just fine, but they all have a single group & task.

dcarbone avatar Dec 07 '25 16:12 dcarbone