
Allocations stuck in pending state after adoption by a new deployment


Nomad version

Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85ee63f6472bd2db1e2487611f4b176c70c

Operating system and Environment details

All Linux

Issue

  • We have a large Nomad cluster: more than 2,000 agents.
  • We run health checks on each agent and do dynamic metadata updates.
  • We have a large job that gets scheduled on 99% of the healthy agents.
  • We use a shell script to query Nomad for the count of nodes matching a filter on the relevant dynamic metadata, and run nomad job scale JOB TG COUNT when that count changes (a sketch of this loop follows the list).
  • We have a way to identify if the currently running deployment is due to a new Docker Image, or due to a scaling activity.
  • If it's due to a scaling activity on another task group, we cancel that deployment, and issue a new nomad job scale command immediately afterwards.
    • Due to how frequently the dynamic metadata is changing, we are constantly doing deployments.
    • We kicked off 250 deployments in the last hour.
  • Existing allocations usually all get adopted by the new deployment.
  • The issues:
    • Most of the time, but not always, the version on existing allocs that are still "Pending" at deployment cancellation gets updated, but we end up with allocations that never get the version updated.
    • Sometimes the allocation exists but isn't assigned to a client in the UI; if you inspect the alloc with the Nomad CLI, it tells you which client it has been assigned to.
    • Allocations get stuck in the pending state. We've found that sending a SIGQUIT to the Nomad agent fixes this, but that doesn't scale to more than 2,000 nodes.
    • Allocations will get rescheduled before they ever leave the pending state. YEAH.
    • Possibly related, but maybe not: if a node heartbeat is missed and the node is reregistered, we can end up with stuck pending jobs on that node as well. This is also fixed with a SIGQUIT.
      • The stuck-jobs aspect is brutal, since as far as I can tell it never starts the health/deployment timers.
      • The fact that restarting the Nomad agent fixes this is a little unnerving. I vaguely recall seeing something in the logs about "allocations out of date" or similar when this issue has manifested in the past.
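
For illustration, here's a minimal sketch of the scaling loop described above, written in Go against the Nomad HTTP API rather than our actual shell script around the CLI. The metadata key in the filter expression, the job and task group names, and the poll interval are placeholders; whether dynamic node metadata is visible to the /v1/nodes filter expression depends on the Nomad version, and the /v1/job/:job_id/scale request body mirrors what nomad job scale sends.

package main

// Minimal sketch of the scale loop: count nodes matching a dynamic-metadata
// filter and scale the task group when the count changes. Names and the
// filter expression are placeholders, not the production script.

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

const nomadAddr = "http://127.0.0.1:4646" // assumes a local Nomad API

func matchingNodeCount(filter string) (int, error) {
	resp, err := http.Get(nomadAddr + "/v1/nodes?filter=" + url.QueryEscape(filter))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var nodes []map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&nodes); err != nil {
		return 0, err
	}
	return len(nodes), nil
}

func scaleGroup(job, group string, count int) error {
	body, _ := json.Marshal(map[string]interface{}{
		"Count":   count,
		"Target":  map[string]string{"Group": group},
		"Message": "scaling to match node count",
	})
	resp, err := http.Post(nomadAddr+"/v1/job/"+url.PathEscape(job)+"/scale",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("scale failed: %s", resp.Status)
	}
	return nil
}

func main() {
	// Hypothetical dynamic-metadata filter; the real key and value differ.
	filter := `Meta.pool_healthy == "true"`
	last := -1
	for {
		if count, err := matchingNodeCount(filter); err == nil && count != last {
			if scaleGroup("JOB", "TG", count) == nil {
				last = count
			}
		}
		time.Sleep(30 * time.Second)
	}
}

The cancel-and-rescale handling for in-flight deployments described above is omitted from this sketch.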

Reproduction steps

  • run a big cluster of nodes with intermittent connectivity issues to the master.
  • run a big long running job on said cluster
  • constantly be deploying/scaling the jobspec
  • wait

Expected Result

  • nomad schedules and runs jobs

Actual Result

  • nomad doesn't

Screenshots of Absurdity

In this screenshot, all the allocs are part of the same job, but different task groups. The colours correspond to the actual task groups. Note the varying versions, and that some have a client assigned and some don't.

Additionally, you can see the Modified time is the same for the allocs that are staying up to date but isn't changing on the others; you can also see the Created times are all over the place.

Screenshot 2024-06-11 at 3 35 54 PM

In this screenshot, you can see we have a pending allocation that has been rescheduled, and that rescheduled allocation is marked pending as well. And neither of the allocations have been assigned to a client as far as the Nomad WebUI is concerned.

Screenshot 2024-06-11 at 4 52 00 PM

Reporter's speculation

Maybe it has something to do with how the allocations are being adopted by the rapid deployments? This definitely reeks of a race condition.

Nomad logs

Useless, unfortunately, due to this bug: https://github.com/hashicorp/nomad/issues/22431

lattwood avatar Jun 11 '24 20:06 lattwood

Screenshot 2024-06-11 at 5 44 59 PM

in trying to kick the broken allocs, I've managed to get Nomad to try sending an interrupt to a pending allocation!

edit: had to killall -9 nomad on ca11.

lattwood avatar Jun 11 '24 20:06 lattwood

I figured out how to make it worse! If I drain the node and mark it as ineligible, then re-enable eligibility, all of the system jobs end up with additional allocations. The only solution here is restarting/killing the Nomad agent.

Screenshot 2024-06-11 at 6 09 06 PM

lattwood avatar Jun 11 '24 21:06 lattwood

Just adding for additional flavour/I found this hilarious-

[screenshot]

lattwood avatar Jun 11 '24 21:06 lattwood

Pending alloc example with logs from the Nomad agent. The timestamps are consistent with each other.

Screenshot 2024-06-12 at 10 54 36 AM
root@HOSTNAME:~# TZ=America/Halifax journalctl --unit nomad --since '14 hours ago' | grep -v '\(runner\)'
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"received EOF, stopping recv loop","@module":"client.driver_mgr.docker.docker_logger.stdio","@timestamp":"2024-06-12T02:28:35.213106Z","driver":"docker","err":"rpc error: code = Unavailable desc = error reading from server: EOF"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"info","@message":"plugin process exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-06-12T02:28:35.224019Z","driver":"docker","id":"2362096","plugin":"/usr/local/bin/nomad"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"plugin exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-06-12T02:28:35.225535Z","driver":"docker"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"image id reference count decremented","@module":"client.driver_mgr.docker","@timestamp":"2024-06-12T02:28:35.430987Z","driver":"docker","image_id":"sha256:31d1924d3c7e119a450ce9d0d2fbd417a1aba168781625acbc7bf77b398cee33","references":0}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"(watcher) stopping all views","@module":"agent","@timestamp":"2024-06-12T02:28:35.472850Z"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"info","@message":"stopped container","@module":"client.driver_mgr.docker","@timestamp":"2024-06-12T02:28:35.610787Z","container_id":"02ae65871d36198b058a7b98387c5b40bc857e7dfc8eb6bd6de60407e6f28305","driver":"docker"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"received EOF, stopping recv loop","@module":"client.driver_mgr.docker.docker_logger.stdio","@timestamp":"2024-06-12T02:28:35.626094Z","driver":"docker","err":"rpc error: code = Unavailable desc = error reading from server: EOF"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"error","@message":"error encountered while scanning stdout","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-06-12T02:28:35.631515Z","driver":"docker","error":"read |0: file already closed"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"info","@message":"plugin process exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-06-12T02:28:35.631600Z","driver":"docker","id":"2361837","plugin":"/usr/local/bin/nomad"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"plugin exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-06-12T02:28:35.632337Z","driver":"docker"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"image id reference count decremented","@module":"client.driver_mgr.docker","@timestamp":"2024-06-12T02:28:35.654011Z","driver":"docker","image_id":"sha256:6331aa0368327f0829316ca10b45844eb7303caf5f63cc979063df0ef7ffca01","references":0}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2024-06-12T02:28:35.669645Z","alloc_id":"221a2ab1-43ea-22bb-1616-24d9ce047609"}
Jun 11 23:31:40 HOSTNAME nomad[3215]: {"@level":"debug","@message":"cleanup removed downloaded image","@module":"client.driver_mgr.docker","@timestamp":"2024-06-12T02:31:40.406819Z","driver":"docker","image_id":"sha256:31d1924d3c7e119a450ce9d0d2fbd417a1aba168781625acbc7bf77b398cee33"}
Jun 11 23:31:41 HOSTNAME nomad[3215]: {"@level":"debug","@message":"cleanup removed downloaded image","@module":"client.driver_mgr.docker","@timestamp":"2024-06-12T02:31:41.201500Z","driver":"docker","image_id":"sha256:6331aa0368327f0829316ca10b45844eb7303caf5f63cc979063df0ef7ffca01"}

From what I can tell, all the logs from ~23:28:35 are due to the failed allocation, and there is nothing in the agent's logs about the pending allocation.

Generally speaking, forcibly killing Nomad (we have draining enabled) tends to fix these issues, but these state inconsistency bugs are terrifying.

lattwood avatar Jun 12 '24 13:06 lattwood

Hi @lattwood and thanks for raising this issue with the detail included. I'll add this to our backlog to investigate further. In the meantime, if you discover any other information or concrete reproduction, please let us know.

jrasell avatar Jun 12 '24 14:06 jrasell

More log spelunking: after this shows up in the logs on an agent/client, no further alloc updates occur, and the drain issue with the pending allocs starts occurring as well.

Jun 11 14:50:29 HOSTNAME nomad[2835]: {"@level":"debug","@message":"allocation updates applied","@module":"client","@timestamp":"2024-06-11T14:50:29.411617Z","added":0,"errors":0,"ignored":4,"removed":0,"updated":0}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"yamux: keepalive failed: i/o deadline reached","@module":"client","@timestamp":"2024-06-11T14:50:50.303589Z"}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"yamux: Failed to read stream data: read tcp [IPv6_ADDRESS_1]:42188-\u003e[IPv6_ADDRESS_2]:4647: use of closed network connection","@module":"client","@timestamp":"2024-06-11T14:50:50.303834Z"}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"error performing RPC to server","@module":"client.rpc","@timestamp":"2024-06-11T14:50:50.304920Z","error":"rpc error: EOF","rpc":"Node.UpdateAlloc","server":{"IP":"IPv6_ADDRESS_2","Port":4647,"Zone":""}}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"error performing RPC to server which is not safe to automatically retry","@module":"client.rpc","@timestamp":"2024-06-11T14:50:50.305307Z","error":"rpc error: EOF","rpc":"Node.UpdateAlloc","server":{"IP":"IPv6_ADDRESS_2","Port":4647,"Zone":""}}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"error updating allocations","@module":"client","@timestamp":"2024-06-11T14:50:50.305578Z","error":"rpc error: EOF"}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"error performing RPC to server","@module":"client.rpc","@timestamp":"2024-06-11T14:50:50.305401Z","error":"rpc error: EOF","rpc":"Alloc.GetAllocs","server":{"IP":"IPv6_ADDRESS_2","Port":4647,"Zone":""}}

lattwood avatar Jun 12 '24 15:06 lattwood

grabbed a goroutine stack dump and found a clue.

the same node's client goroutine is blocked here (in yamux's Stream.Read), and has been for 1489 minutes, which puts the start of the hang just after the "error performing RPC to server" messages in the above comment: https://github.com/hashicorp/yamux/blob/6c5a7317d6e3b6e7f85db696d6c83ed353e7cb4c/stream.go#L145-L153

Maybe timeouts aren't getting set on retries? idk. (A read-deadline sketch follows the stack dump below.)

edit: the deserialization methods for a jobspec are also evident in the stack trace.

Jun 12 15:40:35 HOSTNAME nomad[2835]: goroutine 254 gp=0xc0006888c0 m=nil [select, 1489 minutes]:
Jun 12 15:40:35 HOSTNAME nomad[2835]: runtime.gopark(0xc0024a8d00?, 0x2?, 0x25?, 0xa9?, 0xc0024a8c94?)
Jun 12 15:40:35 HOSTNAME nomad[2835]:         runtime/proc.go:402 +0xce fp=0xc0024a8b38 sp=0xc0024a8b18 pc=0x44630e
Jun 12 15:40:35 HOSTNAME nomad[2835]: runtime.selectgo(0xc0024a8d00, 0xc0024a8c90, 0xc003ed46f0?, 0x0, 0xc0024a8cd8?, 0x1)
Jun 12 15:40:35 HOSTNAME nomad[2835]:         runtime/select.go:327 +0x725 fp=0xc0024a8c58 sp=0xc0024a8b38 pc=0x458385
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/yamux.(*Stream).Read(0xc0032fa000, {0xc002cc1000, 0x1000, 0xc001cc0b00?})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/[email protected]/stream.go:145 +0x40e fp=0xc0024a8d50 sp=0xc0024a8c58 pc=0xdb2ace
Jun 12 15:40:35 HOSTNAME nomad[2835]: bufio.(*Reader).Read(0xc002081e00, {0xc001cc0b00, 0xb, 0xdf9db0?})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         bufio/bufio.go:241 +0x197 fp=0xc0024a8d88 sp=0xc0024a8d50 pc=0x6ca857
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/go-msgpack/codec.decReadFull({0x3970800, 0xc002081e00}, {0xc001cc0b00, 0xb, 0x40})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/[email protected]/codec/decode.go:3086 +0xa3 fp=0xc0024a8dd8 sp=0xc0024a8d88 pc=0xdfd303
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/go-msgpack/codec.(*ioDecReader).readb(0xc0035bde00, {0xc001cc0b00, 0xb, 0x7ed07119d5b8?})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/[email protected]/codec/decode.go:422 +0x4d fp=0xc0024a8e28 sp=0xc0024a8dd8 pc=0xdf1aed
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/go-msgpack/codec.(*decReaderSwitch).readbIO(0xc002081e00?, {0xc001cc0b00?, 0x0?, 0x17?})
...
elided for brevity
...
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).rpc(0xc00091f888, {0x3112c54, 0xf}, {0x2f9c6c0, 0xc000485d90}, {0x2c6b860, 0xc0008f1860})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/rpc.go:104 +0x23e fp=0xc0024a9988 sp=0xc0024a9818 pc=0x1dcdb9e
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).RPC(...)
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/rpc.go:61
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).rpc(0xc00091f888, {0x3112c54, 0xf}, {0x2f9c6c0, 0xc000485d90}, {0x2c6b860, 0xc0008f1860})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/rpc.go:154 +0xb2d fp=0xc0024a9af8 sp=0xc0024a9988 pc=0x1dce48d
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).RPC(...)
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/rpc.go:61
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).watchAllocations(0xc00091f888, 0xc0008f1740)
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/client.go:2395 +0x96e fp=0xc0024a9fc0 sp=0xc0024a9af8 pc=0x1dacf6e
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).run.gowrap1()
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/client.go:1850 +0x25 fp=0xc0024a9fe0 sp=0xc0024a9fc0 pc=0x1da9d65
Jun 12 15:40:35 HOSTNAME nomad[2835]: runtime.goexit({})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         runtime/asm_amd64.s:1695 +0x1 fp=0xc0024a9fe8 sp=0xc0024a9fe0 pc=0x47ebe1
Jun 12 15:40:35 HOSTNAME nomad[2835]: created by github.com/hashicorp/nomad/client.(*Client).run in goroutine 247
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/client.go:1850 +0x86
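
To make the "timeouts aren't getting set" theory concrete: yamux streams implement net.Conn, so a Read with no deadline parks in exactly the select shown above until data or a reset arrives. Below is a self-contained, illustrative sketch (not Nomad code) showing the difference a read deadline makes.

package main

// Demonstrates a yamux stream Read blocking when the peer goes silent,
// and how a read deadline turns that into a timeout error instead.
// Illustrative only; not taken from the Nomad client.

import (
	"fmt"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func main() {
	// In-memory connection pair with a yamux session on each end.
	clientConn, serverConn := net.Pipe()
	server, _ := yamux.Server(serverConn, nil)
	client, _ := yamux.Client(clientConn, nil)

	// Server side: accept the stream and then never write to it,
	// mimicking a peer that has gone quiet mid-RPC.
	go func() {
		if _, err := server.AcceptStream(); err == nil {
			select {} // hold the stream open, send nothing
		}
	}()

	stream, err := client.OpenStream()
	if err != nil {
		panic(err)
	}

	// Without this deadline, Read blocks in yamux's stream.go select loop
	// indefinitely -- the shape seen in the goroutine dump above. With it,
	// Read returns yamux's timeout error ("i/o deadline reached").
	_ = stream.SetReadDeadline(time.Now().Add(2 * time.Second))

	buf := make([]byte, 16)
	_, err = stream.Read(buf)
	fmt.Println("read returned:", err)
}

This isn't a claim about where the real fix belongs; it just shows why a missing deadline leaves a goroutine like watchAllocations parked indefinitely.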

lattwood avatar Jun 12 '24 16:06 lattwood

@jrasell I found the bug, it was in yamux. PR: https://github.com/hashicorp/yamux/pull/127

lattwood avatar Jun 13 '24 17:06 lattwood

I'm suspecting this yamux bug is the cause of the whole pending-allocations problem.

lattwood avatar Jun 13 '24 18:06 lattwood

Hi @lattwood, amazing work and thanks so much for digging into this and raising a PR. I'll poke around internally and see who I need to get as reviewers on the PR.

jrasell avatar Jun 14 '24 06:06 jrasell

@jrasell good stuff. i'm likely going to be cutting a 1.7.7-dev or 1.7.8 build with the yamux change and rolling it out on our side today, once i change the return to a goto.

lattwood avatar Jun 14 '24 12:06 lattwood

Just added some more bug/correctness fixes to the PR.

lattwood avatar Jun 14 '24 14:06 lattwood

https://github.com/lattwood/nomad/releases/tag/v1.7.8-yamux-fork

I threw a build up there with the yamux fix.

lattwood avatar Jun 17 '24 17:06 lattwood

Confirmed that hashicorp/yamux#127 fixes the issue.

We'll be running a forked Nomad build until there's a release of Nomad with the yamux fix in it.

edit: idk if I'm out of line for saying this, but the yamux fix probably should be backported everywhere it's in use 😅

lattwood avatar Jun 20 '24 15:06 lattwood

Did you have TLS configured for Nomad's RPC?

schmichael avatar Aug 14 '24 20:08 schmichael

@schmichael yup. (our nomad agents are distributed around the world)

lattwood avatar Aug 14 '24 20:08 lattwood

Just saw another issue opened on yamux this week that could be responsible for what we're seeing here as well- https://github.com/hashicorp/yamux/issues/133

lattwood avatar Aug 21 '24 14:08 lattwood

hmm, i don't think so? heh.

On Thu, Sep 5, 2024 at 3:05 PM Michael Schurter wrote:

Closed #23305 (https://github.com/hashicorp/nomad/issues/23305) as completed via hashicorp/yamux#127 (https://github.com/hashicorp/yamux/pull/127).

lattwood avatar Sep 05 '24 18:09 lattwood

hmm, i don't think so? heh.

:facepalm: Sorry. Evidently GitHub treats "Should fix #..." as meaning "Absolutely does fix #...". Reopening.

schmichael avatar Sep 05 '24 18:09 schmichael

In the context of this issue: https://github.com/hashicorp/nomad/issues/23848

would it be worthwhile for us to try dropping in the libp2p yamux fork?

lattwood avatar Sep 09 '24 14:09 lattwood

@schmichael

I think the hung Read is a symptom, not the cause. I think the cause is a timed-out write tied to the keepalive interval on the underlying session.

https://github.com/hashicorp/yamux/issues/142

lattwood avatar Jan 04 '25 19:01 lattwood

Fix opened- hashicorp/yamux#143

lattwood avatar Jan 14 '25 18:01 lattwood

current diff I've compiled into a nomad binary and will be deploying shortly- https://github.com/lattwood/net-rpc-msgpackrpc/compare/master...lattwood:net-rpc-msgpackrpc:close_on_server_write_error

lattwood avatar Feb 06 '25 19:02 lattwood

Keeping everyone updated- I've cut a custom binary and deployed it, branching from 1.9.3 with the fix.

lattwood avatar Feb 06 '25 21:02 lattwood

Writing PRs! It works and solved the issue!

Also found three other issues I'll elaborate on here once those two PRs are open.

Two of the issues made me think we didn't fix it- one of those issues is in Nomad/Consul Template, the other in Vault.

The other issue is a correctness issue and involves both yamux and net-rpc-msgpackrpc.

lattwood avatar Feb 07 '25 17:02 lattwood

Discovered a few more issues, only one of which (number 2) I feel a pressing need to open an issue for.

Issue 1

When we were rolling the new binary out yesterday, we had Vault somehow drop one out of > 10k lease renewals from when Nomad adopted the existing running containers. We have audit log entries for the lease renewals, and a lease that was supposedly renewed last night with a TTL of 3 days expired this morning.

I am not filing a bug report for this for two reasons: it was one out of > 10k, and we're currently running Vault 1.17.2 which is not the latest OSS version.

Issue 2 (related to above)

When the above happened, the consul-template integrated into Nomad did the dumbest thing possible: it kept trying to re-use the expired token. (This also gave me a mini heart attack this morning, because when this happens the allocation stays in the Pending state until Nomad successfully renders the template; in other words, it briefly made me think the msgpack issue wasn't resolved.)

(view) vault.read(aws/sts/REDACTED): vault.read(aws/sts/REDACTED): Error making API request.

URL: GET https://REDACTED:8200/v1/aws/sts/REDACTED
Code: 403. Errors:

* 2 errors occurred:
	* permission denied
	* invalid token

 (retry attempt 492 after "1m0s")

I can't think of any reason to retry that many times when Vault is telling you the token is bad, and I can't think of any reason you should be able to configure it to do this. Preferably consul-template should fail the allocation so Nomad creates a new allocation that gets a fresh token. This was on a system job that wasn't being updated.

I would like to open an issue for this one if there isn't one already, just not sure if it should be on Nomad or Consul-Template.

Issue 3

When I was making the change in hashicorp/net-rpc-msgpackrpc#15 I noticed that MsgpackCodec has a mutex for writing, but Go's gobServerCodec only has one for generating sequence numbers, and the same goes for jsonrpc. The client codec has one for sending things out on the wire too, but it's mostly about preserving internal state rather than mediating access to the connection.

The godoc for func net/rpc.NewClient(conn io.ReadWriteCloser) *Client says each half may be accessed concurrently, so the implementation of conn should protect against concurrent reads or concurrent writes. As a part of digging into this issue, we discovered yamux is not safe for concurrent reads.

So, I think the mutex in MsgpackCodec exists for the purposes of providing the safe concurrent access guarantee, working around yamux's lack of concurrency safety.

It would be dope to get that fixed, but that involves rewriting yamux to queue pending ops I think, instead of having them all block and randomly be awoken when data comes in for a given connection, and that's not exactly a small lift. (I'd love to open an issue for this but I'm very realistic about the prospects of it getting fixed)
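
To make the mutex question concrete, here's a small sketch of the guarantee being discussed: a net/rpc server codec whose writes to the shared connection are serialized by a mutex, so concurrently completing RPCs can't interleave frames on a transport that isn't safe for concurrent writers. The type and helper names are illustrative, not the actual net-rpc-msgpackrpc implementation.

package main

import (
	"io"
	"net/rpc"
	"sync"

	"github.com/hashicorp/go-msgpack/codec"
)

// lockedServerCodec is an illustrative stand-in: net/rpc drives reads from a
// single goroutine, but many service goroutines may call WriteResponse
// concurrently, so the write path is the part that needs a mutex when the
// underlying conn (e.g. a yamux stream) must not see interleaved writers.
type lockedServerCodec struct {
	wlock sync.Mutex
	conn  io.ReadWriteCloser
	enc   *codec.Encoder
	dec   *codec.Decoder
}

func newLockedServerCodec(conn io.ReadWriteCloser) *lockedServerCodec {
	mh := &codec.MsgpackHandle{}
	return &lockedServerCodec{
		conn: conn,
		enc:  codec.NewEncoder(conn, mh),
		dec:  codec.NewDecoder(conn, mh),
	}
}

// Reads: called sequentially by net/rpc's single read loop, so no lock.
func (c *lockedServerCodec) ReadRequestHeader(r *rpc.Request) error { return c.dec.Decode(r) }

func (c *lockedServerCodec) ReadRequestBody(v interface{}) error {
	if v == nil {
		// net/rpc passes nil to discard a body; decode into a throwaway value.
		var discard interface{}
		return c.dec.Decode(&discard)
	}
	return c.dec.Decode(v)
}

// Writes: may be called from many goroutines at once, hence the mutex.
func (c *lockedServerCodec) WriteResponse(r *rpc.Response, v interface{}) error {
	c.wlock.Lock()
	defer c.wlock.Unlock()
	if err := c.enc.Encode(r); err != nil {
		return err
	}
	return c.enc.Encode(v)
}

func (c *lockedServerCodec) Close() error { return c.conn.Close() }

var _ rpc.ServerCodec = (*lockedServerCodec)(nil)

func main() {} // compiles standalone; would be used via rpc.ServeCodec(newLockedServerCodec(conn))

The sketch only shows what the lock is protecting; whether it's strictly necessary depends on how the transport multiplexes streams.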

lattwood avatar Feb 07 '25 18:02 lattwood

When we were rolling the new binary out yesterday, we had Vault somehow drop one out of > 10k lease renewals from when Nomad adopted the existing running containers. We have audit log entries for the lease renewals, and a lease that was supposedly renewed last night with a TTL of 3 days expired this morning.

Weird, for sure. Could be worth hunting down if we had logs, but if it's the legacy token workflow and not WI, we're unlikely to sink much time into it.

When the above happened, the consul-template integrated into Nomad did the dumbest thing possible. It kept trying to re-use the expired token.

That smells like an issue where the Vault error is being treated as non-fatal (ex. https://github.com/hashicorp/nomad/pull/24409 and https://github.com/hashicorp/nomad/pull/25040), except that this error message you're getting is internal to CT. What does your vault_retry configuration look like? If you don't have retries set to unlimited, then that's likely a CT bug.

So, I think the mutex in MsgpackCodec exists for the purposes of providing the safe concurrent access guarantee, working around yamux's lack of concurrency safety. ... It would be dope to get that fixed,

I'm not sure I understand what you're proposing is the issue here. Yamux streams are serviced by a single goroutine in the caller.

tgross avatar Feb 07 '25 20:02 tgross

vault_retry isn't set on the template blocks in the job spec.

Issue 3 was more of a smell, I think; after reading through more code, any of the potential issues are dealt with by the fact that yamux handles the stream separation.

lattwood avatar Feb 07 '25 21:02 lattwood

I can say with certainty, that hashicorp/net-rpc-msgpackrpc#15 fixes the main issue at hand.

Any chance of this landing in a 1.9.x release so getting this fixed isn't tied to migration from Vault Tokens to Workload Identity?

lattwood avatar Feb 10 '25 14:02 lattwood

vault_retry isn't set on the template blocks in the job spec.

It's a client agent configuration value. But if unset, there's a reasonable default for it (attempts = 12, backoff = "250ms", max_backoff = "1m"). You can find the final configuration we pass to CT in a log line like:

2025-02-10T16:24:29.990-0500 [DEBUG] agent: (runner) final config: [[giant blob of single-line JSON]]

Issue 3 was more of a smell I think and after reading through more code, any of the potential issues are dealt with by the fact yamux handles the stream separation.

👍

I can say with certainty, that https://github.com/hashicorp/net-rpc-msgpackrpc/pull/15 fixes the main issue at hand.

Great to hear!

Any chance of this landing in a 1.9.x release so getting this fixed isn't tied to migration from Vault Tokens to Workload Identity?

Yes! It's missed the boat on the final testing for the release we expect to ship tomorrow, but it'll land in the next one for sure. The PR that pulls the dependency in will close this issue.

tgross avatar Feb 10 '25 21:02 tgross