Logan Attwood comments

Results: 55 comments by Logan Attwood

pending allocations stuck in pending state after adoption by a new deployment

in trying to kick the broken allocs, I've managed to get Nomad to try sending an interrupt to a pending allocation! edit: had to `killall -9 nomad` on ca11.

pending allocations stuck in pending state after adoption by a new deployment

I figured out how to make it worse! If I drain the node and mark it as ineligible, then re-enable eligibility, all of the system jobs end up with additional...

pending allocations stuck in pending state after adoption by a new deployment

Just adding for additional flavour/I found this hilarious-

pending allocations stuck in pending state after adoption by a new deployment

Pending alloc example with logs from the Nomad Agent. Times are all accurate to each other. ``` root@HOSTNAME:~# TZ=America/Halifax journalctl --unit nomad --since '14 hours ago' | grep -v '\(runner\)'...

pending allocations stuck in pending state after adoption by a new deployment

More log spelunking: after this shows up in the logs on an agent/client, no further alloc updates occur, and the drain issue with the pending allocs starts occurring too....

pending allocations stuck in pending state after adoption by a new deployment

grabbed a goroutine stack dump and found a clue. the same node is blocked here, and was for 1489 minutes, which ends up being just after the "error performing RPC...

pending allocations stuck in pending state after adoption by a new deployment

@jrasell I found the bug, it was in yamux. PR: https://github.com/hashicorp/yamux/pull/127

pending allocations stuck in pending state after adoption by a new deployment

I'm suspecting this bug is caused by the whole pending allocations

pending allocations stuck in pending state after adoption by a new deployment

@jrasell good stuff. i'm likely going to be cutting a 1.7.7-dev or 1.7.8 build with the yamux change and rolling it out on our side today, once i change the...

pending allocations stuck in pending state after adoption by a new deployment

Just added some more bug/correctness fixes to the PR.