
Nodes never recover from "unusable" state

Open: jaley opened this issue 7 years ago • 1 comment

In long-running jobs, I often see low-priority nodes pre-empted and then return later, as expected. Usually they come back up fine as fresh nodes, but it's not uncommon for a node to get stuck in the "Starting" state for an extended period and then finally switch to "Unusable", where it remains until it is pre-empted again or shut down at the end of the job.

Once nodes are in the unusable state, it's impossible to debug what's going on: no logs are accessible and none of the console features for interacting with the node work. It's not even possible to trigger a manual reboot.

jaley avatar May 13 '18 12:05 jaley

Just wanted to provide you with an update here. Preemption is expected; however, going from "Starting" straight to "Unusable" is not. Unusable is the worst-case state for the Batch service: it means the service has no way of communicating with the node, so debugging in this scenario is essentially not possible client-side.

I'm engaging the service team to see if this is a known issue. I will keep you updated on that. However, on the aztk side of things, there is nothing that can be done as the node allocation is a Batch service responsibility.

jafreck avatar May 15 '18 19:05 jafreck
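
Since an unusable node cannot be debugged or rebooted from the client, about the only thing a client can do is notice the condition early. The following is a minimal sketch of that check, not part of aztk or the Batch SDK. It assumes node records have already been fetched by some other means and turned into plain dictionaries; the field names `id`, `state` and `state_since`, the function name `find_stuck_nodes`, and the 30-minute threshold are all hypothetical choices made for illustration.

```python
from datetime import datetime, timedelta, timezone


def find_stuck_nodes(nodes, now, max_starting=timedelta(minutes=30)):
    """Return ids of nodes that are unusable or have been starting too long.

    `nodes` is an iterable of dicts with hypothetical keys:
      "id"          -> node identifier (str)
      "state"       -> lower-case state name, e.g. "starting", "unusable"
      "state_since" -> timezone-aware datetime of the last state change
    """
    stuck = []
    for node in nodes:
        time_in_state = now - node["state_since"]
        if node["state"] == "unusable":
            # The service cannot reach this node at all.
            stuck.append(node["id"])
        elif node["state"] == "starting" and time_in_state > max_starting:
            # Likely on its way to "unusable", as described in the report.
            stuck.append(node["id"])
    return stuck


if __name__ == "__main__":
    now = datetime(2018, 5, 13, 12, 0, tzinfo=timezone.utc)
    sample = [
        {"id": "node-a", "state": "idle", "state_since": now - timedelta(hours=2)},
        {"id": "node-b", "state": "starting", "state_since": now - timedelta(minutes=45)},
        {"id": "node-c", "state": "unusable", "state_since": now - timedelta(minutes=5)},
        {"id": "node-d", "state": "starting", "state_since": now - timedelta(minutes=3)},
    ]
    print(find_stuck_nodes(sample, now))
```

What to do with the flagged ids (for example, removing those nodes from the pool so replacements are allocated) is left out here, because it depends on the client library in use and on whether the service accepts operations on a node it cannot reach.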