wait_until_cluster_is_ready not timing out on start task failure

Open AndreiPopescuSK opened this issue 7 years ago • 1 comments

Hello,

I'm using the SDK (v0.8.0) to spin up an AZTK cluster. I'm also using a custom docker image, and in one instance I forgot to pass the docker registry credentials, which led to all node start tasks failing.

I would expect that in this case wait_until_cluster_is_ready would either time out after failing to bring up a master node within WAIT_FOR_MASTER_TIMEOUT seconds, or notice that the master start task failed. Unfortunately, neither happens, and cluster spin-up hangs indefinitely.

Presumably this is because this loop never terminates, as this line is always run. Maybe if the master start task fails, a master_node_id is never given to the cluster, so it gets stuck there?

Any idea if this is the case? Thank you for the help.

AndreiPopescuSK, Jun 28 '18 12:06

You are correct: if all start tasks fail early enough, a master will never be elected (so no master_node_id will be set), and that loop will hang. I think the best solution here might be to check whether all nodes have entered StartTaskFailed, and exit if so. Adding a timeout is another good option.
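As a rough illustration of both options combined, here is a minimal sketch of a wait loop that exits on either condition. It is not the AZTK implementation: `NodeStatus`, `ClusterSnapshot`, `ClusterStartError`, `wait_for_master`, the `get_cluster` callable, and the `START_TASK_FAILED` state string are all hypothetical names made up for this example. Only master_node_id and the StartTaskFailed state come from the discussion above.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional

# Placeholder state string (assumption); substitute whatever value the
# real SDK reports for a node whose start task has failed.
START_TASK_FAILED = "starttaskfailed"


@dataclass
class NodeStatus:
    """Hypothetical view of a single node's state."""
    node_id: str
    state: str


@dataclass
class ClusterSnapshot:
    """Hypothetical view of the cluster at one point in time."""
    master_node_id: Optional[str]
    nodes: List[NodeStatus]


class ClusterStartError(Exception):
    """Raised when no master can ever be elected."""


def wait_for_master(
    get_cluster: Callable[[], ClusterSnapshot],
    timeout_seconds: float,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until a master is elected, every node has failed, or time runs out."""
    deadline = clock() + timeout_seconds
    while True:
        cluster = get_cluster()

        # Success: a master has been elected.
        if cluster.master_node_id is not None:
            return cluster.master_node_id

        # Early exit: every node failed its start task, so waiting is pointless.
        if cluster.nodes and all(
            node.state == START_TASK_FAILED for node in cluster.nodes
        ):
            raise ClusterStartError(
                "start task failed on every node; no master can be elected"
            )

        # Safety net: give up once the deadline has passed.
        if clock() >= deadline:
            raise TimeoutError(
                f"no master elected within {timeout_seconds} seconds"
            )

        sleep(poll_interval)
```

The all-nodes-failed check gives a fast, descriptive failure for cases like the missing registry credentials, while the deadline covers any other situation in which master_node_id is never set. The `sleep` and `clock` parameters are injectable only so the loop can be exercised without real waiting.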

Thanks for pointing this out!

jafreck, Jun 28 '18 18:06