
Node group is unschedulable if a previous attempt to add it failed (even if a subsequent addition succeeded)

Open RobertLucian opened this issue 3 years ago • 0 comments

Description

If an attempt to add a node group to the cluster fails, and a subsequent cortex cluster configure on the same node group succeeds, the pods that need to run on that node group will still be unschedulable.

How to reproduce

Try adding a node group "abc" to the cluster and interrupt the command with CTRL-C while the node group is being added (while the eksctl command is running).

Next, re-run cortex cluster configure with the same cluster config and let the command finish this time.

Next, run cortex cluster info and observe that the node group appears to be healthy.

Add the node_groups: ["abc"] field to an API to force it to use that specific node group (say a realtime API) and deploy the API. You will notice that:

  1. The pod will never get scheduled.
  2. No node from that node group will be added to the cluster (assuming the node group starts with 0 instances).

Quick solution

The quick solution is to remove and re-add the node group from the cluster. This can be done using the cortex cluster configure command.

Better solution

When cortex cluster configure needs to add a new node group to the cluster, first delete that node group with eksctl and then add it. The delete step will not fail when the node group doesn't exist: if eksctl has nothing to delete, it still exits with an exit code of 0. If the node group already exists (for example, left over from an interrupted attempt), it is removed and then added back.

This requires modifications to the install.sh script.
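A minimal sketch of the delete-then-create idea, not the actual contents of install.sh. The function name recreate_node_group, the EKSCTL override variable, and the argument layout are hypothetical. The sketch also assumes the node group is defined in an eksctl config file and that the name passed in matches the name eksctl knows it by, which may differ from the name used in the Cortex cluster config.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating "delete first, then add".
# EKSCTL can be pointed at a stub for dry runs; it defaults to the real binary.
EKSCTL="${EKSCTL:-eksctl}"

recreate_node_group() {
  local config_file="$1"   # eksctl cluster config file (assumed to define the node group)
  local node_group="$2"    # node group name as known to eksctl (assumption)

  # Remove any leftover node group from an earlier, interrupted attempt.
  # Per the issue, eksctl exits with 0 when there is nothing to delete.
  # --wait is assumed to block until deletion completes; verify this
  # against the eksctl version in use.
  "$EKSCTL" delete nodegroup --config-file="$config_file" \
    --include="$node_group" --approve --wait || return 1

  # Create the node group from a clean state.
  "$EKSCTL" create nodegroup --config-file="$config_file" \
    --include="$node_group"
}
```

For example, recreate_node_group cluster.yaml abc would run the delete step and, only if it succeeds, the create step. Stopping when the delete fails avoids creating a node group on top of a partially deleted one.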

RobertLucian, Jun 17 '21 20:06