Node group is un-schedulable if a previous attempt to add it failed (even if a subsequent addition succeeded)
Description
If adding a node group to the cluster fails, and a subsequent cortex cluster configure run on the same node group succeeds, the pods that need to run on that node group will still be un-schedulable.
How to reproduce
Try adding a node group "abc" to the cluster and interrupt it with CTRL-C while it is being added (while the eksctl command is running).
Next, re-run cortex cluster configure with the same cluster config and let the command finish this time.
Next, run cortex cluster info and note that the node group appears to be healthy.
Add the node_groups: ["abc"] field to an API (say, a realtime API) to force it onto that specific node group, then deploy the API. You will notice that:
- The pod will never get scheduled.
- No node from the said node group will be added to the cluster (assuming it starts with 0 instances).
Quick solution
The quick solution is to remove the node group from the cluster and then re-add it. Both steps can be done with the cortex cluster configure command.
Better solution
When cortex cluster configure has a new node group to add to the cluster, it should first remove that node group with eksctl. This will not fail: if eksctl has nothing to delete, it still exits with an exit code of 0. If a leftover node group from a failed attempt does exist, it is removed and then added back.
This requires modifications to the install.sh script.
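A minimal sketch of the delete-then-create idea, not the actual install.sh change. The function name recreate_nodegroup and its two arguments are hypothetical, and the node group name that eksctl sees may differ from the name used in the Cortex cluster config. The sketch assumes an eksctl ClusterConfig file is available and that the installed eksctl version supports the --include, --approve and --wait flags on delete nodegroup.

```shell
#!/usr/bin/env bash
# Sketch only: recreate_nodegroup and its arguments are hypothetical
# and are not taken from install.sh.

# Delete a node group, then create it again from the cluster config file.
# Per the behaviour described in this issue, the delete step is expected
# to exit with 0 even when there is nothing to delete.
recreate_nodegroup() {
  local config_file="$1"   # path to an eksctl ClusterConfig file
  local nodegroup="$2"     # node group name as eksctl knows it

  # Remove any leftover from a previously interrupted addition and wait
  # for the deletion to finish before creating the node group again.
  eksctl delete nodegroup \
    --config-file="$config_file" \
    --include="$nodegroup" \
    --approve \
    --wait || return 1

  # Add the node group back from a clean state.
  eksctl create nodegroup \
    --config-file="$config_file" \
    --include="$nodegroup"
}

# Example (hypothetical file and node group names):
#   recreate_nodegroup eks.yaml abc
```

Waiting for the delete to complete is a design choice in this sketch: without it, the create step could start while the old node group is still being torn down.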