Handle AWS termination notice for spot instances
Motivation
Respond to spot instance terminations more gracefully, i.e. prevent failed requests when traffic is supposed to migrate from the terminating instance to a healthy one.
Questions
- What is the current behavior, and what would this achieve that's better? Does the cluster autoscaler help with this at all?
Description
- https://github.com/aws/aws-node-termination-handler
- https://itnext.io/the-definitive-guide-to-running-ec2-spot-instances-as-kubernetes-worker-nodes-68ef2095e767
Edit (Research)
Some relevant articles here:
- https://aws.amazon.com/blogs/compute/best-practices-for-handling-ec2-spot-instance-interruptions/
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html#spot-instance-termination-notices
- https://docs.aws.amazon.com/autoscaling/ec2/userguide/healthcheck.html
- https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#use-kubectl-drain-to-remove-a-node-from-service
If we add aws-node-termination-handler and have it kubectl drain the node upon notice, then I think the serving container will react to that by rejecting the requests currently in the queue and letting those that are still being processed finish. For testing, killing/terminating the instance might not be the best way to exercise this - instead, we need a way of reproducing the termination notice that AWS emits.
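A minimal sketch of that setup, assuming the eks-charts Helm repo and the chart's value names (the handler runs as a DaemonSet, watches instance metadata for an interruption notice, and cordons/drains the node):

```sh
# Add the repo that hosts aws-node-termination-handler and install it
# into kube-system (chart and value names assumed from eks-charts).
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-node-termination-handler eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true

# The drain the handler performs is roughly equivalent to running:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=120
```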
With https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-conn-drain.html and the kubectl drain procedure we might be able to transition gracefully to a healthy instance. It looks like the back-end connection draining timeout defaults to 300 seconds before the ELB kills the requests headed to the de-registering instance; we'd probably want to set that to 120 seconds to match the termination notice period.
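If we go the classic ELB route, the draining timeout from that doc can be set with the AWS CLI; a sketch, with a placeholder load balancer name:

```sh
# Enable connection draining on the classic ELB and cap it at 120s to match
# the 2-minute spot termination notice ("my-api-elb" is a placeholder).
aws elb modify-load-balancer-attributes \
  --load-balancer-name my-api-elb \
  --load-balancer-attributes '{"ConnectionDraining":{"Enabled":true,"Timeout":120}}'
```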
I just tested this and I observe that my requests time out when a spot instance node goes down. For testing I use locust to put load on the instance and AWS Fault Injection Simulator to provoke the spot termination. Maybe AWS's spot termination handler can fix the issue.
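For what it's worth, whether the interruption notice actually reached the instance can be checked from inside it via the spot instance-action metadata endpoint (the same endpoint the termination handler polls); with IMDSv2:

```sh
# Fetch an IMDSv2 token, then query the spot instance-action endpoint.
# It returns 404 normally and a small JSON document (action + time) once
# a termination/stop notice has been issued for the instance.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action
```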
As a workaround I had some success by patching the VirtualService for the API to use a retry policy:
{
  "match": [
    {
      "uri": {
        "prefix": "/myapi/"
      }
    }
  ],
  "retries": {
    "attempts": 2,
    "retryOn": "gateway-error,connect-failure,refused-stream,reset",
    "perTryTimeout": "10s"
  },
  "rewrite": {
    "uri": "/"
  }
}
https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPRetry
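For reference, this is roughly how such a retry policy can be added with a JSON patch; the VirtualService name, namespace, and route index are placeholders for whatever the API's actual resources are:

```sh
# Add the retry policy to the first HTTP route of the API's VirtualService
# ("myapi" and "default" are placeholders).
kubectl -n default patch virtualservice myapi --type=json -p '[
  {
    "op": "add",
    "path": "/spec/http/0/retries",
    "value": {
      "attempts": 2,
      "retryOn": "gateway-error,connect-failure,refused-stream,reset",
      "perTryTimeout": "10s"
    }
  }
]'
```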