Handle AWS termination notice for spot instances
Motivation
Respond to spot instance terminations more gracefully, i.e. prevent failed requests when traffic is supposed to migrate from the terminating instance to a healthy one.
Questions
- What is the current behavior, and what would this achieve that's better? Does the cluster autoscaler help with this at all?
Description
- https://github.com/aws/aws-node-termination-handler
- https://itnext.io/the-definitive-guide-to-running-ec2-spot-instances-as-kubernetes-worker-nodes-68ef2095e767
Edit (Research)
Some relevant articles here:
- https://aws.amazon.com/blogs/compute/best-practices-for-handling-ec2-spot-instance-interruptions/
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html#spot-instance-termination-notices
- https://docs.aws.amazon.com/autoscaling/ec2/userguide/healthcheck.html
- https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#use-kubectl-drain-to-remove-a-node-from-service
If we add aws-node-termination-handler and have it kubectl drain the node upon notice, then I think the serving container will react to that by rejecting the requests currently in the queue and letting those that are still being processed finish. For testing, killing/terminating the instance might not be the best way to exercise this - instead, we need a way of reproducing the termination notice that AWS emits.
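A minimal sketch of that setup, assuming the eks-charts Helm repo and the chart's value names (the handler runs as a DaemonSet, watches instance metadata for an interruption notice, and cordons/drains the node):

```sh
# Add the repo that hosts aws-node-termination-handler and install it
# into kube-system (chart and value names assumed from eks-charts).
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-node-termination-handler eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true

# The drain the handler performs is roughly equivalent to running:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=120
```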
With https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-conn-drain.html and the kubectl drain procedure we might be able to transition gracefully to a healthy instance. It looks like the back-end connection draining timeout defaults to 300 seconds before the ELB kills the requests headed to the de-registering instance; we'd probably want to set that to 120 seconds to match the termination notice period.
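If we go the classic ELB route, the draining timeout from that doc can be set with the AWS CLI; a sketch, with a placeholder load balancer name:

```sh
# Enable connection draining on the classic ELB and cap it at 120s to match
# the 2-minute spot termination notice ("my-api-elb" is a placeholder).
aws elb modify-load-balancer-attributes \
  --load-balancer-name my-api-elb \
  --load-balancer-attributes '{"ConnectionDraining":{"Enabled":true,"Timeout":120}}'
```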
I just tested this and I observe that my requests time out when a spot instance node goes down. For testing I use locust to put load on the instance and AWS Fault Injection Simulator to provoke the spot termination. Maybe AWS's spot termination handler can fix the issue.
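For what it's worth, whether the interruption notice actually reached the instance can be checked from inside it via the spot instance-action metadata endpoint (the same endpoint the termination handler polls); with IMDSv2:

```sh
# Fetch an IMDSv2 token, then query the spot instance-action endpoint.
# It returns 404 normally and a small JSON document (action + time) once
# a termination/stop notice has been issued for the instance.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action
```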
As a workaround I had some success by patching the VirtualService for the API to use a retry policy:
{
  "match": [
    {
      "uri": {
        "prefix": "/myapi/"
      }
    }
  ],
  "retries": {
    "attempts": 2,
    "retryOn": "gateway-error,connect-failure,refused-stream,reset",
    "perTryTimeout": "10s"
  },
  "rewrite": {
    "uri": "/"
  }
}
https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPRetry
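For reference, this is roughly how such a retry policy can be added with a JSON patch; the VirtualService name, namespace, and route index are placeholders for whatever the API's actual resources are:

```sh
# Add the retry policy to the first HTTP route of the API's VirtualService
# ("myapi" and "default" are placeholders).
kubectl -n default patch virtualservice myapi --type=json -p '[
  {
    "op": "add",
    "path": "/spec/http/0/retries",
    "value": {
      "attempts": 2,
      "retryOn": "gateway-error,connect-failure,refused-stream,reset",
      "perTryTimeout": "10s"
    }
  }
]'
```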