terraform-aws-alternat

"Reset" to NAT instances after failover

Open dan-greene-brivo opened this issue 1 year ago • 4 comments

I'm putting this here to see if there's any interest in adding in the ability to "fall back" to the NAT instances after a failover due to curl failure. Or am I missing something that will set it back automatically?

I'm working on the code anyway, so I'm happy to make a PR if you think it's useful.

Right now, my first thought is to update the connection check lambdas so that, the 1st time through, they check the route table and, if it's set to a NAT Gateway, change it to the NAT instance just before the first check. If the instance is still down, the route will immediately be changed back. Effective, but it will cause a connectivity blip every minute while failed over to the NAT Gateway.

Option 2 is to have a separate lambda on a separate schedule (maybe every 15 minutes by default, or only on demand?) that, if the route tables are using NAT Gateways, runs an "Instance Refresh" on the ASG, forcing it to re-create the instances. In theory, we could terminate the instances instead, and the ASG would do its thing as well.
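A rough sketch of what Option 2 could look like, not a finished implementation. The function names and parameters are hypothetical, and the `ec2` and `autoscaling` arguments are assumed to be boto3 clients (or anything exposing `describe_route_tables` and `start_instance_refresh`):

```python
from typing import Any, List


def default_route_uses_nat_gateway(route_table: dict) -> bool:
    """Return True if the 0.0.0.0/0 route of one route table targets a NAT Gateway."""
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") == "0.0.0.0/0":
            return "NatGatewayId" in route
    return False


def reset_if_failed_over(
    ec2: Any, autoscaling: Any, route_table_ids: List[str], asg_name: str
) -> bool:
    """Start an ASG instance refresh if any watched route table has failed over.

    Returns True when a refresh was requested, False when nothing was done.
    """
    resp = ec2.describe_route_tables(RouteTableIds=route_table_ids)
    failed_over = any(
        default_route_uses_nat_gateway(rt) for rt in resp["RouteTables"]
    )
    if failed_over:
        # Assumption: replacement instances claim the route at boot.
        autoscaling.start_instance_refresh(AutoScalingGroupName=asg_name)
    return failed_over
```

In a Lambda handler this would be called with `boto3.client("ec2")` and `boto3.client("autoscaling")`. Nothing here stops it from refreshing on every run while the route stays on the NAT Gateway.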

Thoughts?

dan-greene-brivo Apr 19 '24 20:04

If I understand correctly, what you're proposing is the following:

  1. NAT instance fails connectivity checks for some reason.
  2. Connectivity checker Lambda notices the failure and replaces the route to go through the NAT gateway.
  3. Now the NAT instance is sitting around doing nothing.
  4. Some time later, the NAT instance is able to connect again.
  5. There should be a process to automatically switch back to the NAT instance.

Did I understand correctly?

If so, the first challenge is how to know that the NAT instance has connectivity again. The route table now points to the NAT gateway. You'd need either:

  1. Another, different route table that points to the NAT instance. Have a Lambda that is in a subnet that uses this route table. Have it checking connectivity. If connectivity succeeds, update the route to the instance again.
  2. Or, have the NAT instance itself check its connection and update the route once connectivity is working.
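A minimal sketch of the second idea, where the instance checks itself and then reclaims the route. Everything named here is hypothetical (the check URL, `has_connectivity`, `reclaim_route`), and `ec2` is assumed to be a boto3 EC2 client:

```python
import urllib.request
from typing import Any, Callable, Sequence

# Assumption: placeholder URL, not the project's actual check targets.
CHECK_URLS = ("https://www.example.com",)


def has_connectivity(
    urls: Sequence[str],
    timeout: float = 5.0,
    opener: Callable[..., Any] = urllib.request.urlopen,
) -> bool:
    """Return True as soon as any URL can be fetched from this host."""
    for url in urls:
        try:
            with opener(url, timeout=timeout):
                return True
        except OSError:  # URLError, HTTPError and socket timeouts are OSErrors
            continue
    return False


def reclaim_route(
    ec2: Any,
    route_table_id: str,
    instance_id: str,
    check: Callable[[], bool] = lambda: has_connectivity(CHECK_URLS),
) -> bool:
    """Point the default route back at this instance, but only if the check passes."""
    if not check():
        return False
    ec2.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        InstanceId=instance_id,
    )
    return True
```

One caveat with this approach: the check only exercises the instance's own outbound path. It does not prove that traffic forwarded from the private subnets will get through once the route is switched back.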

I don't think we can use a solution like your first proposal because we do not want a "connectivity blip" - remaining connected is our highest priority. Remember that the connectivity checker runs every minute (by default) so you'd be interrupting the connection quite a lot, potentially, if the NAT instance is still broken.

Option (2) could have sorta the same problem. It could trigger an instance replacement, and the new instances would automatically claim the route at boot, as usual. But if it can't connect because the problem is somewhere else (e.g. the connectivity failure is not due to the NAT instance itself, but some AWS networking issue), then you'd end up in a loop where the new Lambda runs again, finds the NAT gateway as the route, terminates the instance, rinse & repeat.

I like the idea of a self-healing NAT instance, just need to find a practical approach.

bwhaley Apr 20 '24 00:04

I’ll start with just a lambda that resets the system while we figure out the least impactful time/mechanism to call it.

dan-greene-brivo Apr 20 '24 20:04

@bwhaley Hi, in your use case did you ever have to manually terminate an instance? I'm wondering if it's possible in real life for an EC2 instance to crash and lose internet connectivity. I know it will fail over to the managed NAT, but then will the EC2 instance hang there forever until someone terminates it manually? How do you handle this scenario in your environment?

I'm thinking about adding a function to the connectivity tester lambda to terminate the instance if it loses its internet connection for more than 1 minute, maybe? Do you have better ideas for such a case? Also, I noticed the check interval is 5 seconds and you sleep the lambda for 5 seconds 12 times, so does that mean the lambda stays active almost all the time, given that it gets an event every 1 minute and keeps checking every 5 seconds for another minute?

Thank you

ahmedasmar Feb 06 '25 12:02

You can set up an alert on the error "Failed connectivity tests! Replacing route" in the logs. You can also monitor for the route change in CloudTrail, or maybe there's another way to set up an event for the route change.
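As a sketch of both options, here are the two patterns you might start from. The constant and function names are hypothetical. The EventBridge pattern assumes CloudTrail is recording management events in the account, and it matches every `ReplaceRoute` call, including a new instance claiming the route at boot:

```python
import json

# CloudWatch Logs filter pattern: a quoted phrase matches log events containing it.
LOG_FILTER_PATTERN = '"Failed connectivity tests"'

# EventBridge event pattern for EC2 ReplaceRoute calls recorded by CloudTrail.
ROUTE_CHANGE_EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["ec2.amazonaws.com"],
        "eventName": ["ReplaceRoute"],
    },
}


def event_pattern_json() -> str:
    """Serialize the pattern in the string form that EventBridge rules expect."""
    return json.dumps(ROUTE_CHANGE_EVENT_PATTERN)
```

The filter pattern would go into a metric filter on the connectivity checker's log group (for example via the boto3 `logs` client's `put_metric_filter`), and the JSON string into an EventBridge rule (`put_rule` on the `events` client), each wired to whatever alerting you already use.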

My philosophy has been that if there's a problem with the instance, it should fail the status checks and be terminated automatically by autoscaling. If autoscaling doesn't terminate it, then the problem may not be with the instance - it may be on the network or elsewhere in the stack, in which case a new instance won't help. You can get into an infinite loop of terminate => launch => detect errors => terminate. Maybe that's what you want, and if so, you could use a backoff so that it doesn't retry too often.
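For the backoff, something along these lines could work. It's only a sketch: the function names, the 15-minute base, and the one-day cap are assumptions, and the attempt counter and last-replacement timestamp would have to be persisted somewhere between Lambda runs (not shown):

```python
from typing import Optional


def next_retry_delay(
    failed_attempts: int,
    base_seconds: float = 900.0,
    max_seconds: float = 86400.0,
) -> float:
    """Exponential backoff: base * 2**failed_attempts, capped at max_seconds."""
    exponent = min(max(failed_attempts, 0), 32)  # clamp to keep the power small
    return min(base_seconds * (2 ** exponent), max_seconds)


def should_replace_instance(
    now: float,
    last_replacement: Optional[float],
    failed_attempts: int,
) -> bool:
    """Decide whether enough time has passed to try replacing the instance again.

    `now` and `last_replacement` are epoch seconds; `last_replacement` is None
    if no replacement has been attempted since the failover.
    """
    if last_replacement is None:
        return True
    return now - last_replacement >= next_retry_delay(failed_attempts)
```

With these defaults the waits are 15 minutes, then 30 minutes, then an hour, and so on up to a day, so a problem outside the instance doesn't turn into a tight terminate/launch loop.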

If you end up implementing this, do let me know, it might be an interesting feature to include here!

bwhaley Feb 10 '25 20:02