Do not allow auto scaling group to terminate unhealthy instances
I'm not exactly sure I understand the use case for killing unhealthy Couchbase EC2 instances. This seems like an overly destructive operation, particularly because you don't have an opportunity to troubleshoot or correct/failover the node gracefully and re-balance the cluster.
I encountered a scenario in my test environment where a bad query caused excessive CPU usage on multiple nodes, which in turn caused the nodes to fail their health checks. The ASG terminated the instances and I lost data/indexes.
For now, I have added the following attribute to the aws_autoscaling_group: suspended_processes = ["Terminate", "ReplaceUnhealthy"] to prevent terminations. Once the cluster is spun up, I don't really want/need the ASG to terminate nodes.
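For reference, a minimal sketch of that workaround. Only `suspended_processes` and its two values come from the description above; the resource name, sizes, and the `var.*` inputs are placeholders, not the module's actual configuration.

```hcl
# Hypothetical ASG definition; every name and value except
# suspended_processes is a placeholder for illustration.
resource "aws_autoscaling_group" "couchbase" {
  name                 = "couchbase-cluster"           # placeholder
  launch_configuration = var.launch_configuration_name # assumed input
  vpc_zone_identifier  = var.subnet_ids                # assumed input

  min_size         = 3
  max_size         = 3
  desired_capacity = 3

  # Stop the ASG from terminating instances and from replacing
  # instances that fail health checks.
  suspended_processes = ["Terminate", "ReplaceUnhealthy"]
}
```

Note that suspending `Terminate` also blocks scale-in terminations, which is acceptable here because the cluster size is fixed.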
Actually, now that I think about it, I don't really understand why you would even want the nodes to be in an auto-scaling group with the way the current modules work. I could see maybe adding index/query nodes based on demand, but the modules don't currently do that, as far as I can tell.
Some options might be: expose suspended_processes as a variable so that it can be set, and default it to not terminating instances. You could also remove the nodes from the ASG entirely and just launch the desired number of nodes directly.
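The first option could look roughly like the sketch below. The variable name, description, and resource name are my assumptions, not existing module code, and the syntax assumes Terraform 0.12 or later.

```hcl
# Hypothetical module input; not part of the current modules.
variable "suspended_processes" {
  description = "Auto Scaling processes to suspend for the Couchbase ASG."
  type        = list(string)

  # Proposed default: never terminate or replace nodes automatically.
  default = ["Terminate", "ReplaceUnhealthy"]
}

resource "aws_autoscaling_group" "couchbase" {
  # ... other ASG arguments (sizes, launch configuration, subnets) omitted ...

  suspended_processes = var.suspended_processes
}
```

Anyone who does want automatic replacement could then set `suspended_processes = []` when calling the module.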
If neither of these is desirable, I'm curious to know why an ASG is being used as opposed to just launching a specified number of instances.
The idea is that if a node is failing health checks, it's, well, unhealthy, and you're better off replacing it automatically than waiting for someone to look at it and fix it manually. If you don't have sufficient replication, that may result in data loss, of course, so it's not the right solution in all cases, and protecting your instances may be necessary.
This repo is being archived; feel free to use a fork if necessary.