rabbitmq-server Clustering based on ASG fails if one or more nodes in the ASG is terminated

Clustering based on ASG fails if one or more nodes in the ASG is terminated

Open kimma-basefarm opened this issue 7 years ago • 7 comments

RabbitMQ nodes will stop with an error if an ASG contains terminated instances that is no longer possible to describe via an EC2 API endpoint:

2018-01-24 08:38:12.257 [error] <0.214.0> Error fetching node list via EC2 API, request path: /?Action=DescribeInstances&InstanceId.3=i-0532xxxdc49605ea5&InstanceId.4=i-034xxxbdc2ad23fe&Version=2015-10-01, error: "Bad Request"
2018-01-24 08:38:12.257 [error] <0.214.0> Cannot discover any nodes: DescribeInstances API call failed.

As you can see it retrieved the instances in the ASG successfully (instanceID 3 and 4 is populated), but one of these are terminated and no longer possible to "describe", which returns a 500 error from the API for the entire request. Even though there is Healthy/InService hosts in the ASG, the node fails to discover these since describe-instances failed.

Perhaps it shoud only return Healthy/inService nodes from the initial describe autoscaling-group that provides the instance IDs, or run the DescribeInstances API request once per instance id, so that it has the ability to fail gracefully on StandBy/Terminated hosts, but still loop through and discover the InService hosts to cluster with.