rabbitmq-server
rabbitmq-server copied to clipboard
Clustering based on ASG fails if one or more nodes in the ASG is terminated
RabbitMQ nodes will stop with an error if an ASG contains terminated instances that is no longer possible to describe via an EC2 API endpoint:
2018-01-24 08:38:12.257 [error] <0.214.0> Error fetching node list via EC2 API, request path: /?Action=DescribeInstances&InstanceId.3=i-0532xxxdc49605ea5&InstanceId.4=i-034xxxbdc2ad23fe&Version=2015-10-01, error: "Bad Request"
2018-01-24 08:38:12.257 [error] <0.214.0> Cannot discover any nodes: DescribeInstances API call failed.
As you can see it retrieved the instances in the ASG successfully (instanceID 3 and 4 is populated), but one of these are terminated and no longer possible to "describe", which returns a 500 error from the API for the entire request. Even though there is Healthy/InService hosts in the ASG, the node fails to discover these since describe-instances failed.
Perhaps it shoud only return Healthy/inService nodes from the initial describe autoscaling-group that provides the instance IDs, or run the DescribeInstances API request once per instance id, so that it has the ability to fail gracefully on StandBy/Terminated hosts, but still loop through and discover the InService hosts to cluster with.
Thanks for the details. I edited the issue to be less alarming and clearer.
I'm looking into two options:
- Filtering by
instance-state-name - Checking instance status using DescribeInstanceStatus first
We decided to introduce an integration suite that will use ASGs first, so this will take longer but I hope to get it into 3.7.4.
A proper test suite is taking longer than expected, so this is now scheduled for 3.7.5.
Related: rabbitmq/rabbitmq-peer-discovery-aws#20.
We currently have quite a few things going into 3.7.5 which we'd like to ship earlier. So this may have to wait, re-scheduling for 3.7.6.
any update on this? was this fixed in 3.7.6 or still pending ?