Clustering based on ASG fails if one or more nodes in the ASG is terminated

Open kimma-basefarm opened this issue 7 years ago • 7 comments

RabbitMQ nodes will stop with an error if an ASG contains terminated instances that can no longer be described via the EC2 API endpoint:

2018-01-24 08:38:12.257 [error] <0.214.0> Error fetching node list via EC2 API, request path: /?Action=DescribeInstances&InstanceId.3=i-0532xxxdc49605ea5&InstanceId.4=i-034xxxbdc2ad23fe&Version=2015-10-01, error: "Bad Request"
2018-01-24 08:38:12.257 [error] <0.214.0> Cannot discover any nodes: DescribeInstances API call failed.

As you can see, it retrieved the instances in the ASG successfully (InstanceId.3 and InstanceId.4 are populated), but one of these is terminated and can no longer be "described", so the API returns an error (logged above as "Bad Request") for the entire request. Even though there are Healthy/InService hosts in the ASG, the node fails to discover them because the DescribeInstances call failed.

Perhaps it should only return Healthy/InService nodes from the initial describe-auto-scaling-groups call that provides the instance IDs, or run the DescribeInstances API request once per instance ID, so that it can fail gracefully on Standby/Terminated hosts but still loop through and discover the InService hosts to cluster with.
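To make the two suggestions concrete, here is a minimal Python sketch. It is not the plugin's actual (Erlang) code. The function names `in_service_instance_ids` and `describe_each` are hypothetical, the input dict is assumed to follow the shape of a DescribeAutoScalingGroups response (`AutoScalingGroups` -> `Instances` with `InstanceId`, `LifecycleState`, `HealthStatus`), and `describe_one` stands in for whatever per-instance lookup is used.

```python
from typing import Callable, Dict, Iterable, List


def in_service_instance_ids(asg_response: dict) -> List[str]:
    """Option 1: keep only instances the ASG reports as InService and Healthy.

    `asg_response` is assumed to look like a DescribeAutoScalingGroups result.
    """
    ids = []
    for group in asg_response.get("AutoScalingGroups", []):
        for instance in group.get("Instances", []):
            if (instance.get("LifecycleState") == "InService"
                    and instance.get("HealthStatus") == "Healthy"):
                ids.append(instance["InstanceId"])
    return ids


def describe_each(instance_ids: Iterable[str],
                  describe_one: Callable[[str], str]) -> Dict[str, str]:
    """Option 2: look up each instance separately and skip the ones that fail.

    `describe_one` is a hypothetical callable that returns an address for one
    instance ID and raises if that instance can no longer be described.
    """
    found = {}
    for instance_id in instance_ids:
        try:
            found[instance_id] = describe_one(instance_id)
        except Exception:
            # One undescribable (e.g. terminated) instance must not abort discovery.
            continue
    return found
```

With either approach, a single terminated instance no longer prevents the remaining InService hosts from being discovered. Option 1 avoids extra API calls; option 2 costs one call per instance but tolerates IDs that go stale between the two requests.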

kimma-basefarm avatar Jan 24 '18 14:01 kimma-basefarm

Thanks for the details. I edited the issue to be less alarming and clearer.

michaelklishin avatar Jan 24 '18 14:01 michaelklishin

I'm looking into two options:

michaelklishin avatar Feb 07 '18 14:02 michaelklishin

We decided to introduce an integration suite that will use ASGs first, so this will take longer but I hope to get it into 3.7.4.

michaelklishin avatar Feb 07 '18 15:02 michaelklishin

A proper test suite is taking longer than expected, so this is now scheduled for 3.7.5.

michaelklishin avatar Feb 15 '18 11:02 michaelklishin

Related: rabbitmq/rabbitmq-peer-discovery-aws#20.

michaelklishin avatar Mar 08 '18 22:03 michaelklishin

We currently have quite a few things going into 3.7.5 which we'd like to ship earlier. So this may have to wait, re-scheduling for 3.7.6.

michaelklishin avatar Mar 12 '18 15:03 michaelklishin

Any update on this? Was this fixed in 3.7.6, or is it still pending?

man-jiteshm-sportsbet avatar Sep 27 '21 01:09 man-jiteshm-sportsbet