cluster-glue
ec2 ocf resource retry
Concerns: cluster-glue/lib/plugins/stonith/external/ec2
It seems to me that there is no retry mechanism in the EC2 OCF script. AWS EC2 API calls can be throttled if more than 10000 API requests per second are made. In that case the script does not report any status and treats the resource as being in a bad state, which ends with the STONITH device getting stopped.
A "resource cleanup" operation brings the STONITH device back to an operational state after such failures.
/var/log/messages:
2021-09-16T16:02:04.751248+00:00 <MYHOST> external/ec2(res_AWS_STONITH)[31700]: info: status check for <MYINSTANCEID> is    <-- Missing instance status report after "is" keyword
2021-09-16T16:02:04.760725+00:00 <MYHOST> external/ec2(res_AWS_STONITH)[31694]: WARN: Already fenced (Instance status = ). Aborting fence attempt.
2021-09-16T16:02:13.742017+00:00 <MYHOST> external/ec2(res_AWS_STONITH)[32004]: ERROR: Operation status failed: 1
Maybe some kind of fault tolerance would be nice to have I guess.
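To make the request concrete, here is a minimal sketch of the kind of fault tolerance meant above: retry the status query with exponential backoff and jitter whenever it comes back empty. It is written in Python purely for illustration and is not the plugin's code. `get_status_with_retry`, `query_status` and all default values are hypothetical names and numbers chosen for the example.

```python
import random
import time
from typing import Callable, Optional


def get_status_with_retry(
    query_status: Callable[[], str],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call query_status until it returns a non-empty string.

    query_status is a hypothetical stand-in for whatever fetches the
    instance state. An empty result is treated as "throttled, try again"
    rather than as a real status.
    Returns the status, or None if every attempt came back empty.
    """
    for attempt in range(max_attempts):
        status = query_status().strip()
        if status:
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff capped at max_delay, with full jitter so
            # that several nodes do not retry in lockstep.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            sleep(delay)
    return None
```

With a stub that returns an empty string twice and then `running`, the function returns `running` on the third attempt. In a real agent the total retry time would presumably have to stay below the timeout configured for the status operation, otherwise the operation fails anyway.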
IIRC, none of the stonith plugins does that, i.e. runs in a loop until the status is correct, so this would set a precedent. A question: how often do you check the status? If it's too often and the device (in this case AWS) is flaky, then you may try increasing the interval.
#35 addresses this. The API bucket the agent uses is shared across the account's whole region and is fairly small, so simply extending the interval doesn't help much beyond a certain point.
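A toy simulation may help show why the interval stops mattering. It models the shared bucket as a generic token bucket that other callers in the same account and region drain faster than it refills. Everything here is an assumption made for illustration: the capacity, refill rate and load are invented numbers, not AWS's actual limits, and `TokenBucket` and `monitor_failures` are hypothetical names.

```python
from typing import Tuple


class TokenBucket:
    """Generic token bucket; one token is needed per API call."""

    def __init__(self, capacity: float, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.last = 0.0

    def try_take(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, up to capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def monitor_failures(
    interval: int,
    duration: int,
    other_load_per_sec: int,
    capacity: float = 10,
    refill_per_sec: float = 2,
) -> Tuple[int, int]:
    """Return (throttled monitor calls, total monitor calls)."""
    bucket = TokenBucket(capacity, refill_per_sec)
    failures = 0
    calls = 0
    for t in range(1, duration + 1):
        # Other tools sharing the bucket make their calls first.
        for _ in range(other_load_per_sec):
            bucket.try_take(float(t))
        # The fencing agent checks status once per interval.
        if t % interval == 0:
            calls += 1
            if not bucket.try_take(float(t)):
                failures += 1
    return failures, calls
```

Under these made-up numbers, once the other callers keep the bucket empty, every status check is throttled whether it runs every 10 seconds or every 60. A longer interval only reduces how many failures are logged, not the chance that any single check fails.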