ec2 ocf resource retry

Open nasjomach opened this issue 4 years ago • 2 comments

Concerns: cluster-glue/lib/plugins/stonith/external/ec2

It seems to me that there is no retry mechanism in the EC2 OCF script. AWS EC2 API calls can be throttled if more than 10000 API requests per second are made. In that case the script does not report any status and treats the resource as being in a bad state, which ends with the STONITH device getting stopped.

Performing a "resource cleanup" operation brings the STONITH resource back into an operational state after such failures.

/var/log/messages:

2021-09-16T16:02:04.751248+00:00 <MYHOST> external/ec2(res_AWS_STONITH)[31700]: info: status check for <MYINSTANCEID> is   <-- Missing instance status report after "is" keyword

2021-09-16T16:02:04.760725+00:00 <MYHOST> external/ec2(res_AWS_STONITH)[31694]: WARN: Already fenced (Instance status = ). Aborting fence attempt.

2021-09-16T16:02:13.742017+00:00 <MYHOST> external/ec2(res_AWS_STONITH)[32004]: ERROR: Operation status failed: 1

Maybe some kind of fault tolerance would be nice to have I guess.
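For illustration, the kind of fault tolerance suggested here could look roughly like the sketch below: retry the status query with exponential backoff and jitter, and treat an empty status as "unknown" instead of "already fenced". This is not the agent's code (the plugin itself is a shell script); `query`, `get_status_with_retry`, `StatusUnavailable`, and all default values are hypothetical names and numbers chosen for the example.

```python
import random
import time


class StatusUnavailable(Exception):
    """Raised when no usable instance status could be obtained."""


def get_status_with_retry(query, attempts=5, base_delay=1.0, max_delay=30.0,
                          sleep=time.sleep, rng=random.random):
    """Call query() until it returns a non-empty status string.

    `query` is a hypothetical callable standing in for the agent's EC2 API
    call. It is assumed to return "" (or raise) when the request was
    throttled, which matches the empty status seen in the log above.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            status = query()
            if status:  # e.g. "running" or "stopped"
                return status
        except Exception as exc:  # a real agent would catch narrower errors
            last_error = exc
        if attempt < attempts - 1:
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * rng())
    # An empty answer is reported as "unknown", never as "already fenced".
    raise StatusUnavailable(
        "no instance status after %d attempts" % attempts
    ) from last_error
```

The total time spent retrying has to stay below the operation timeout configured for the STONITH resource, otherwise the cluster would still see the monitor operation fail.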

nasjomach, Sep 30 '21 09:09

IIRC, none of the stonith plugins does that, i.e. runs in a loop until the status is correct, so this would set a precedent. A question: how often do you check the status? If it's too often and the device (in this case AWS) is flaky, then you may try increasing the interval.

dmuhamedagic, Oct 06 '21 14:10

#35 addresses this. The API bucket the agent uses is shared across the account's whole region and is fairly small, so simply extending the interval doesn't help much after a point.
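To make the shared-bucket point concrete, here is a toy token-bucket simulation. It is only a sketch: the capacity, refill rate, and call rates are made-up numbers, not AWS limits, and `TokenBucket` and `throttled_checks` are hypothetical names. It models one bucket that the agent shares with every other API caller in the same account and region.

```python
class TokenBucket:
    """Toy token bucket: refills at a fixed rate up to a fixed capacity."""

    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)

    def advance(self, seconds):
        # Add tokens for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + seconds * self.refill_per_second)

    def take(self):
        # Returns False when the call would be throttled.
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def throttled_checks(interval, other_calls_per_second, duration=600,
                     capacity=100, refill_per_second=20):
    """Count agent status checks that get throttled within `duration` seconds.

    All numeric defaults are illustrative assumptions, not real AWS quotas.
    """
    bucket = TokenBucket(capacity, refill_per_second)
    throttled = 0
    for second in range(1, duration + 1):
        bucket.advance(1)
        # Other tools in the same account and region draw from the bucket.
        for _ in range(other_calls_per_second):
            bucket.take()
        # The agent only checks every `interval` seconds.
        if second % interval == 0 and not bucket.take():
            throttled += 1
    return throttled
```

In this model, once the other callers consume more than the refill rate, the agent's check is throttled whether it runs every 10 seconds or every 60 seconds: a longer interval reduces the number of failed checks, not the chance that any given check fails.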

Thr3d, Mar 28 '22 18:03