Amazon EC2 "503 request limit exceeded" errors

Open andreaturli opened this issue 11 years ago • 14 comments

When running against EC2, the following error shows up intermittently:

org.jclouds.http.HttpResponseException: command: POST https://ec2.eu-west-1.amazonaws.com/ HTTP/1.1 failed with response: HTTP/1.1 503 Service Unavailable; content: [Request limit exceeded.]

This issue was also reported and discussed at https://groups.google.com/forum/#!msg/jclouds-dev/WtNzfqtNfuE/PrYXsjP8RTYJ

andreaturli avatar Jan 21 '13 17:01 andreaturli

Hi there. I will also chime in here that this bug is hitting me as well. The more nodes you attempt to create, the more likely you are to receive the "request limit exceeded" error from EC2. For my part, I'm using Whirr as the wrapper over jclouds.

I did see reference to this back in May 2012 here: https://groups.google.com/forum/#!msg/jclouds-dev/MLYsvOS025o/n1CtL5yGhasJ

Does it look like there's any hope of solving this one?

spragues-trulia avatar Feb 05 '13 02:02 spragues-trulia

Hi there. There's a significant amount of work towards this in 1.6.

This is the larger issue about controlling commands better: https://github.com/jclouds/jclouds/issues/1089

This is in 1.6.0-alpha.2 and changes polling for active instances to use multi-id describe calls: https://github.com/jclouds/jclouds/commit/bd4f5cfba2d34a6e995e1c29cffc827979961cff

This is the start for openstack, where the issue arises more often. A similar exception coercion on 503 may be possible on ec2, depending on whether retryAfter information is available. If not, the only approach is to further work on reducing calls: https://github.com/jclouds/jclouds/pull/1056

There's more to do, and this is not forgotten. I'll keep this open until it is sorted, guessing by March depending on if anyone helps.

codefromthecrypt avatar Feb 05 '13 02:02 codefromthecrypt

Awesome Adrian. Thank you for the update!

spragues-trulia avatar Feb 05 '13 18:02 spragues-trulia

Hi Adrian, just checking in to see if all is okay! Vitamins are being taken, pizza is still being delivered on time, crime is low in the neighborhood and possibly, maybe, this work in 1.6 is proving doable? :)

cheers!

spragues-trulia avatar Mar 13 '13 20:03 spragues-trulia

Hey Adrian. I imagine you are super busy, but is there any way you can throw me a bone on this one? Just looking for any kind of update.

spragues-trulia avatar Mar 18 '13 21:03 spragues-trulia

Hi Stephen

Just looking for any kind of update.

Adrian will certainly be better placed to give a definitive update - just wondering whether you've had a chance to test recently using the latest 1.6.0-rc.1 release? That should include some of the changes referenced in this issue, and perhaps is already helping improve the situation a little.

demobox avatar Mar 19 '13 03:03 demobox

Hi, thanks for replying. I'm actually using Apache Whirr (which uses the jclouds libs), and after giving it my best shot I've found that the current release of that doesn't work with the newest release of this. So now I've got to regroup and figure out where to go from here.

Thanks!

spragues-trulia avatar Mar 19 '13 20:03 spragues-trulia

current release of that doesn't work with the newest release of this

Ack. Sorry to hear, Stephen. @abayer: I see a patch to upgrade to 1.5.8. Is there a chance of looking at 1.6.0-rc.1?

demobox avatar Mar 19 '13 21:03 demobox

I've been wanting this one for a long time since I have experience generating 20 4-node clusters. I avoid HTTP 503 (Request limit exceeded) by ad hoc timing at the top level (time between consecutive cluster creates) and by only doing one cluster create at a time. This is not ideal since it requires that I leave plenty of slack to avoid an avalanche and I want the clusters ready for a deadline.

Worst case scenario: everyone and everything tries harder to contact AWS EC2 services. In 2011, my client was locked out of the EC2 API and the AWS Console for 6 hours. I have not seen that again, so I hope AWS realized their user experience error and fixed it.

Note that getting a 503 means command-line Whirr will quit and the allocated nodes will be completely inaccessible in the short term. I call these "orphaned nodes", and human attention is needed to clean them up. (I am adding auto-destruct timers to my custom AMIs to clean these up.)

AWS EC2 will still fulfill API requests, but slow them down (increasing average latency), before it kills the conversation with HTTP 503. The duration of a list-instances or tag-instances request increases dramatically (10x to 50x) before the HTTP 503, so detecting a 3x increase in latency should be sufficient to trigger "extra patience".
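To illustrate the latency idea, here is a minimal Java sketch of a detector that keeps a smoothed baseline of request durations and flags any sample at or above a configurable multiple of it (3x in the comment above). Everything here is an assumption for illustration: `LatencyThrottleDetector`, its parameters, and the exponential-average baseline are hypothetical and are not part of jclouds or Whirr.

```java
import java.time.Duration;

/** Hypothetical helper (not a jclouds class): flags likely throttling from request latency. */
final class LatencyThrottleDetector {
    private final double triggerRatio;  // e.g. 3.0 for the "3x increase" heuristic
    private final double smoothing;     // weight of each new sample in the baseline average
    private double baselineMillis = -1; // negative means no baseline has been recorded yet

    LatencyThrottleDetector(double triggerRatio, double smoothing) {
        this.triggerRatio = triggerRatio;
        this.smoothing = smoothing;
    }

    /** Records one request duration and returns true if it looks throttled. */
    synchronized boolean record(Duration latency) {
        double sample = latency.toMillis();
        if (baselineMillis < 0) {
            baselineMillis = sample; // the first sample seeds the baseline
            return false;
        }
        if (sample >= triggerRatio * baselineMillis) {
            return true; // slow outliers are deliberately not folded into the baseline
        }
        baselineMillis = (1 - smoothing) * baselineMillis + smoothing * sample;
        return false;
    }

    public static void main(String[] args) {
        LatencyThrottleDetector detector = new LatencyThrottleDetector(3.0, 0.2);
        long[] samplesMillis = {100, 110, 400, 120};
        for (long ms : samplesMillis) {
            System.out.println(ms + " ms -> throttled? " + detector.record(Duration.ofMillis(ms)));
        }
    }
}
```

A caller would time each list-instances or tag-instances request and pass the duration to `record`; a `true` result is the cue to slow down before EC2 answers with a 503.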

I don't know if OpenStack does the same, but the source code is available. What I would like to see is REST services add HTTP headers to responses to provide "hold off for N secs" meta-comments. I've done this in REST services I implemented to add detailed error messages since HTTP codes can have multiple meanings.
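HTTP already defines a standard `Retry-After` response header for this kind of "hold off" hint, carrying either a number of seconds or an HTTP date. Whether EC2 actually sends it on a 503 is not established in this thread. As a rough sketch of how a client could honor such a header if a service did send it, using only the JDK (the class name `RetryAfterParser` is hypothetical):

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

/** Hypothetical helper (not a jclouds class): turns a Retry-After value into a wait time. */
final class RetryAfterParser {
    private RetryAfterParser() {}

    /** Returns the suggested wait, or empty if the header is absent or unparseable. */
    static Optional<Duration> parse(String headerValue, Instant now) {
        if (headerValue == null) {
            return Optional.empty();
        }
        String value = headerValue.trim();
        if (value.isEmpty()) {
            return Optional.empty();
        }
        // Form 1: delay in whole seconds, e.g. "120".
        if (value.chars().allMatch(Character::isDigit)) {
            try {
                return Optional.of(Duration.ofSeconds(Long.parseLong(value)));
            } catch (NumberFormatException e) {
                return Optional.empty(); // too many digits to fit in a long
            }
        }
        // Form 2: an HTTP date, e.g. "Thu, 01 Jan 1970 00:01:00 GMT".
        try {
            Instant retryAt = ZonedDateTime
                    .parse(value, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
            Duration wait = Duration.between(now, retryAt);
            return Optional.of(wait.isNegative() ? Duration.ZERO : wait);
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("120", Instant.now()));
        System.out.println(parse("not a valid value", Instant.now()));
    }
}
```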

Implementation Comments:

  1. API operation-specific retry timing helps (especially if the number of open sockets counts against you), but does not address the base request rate.
  2. Eliminating unnecessary requests is essential, of course.
  3. Being able to recover from a 503 would be wonderful, but might be difficult since assessing remote state requires making more requests.
  4. It should be possible to detect throttling before the HTTP 503, at least for AWS EC2. "Extra patience" could be implemented as a globally enforced hold on AWS EC2 requests for N seconds (settable property), perhaps with a multiplier like 1.3 for each incident.

Obviously, dynamic anti-throttling requires a centralized flow control mechanism that can sleep before making a request when the "Extra patience" countdown timer is > 0. Also, there should be no adverse overhead if this mechanism is disabled.
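The centralized hold described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ExtraPatienceGate`, its constructor parameters, and the injected clock are hypothetical names, not jclouds API. Each throttling incident starts a hold, each further incident multiplies the hold length (1.3 in the comment above), and a disabled gate does nothing.

```java
import java.util.function.LongSupplier;

/** Hypothetical global gate (not a jclouds class) enforcing an "extra patience" hold. */
final class ExtraPatienceGate {
    private final boolean enabled;
    private final long baseHoldMillis;  // the "N seconds" settable property, in milliseconds
    private final double multiplier;    // e.g. 1.3, applied on each further incident
    private final LongSupplier clock;   // current time in millis; injectable for testing
    private long currentHoldMillis;     // 0 until the first incident
    private long holdUntilMillis;

    ExtraPatienceGate(boolean enabled, long baseHoldMillis, double multiplier, LongSupplier clock) {
        this.enabled = enabled;
        this.baseHoldMillis = baseHoldMillis;
        this.multiplier = multiplier;
        this.clock = clock;
    }

    /** Call when throttling is detected (slow responses or an HTTP 503). */
    synchronized void reportThrottling() {
        if (!enabled) {
            return;
        }
        currentHoldMillis = currentHoldMillis == 0
                ? baseHoldMillis
                : Math.round(currentHoldMillis * multiplier);
        holdUntilMillis = clock.getAsLong() + currentHoldMillis;
    }

    /** Milliseconds callers should still wait before issuing a request. */
    synchronized long remainingHoldMillis() {
        if (!enabled) {
            return 0;
        }
        return Math.max(0, holdUntilMillis - clock.getAsLong());
    }

    /** Call before every request; sleeps while the countdown is above zero. */
    void awaitPermission() throws InterruptedException {
        long wait = remainingHoldMillis();
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExtraPatienceGate gate = new ExtraPatienceGate(true, 50, 1.3, System::currentTimeMillis);
        gate.reportThrottling();
        gate.awaitPermission(); // blocks for roughly the base hold
        System.out.println("remaining hold: " + gate.remainingHoldMillis() + " ms");
    }
}
```

This sketch never resets the hold length after a quiet period; a real implementation would probably want some decay, which the comment above does not specify.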

tralfamadude avatar Mar 19 '13 23:03 tralfamadude

I just tested using the 1.6.0-rc.1 release, and I'm getting this error when jclouds attempts to customize my nodes.

I requested 32 CC2.8xlarge instances in a single placement group. I can split up the deployment, but that's problematic since Amazon occasionally doesn't have the capacity to put them all in one group, so I end up with half the nodes I need already reserved.

EDIT: Further investigation reveals that the problem doesn't manifest when I set: properties.setProperty(AWSEC2Constants.PROPERTY_EC2_GENERATE_INSTANCE_NAMES, "false");

charlesmunger avatar Mar 20 '13 06:03 charlesmunger

Well... I've pretty much given up on this, but I am kinda curious if it ever got resolved. Something tells me no.

spragues-trulia avatar May 02 '13 21:05 spragues-trulia

Well, to add context: many of us have been busy getting jclouds ready for its transition into Apache, which takes time away from issues like this for the sake of a long-term greater good. Please follow https://github.com/jclouds/jclouds/issues/1576 and open a JIRA on Apache Incubator jclouds as soon as it is up.

codefromthecrypt avatar May 02 '13 22:05 codefromthecrypt

Did you try it with the workaround I posted above?

charlesmunger avatar May 02 '13 22:05 charlesmunger

@adriancole - that's good to hear. Cool.

@charlesmunger - yeah, that looks like Java code. I'm a layer or two above that, as I access the jclouds libs via Apache Whirr, and it's not clear to me how to influence that setting from there. I will nose around though. Thanks.

spragues-trulia avatar May 02 '13 22:05 spragues-trulia