Run resource executions in parallel

Open pauldraper opened this issue 9 years ago • 9 comments

Solves #80

For me, it reduces an S3 diff from 140 to 20 seconds over wifi.

pauldraper avatar Dec 11 '15 17:12 pauldraper

AWS actually limits the number of API calls that we can make in any given second, and this limit applies to the entire account. Running these in parallel will likely cause us to hit our API limits, which will result in random S3 calls failing.

dtorgy avatar Dec 11 '15 17:12 dtorgy

So I went and did a smoke test of this, and it does speed up S3 quite a bit. It also makes us hit our rate limit on ELB and autoscaling, and it's eating Exceptions. After I made the following change in each_difference, we saw rate limits right away:

pool.post do
  begin
    if !aws_resources.include?(key)
      f.call(key, [added_diff(resource)])
    else
      f.call(key, diff_resource(resource, aws_resources[key]))
    end
  rescue => e
    puts "Exception: #{e}"
  end
end
keilan@keilan:~/lucid/cumulus-paul$ time ./bin/cumulus.rb --root /var/lucid/ops/scripts/cumulus/ --config /var/lucid/ops/scripts/cumulus/configuration.json autoscaling diff
AutoScaling Group SupportToolsWebGroup has the following changes:
    Health check type: AWS - EC2, Local - ELB
    Health check grace period: AWS - 900, Local - 600
AutoScaling Group DocumentService has the following changes:
    Health check type: AWS - EC2, Local - ELB
Exception: Rate exceeded
Exception: Rate exceeded
Exception: Rate exceeded
Exception: Rate exceeded
Exception: Rate exceeded
Exception: Rate exceeded
Exception: Rate exceeded
Exception: Rate exceeded
Exception: Rate exceeded
Exception: Rate exceeded

This could be workable if we don't eat exceptions (some exceptions really do need to stop execution). We also need to not do the sync in parallel (each_difference is used by both diffing and syncing), and we should only parallelize the modules that need it (S3 is probably okay).

krjackso avatar Dec 11 '15 20:12 krjackso

Also, do we need Paul's sublime text config in the repo?

msiebert avatar Dec 11 '15 21:12 msiebert

it's eating Exceptions

I'd expect nothing less from Ruby.

pauldraper avatar Dec 11 '15 21:12 pauldraper

Also, do we need Paul's sublime text config in the repo?

No...it's not in the .gitignore :( I'll add it.

pauldraper avatar Dec 11 '15 21:12 pauldraper

AWS actually limits us on the number of API calls that we can make in any given second.

This won't be hard to change. Where are the limits documented?

pauldraper avatar Dec 11 '15 21:12 pauldraper

@krjackso, I think I've fixed the issues.

  • We catch the first exception, shut down the thread pool, and re-raise it.
  • AWS doesn't document the limits for most of their APIs, only the fact that they exist. They say the best way to handle that is retries with exponential backoff, which the Ruby client already does. I made the number of retries configurable, with a suggested 5 instead of the default 3. In fact, I made the client config object accept any of the config parameters for Ruby's AWS client.
  • I made the parallelism configurable, with a suggested default of 5 rather than 10.
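
The first-exception behavior can be sketched with plain stdlib threads (the pool and the name run_in_parallel here are illustrative, not Cumulus's actual implementation):

```ruby
require "thread"

# Run tasks on a bounded pool of worker threads. Remember the first
# exception any task raises, let the remaining workers drain, then re-raise.
def run_in_parallel(tasks, parallelism: 5)
  queue = Queue.new
  tasks.each { |t| queue << t }

  first_error = nil
  mutex = Mutex.new

  workers = parallelism.times.map do
    Thread.new do
      loop do
        task = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end
        begin
          task.call
        rescue => e
          mutex.synchronize { first_error ||= e } # keep only the first failure
        end
      end
    end
  end

  workers.each(&:join)
  raise first_error if first_error
end
```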

pauldraper avatar Dec 14 '15 20:12 pauldraper

The problem I'm seeing with the retries is that we are still hitting the throttling limit, just trying more times. I'm not sure that solution will work for something like ELB, where we expect not to be throttled in other places; @dtorgy should be able to make that decision, though. If we don't want parallelism for the modules we get rate limited on, we could pretty easily have an opt-out value in the config per module. Syncing in parallel shouldn't cause a problem that I can think of: we don't guarantee order when syncing anyway, so it should be no different.
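
A per-module opt-out might look something like this in configuration.json (all key names here are hypothetical, just to show the shape; a parallelism of 1 effectively opts the module out):

```json
{
  "client": {
    "retry_limit": 5,
    "parallelism": 5
  },
  "autoscaling": {
    "parallelism": 1
  },
  "elb": {
    "parallelism": 1
  }
}
```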

krjackso avatar Dec 16 '15 01:12 krjackso

The problem I'm seeing with the retries is that we are still hitting the throttling limit, just trying more times.

Retries are actually the AWS-recommended solution to hitting their undocumented limits.

If an API request exceeds the API request rate for its category, the request returns the RequestLimitExceeded error code. To prevent this error, ensure that your application doesn't retry API requests at a high rate. You can do this by using care when polling and by using exponential back-off retries.

But we don't save much on ELBs anyway -- they take seconds, not minutes, to diff -- so I'm fine with setting the parallelism per API.
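
For what it's worth, the exponential backoff the AWS docs describe is simple to express. This is a stdlib sketch of the idea (not the SDK's actual retry code), where max_retries plays the same role as the client's retry limit discussed above:

```ruby
# Retry a block with exponential backoff and jitter, the strategy AWS
# recommends for "Rate exceeded" / RequestLimitExceeded errors.
def with_backoff(max_retries: 5, base_delay: 0.1)
  attempts = 0
  begin
    yield
  rescue => e
    attempts += 1
    raise if attempts > max_retries
    # Delay doubles each attempt; rand adds "full jitter" so parallel
    # callers don't all retry at the same instant.
    sleep(base_delay * (2**(attempts - 1)) * rand)
    retry
  end
end
```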

pauldraper avatar Dec 16 '15 06:12 pauldraper