steampipe-plugin-sdk
Better rate limiting support
Is your feature request related to a problem? Please describe. In a large AWS environment (100+ accounts), Steampipe runs into rate limiting issues on AWS.
Describe the solution you'd like Steampipe should recognize the limits imposed by each API endpoint and should only queue as many API calls as the service's limit can entertain. Once that limit is reached, Steampipe should wait (sleep) until the rate limit has lapsed.
Describe alternatives you've considered Tweaking the MAX_PARALLEL parameter, but this has no effect. There are no other options in the aws plugin for throttling API calls, and no setting to manage rate limits.
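For illustration only (not part of the original report): the behaviour described above amounts to client-side throttling. A minimal Go sketch, assuming a token-bucket limiter from golang.org/x/time/rate and made-up limits, since AWS does not publish per-service limits programmatically:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Hypothetical limit: ~20 requests/second with bursts of up to 5.
	// Real AWS limits vary by service and are not available programmatically.
	limiter := rate.NewLimiter(rate.Limit(20), 5)

	ctx := context.Background()
	for i := 0; i < 50; i++ {
		// Wait blocks until the token bucket allows another call, i.e. the
		// caller sleeps until the rate limit has "lapsed".
		if err := limiter.Wait(ctx); err != nil {
			fmt.Println("context cancelled:", err)
			return
		}
		// A real implementation would issue the AWS API call here.
		fmt.Printf("issuing API call %d at %s\n", i, time.Now().Format("15:04:05.000"))
	}
}
```

The blocking Wait call is the "queue and sleep until the limit lapses" behaviour the request asks for.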
Hi @massyn, as far as I understand it, AWS applies rate limiting per account and per region, so I don't believe the number of AWS accounts alone would affect rate limiting. What would affect how often rate limits are hit is how many queries are run against tables that use the same service for a particular account and region. For example, if multiple queries are made to aws_ec2_instance, aws_ec2_ami, and aws_ec2_key_pair against a single region in an account, then the EC2 API could throttle in that region of that account.
As far as I know, AWS shares how some service rate limits work, e.g., https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html, but these aren't available programmatically and other services aren't as transparent with their rate limits. If you've seen any way to get these limits programmatically though, please share any links you have as we'd be interested in exploring them.
Implementing some extra controls around this information is possible, but it is labor intensive, and because not all services publish their limits, we instead chose to handle throttling with the configuration available in the AWS SDK.
For instance, in the plugin we implement a retry backoff strategy with jitter when we encounter throttling or retryable errors (https://github.com/turbot/steampipe-plugin-aws/blob/2387b04c46101344617c0472b62d9012025e9bad/aws/service.go#L1772), as AWS recommends in https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ and https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/, though we can still hit the maximum number of retries.
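To make the retry-with-jitter idea concrete, here is a minimal, self-contained Go sketch of exponential backoff with full jitter. It is not the plugin's actual implementation (that lives in the service.go file linked above); the error check, delays, and attempt count are illustrative assumptions.

```go
package retryexample

import (
	"math/rand"
	"strings"
	"time"
)

// isThrottleError is a stand-in for the plugin's real retryable-error check.
func isThrottleError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "Throttling")
}

// callWithBackoff retries fn on throttling errors using exponential backoff
// with "full jitter", roughly as described in the AWS posts linked above.
func callWithBackoff(fn func() error, maxAttempts int) error {
	const (
		baseDelay = 25 * time.Millisecond // illustrative, not the plugin's default
		maxDelay  = 5 * time.Second
	)
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(); err == nil || !isThrottleError(err) {
			return err
		}
		// Exponential ceiling: baseDelay * 2^attempt, capped at maxDelay.
		ceiling := baseDelay << uint(attempt)
		if ceiling > maxDelay {
			ceiling = maxDelay
		}
		// Full jitter: sleep a random duration in [0, ceiling).
		time.Sleep(time.Duration(rand.Int63n(int64(ceiling))))
	}
	return err // still throttled after maxAttempts attempts
}
```

Full jitter (a random delay below the ceiling rather than the ceiling itself) spreads retries from many concurrent callers so they don't all re-collide at the same instant.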
Can you please have a look at some follow-up questions I have below, which I think may help us better understand your use case:
- Are you receiving any particular errors? Or do your queries just take a long time to complete?
- Do any specific connections seem to hit throttling more frequently?
- What queries, checks, and/or dashboards are you running?
- Do any seem to cause you to hit rate limits in particular?
- Are there certain tables or services you hit rate limits with more frequently?
- Does increasing max_error_retry_attempts and/or min_error_retry_delay from https://hub.steampipe.io/plugins/turbot/aws#configuration help you hit rate limits less often?
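For reference, those two arguments are set per connection in the AWS plugin's .spc config (HCL). A sketch with illustrative values, not recommended defaults:

```hcl
connection "aws" {
  plugin = "aws"

  # Retry throttled/retryable calls more times before giving up.
  max_error_retry_attempts = 15

  # Minimum delay (in milliseconds) before the first retry.
  min_error_retry_delay = 100
}
```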
Are you receiving any particular errors? Or do your queries just take a long time to complete?
- Just the RateLimit - that's the only error I am seeing in the log
Do any specific connections seem to hit throttling more frequently?
- SSM. The AWS mod controls that run against SSM seem to be the ones getting hit the most.
What queries, checks, and/or dashboards are you running?
- I started with the AWS mod, running just one compliance framework against 120 AWS accounts. Steampipe cannot handle the load. So I created my own mod with a reduced set of controls, and still it runs out of resources.
Do any seem to cause you to hit rate limits in particular?
- Not sure what you mean. From what I can tell, Steampipe starts querying aggressively against whatever is in its queue. With 120 AWS accounts and close to 1,500 EC2 instances, there are a lot of queries going on, and many queries make multiple API calls. When the API rate limit is hit, things back off, and eventually Postgres will crash. From a system behaviour perspective, the rate limit is likely not the root cause but rather a contributor to a bigger issue: how Steampipe handles jobs in its queue, and how it manages data already retrieved or still to be retrieved.
Are there certain tables or services you hit rate limits with more frequently?
- SSM and EC2.
Does increasing max_error_retry_attempts and/or min_error_retry_delay from https://hub.steampipe.io/plugins/turbot/aws#configuration help you hit rate limits less often?
- No. My observation has been that Steampipe runs everything in memory, and eventually the server will crash. I don't see anything in my terminal, but when I take a screenshot of the EC2 instance's console, there is an "Out of Memory" error. I am already running a t2.xlarge instance, and it does not make sense to keep throwing memory at the problem. I will post a more detailed account of my observations in Slack shortly.
@massyn If I understand correctly, the main symptom of how Steampipe and the AWS plugin are hitting the AWS API is that Steampipe eventually causes the server/instance to run out of memory?
Also, can you please share some log messages, console outputs, and/or screenshots of the errors and crashes you're seeing? These can help us try and reproduce and diagnose from our end. Thanks!
@kaidaguerre, I've transferred this issue to the SDK repo for better tracking, as this looks like an SDK issue you are already working on. Please let us know if anything is required from the plugin team.
@massyn have you tried the new rate limiting support in v0.21? https://steampipe.io/blog/memory-management-rate-limiters-diagnostics
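For anyone landing here later: with those rate limiters, a plugin-scoped limiter can be declared in an .spc file. The block below is a rough sketch based on the linked post; the limiter name, numbers, scope tags, and where clause are assumptions for an SSM-heavy workload, so check the post and the Steampipe docs for the exact syntax the SDK supports.

```hcl
plugin "aws" {
  # Hypothetical limiter throttling SSM calls per connection and region.
  limiter "aws_ssm" {
    bucket_size     = 5    # burst size (illustrative)
    fill_rate       = 5    # tokens added per second (illustrative)
    max_concurrency = 10   # concurrent SSM calls (illustrative)
    scope           = ["connection", "region", "service"]
    where           = "service = 'ssm'"
  }
}
```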