athena-datasource

Retry / rate-limit queries that failed due to S3 throttling

Open · skuzzle opened this issue 1 year ago · 2 comments

Is your feature request related to a problem? Please describe.
We sometimes see S3 throttling errors in the UI on some of our dashboards. This happens even for queries that are already cached by Athena's query result reuse feature. I understand these errors might ultimately be caused by sub-optimal partitioning/data layout in our Athena setup (which we are unable to change at the moment). However, I think throttling can naturally occur when there is a lot of data to scan. As I understand S3, throttling happens while S3 is scaling up to handle the number of concurrent requests it receives, and it signals the client to slow down its request rate. The Grafana Athena datasource currently does not handle this situation gracefully.

Describe the solution you'd like
If my understanding of S3 throttling is correct, there should be a client-side retry mechanism with backoff for queries that fail because of S3 throttling. I understand that introducing a rate limit might not be straightforward, as it would likely require tracking some global state on the Grafana server.
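The client-side retry with backoff requested above could look roughly like the following minimal Python sketch. It is not the datasource's code: `retry_with_backoff` and `looks_like_s3_throttling` are hypothetical names, and matching S3's `SlowDown` error code in the error text is an assumed heuristic for recognizing throttling failures.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def looks_like_s3_throttling(err: Exception) -> bool:
    """Heuristic (assumption): treat errors mentioning S3's SlowDown code as throttling."""
    msg = str(err)
    return "SlowDown" in msg or "Slow Down" in msg


def retry_with_backoff(
    fn: Callable[[], T],
    is_throttled: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying throttling failures with exponential backoff and full jitter."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as err:
            attempt += 1
            # Re-raise non-throttling errors and give up once attempts are exhausted.
            if attempt >= max_attempts or not is_throttled(err):
                raise
            # Exponential cap (0.5s, 1s, 2s, ... up to max_delay), then a random wait below it.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

The random ("full jitter") wait matters when many dashboard panels are throttled at the same moment: without it, they would all retry in lockstep and hit S3 again together.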

Describe alternatives you've considered
Sadly, I haven't found any alternatives yet. Ideally, Athena itself would handle this situation more gracefully, but we have not found any corresponding configuration options.

Additional context
We have some automation in place that tests all of our dashboards' Athena queries against Grafana's /api/ds/query endpoint. In these tests we ran into the same throttling issues and were able to overcome them by adding a retry mechanism and lowering the rate limit stepwise.
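The retry-plus-stepwise-rate-lowering approach described for that automation might be sketched as follows. This is a hypothetical helper, not the reporter's actual script: `execute` stands in for whatever sends one query to /api/ds/query, and the default starting rate, step size, and floor are arbitrary assumptions.

```python
import time
from typing import Callable, Iterable, TypeVar

Q = TypeVar("Q")
R = TypeVar("R")


def run_with_stepwise_rate(
    queries: Iterable[Q],
    execute: Callable[[Q], R],
    is_throttled: Callable[[Exception], bool],
    start_rps: float = 10.0,
    step_rps: float = 2.0,
    min_rps: float = 0.5,
    max_retries_per_query: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[list[R], float]:
    """Run queries one by one, lowering the request rate stepwise on throttling.

    Returns the results and the final rate in requests per second."""
    rps = start_rps
    results: list[R] = []
    for query in queries:
        for attempt in range(max_retries_per_query + 1):
            # Pace requests so that at most `rps` are sent per second.
            sleep(1.0 / rps)
            try:
                results.append(execute(query))
                break
            except Exception as err:
                if not is_throttled(err) or attempt == max_retries_per_query:
                    raise
                # Throttled: lower the rate by one step (never below the floor) and retry.
                rps = max(min_rps, rps - step_rps)
    return results, rps
```

In this sketch the rate only ever goes down; an additive-increase/multiplicative-decrease scheme would also raise it again after a run of successful queries.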

skuzzle · Apr 08 '24 07:04

Hi @skuzzle, thanks for the feature request! I looked into it and I can understand why it's an issue, though most of the advice I see about it involves changing the Athena configuration rather than how the queries are issued. I'll move it into the backlog for us to consider.

iwysiu · Apr 09 '24 18:04

If I may add: it would be great to consider expanding the scope of retry / rate limiting (or even delaying query execution instead of executing all queries at the same time) to cover throttling not only by S3 but also by the Athena APIs.

We are currently dealing with this and I am looking into ways to work around it. Let me know if you'd like me to raise this as a separate issue.
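Delaying queries instead of firing them all at once could be sketched as an in-process gate that caps how many queries run concurrently and staggers their start times. `QueryGate` and `run_all` are hypothetical names, not part of Grafana or the plugin; the limits are assumptions, and the gate only coordinates threads within one process, so multiple Grafana server instances would still need shared global state.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")


class QueryGate:
    """Hypothetical process-wide gate: caps concurrent queries and spaces out their starts."""

    def __init__(self, max_concurrent: int, min_start_interval: float = 0.0):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._min_interval = min_start_interval
        self._next_start = 0.0  # monotonic time before which no new query may start

    def run(self, fn: Callable[[], T]) -> T:
        with self._slots:  # block until one of the concurrency slots is free
            with self._lock:
                # Reserve a start time at least min_interval after the previous one.
                start_at = max(time.monotonic(), self._next_start)
                self._next_start = start_at + self._min_interval
            delay = start_at - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            return fn()


def run_all(gate: QueryGate, queries: list[Callable[[], T]], workers: int = 8) -> list[T]:
    """Submit every query at once (as a dashboard refresh does) and let the gate pace them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gate.run, queries))
```

A gate like this could sit in front of both S3-heavy queries and other Athena API calls, since it limits request pressure regardless of which service ends up throttling.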

gsarristz · Mar 12 '25 09:03