python-slack-sdk

Improve rate-limit handling

Open kkerce opened this issue 7 months ago • 10 comments

Reproducible in:

The Slack SDK version

slack-sdk==3.34.0 slackeventsapi==3.0.3

Python runtime version

Python 3.8.5

OS info

90~20.04.1-Ubuntu SMP Tue Apr 22 09:59:53 UTC 2025

Recently my Slack app continuously failed for a few hours in the following manner (US Eastern time).

2025-05-27 11:07:23,389 ERROR    slack_sdk.web.base_client Failed to decode Slack API response: Received a response in a non-JSON format: 
2025-05-27 11:07:23,389 ERROR    slack_sdk.socket_mode.builtin.client Failed to run a request listener: The request to the Slack API failed. (url: https://slack.com/api/team.info)
The server responded with: {'ok': False, 'error': 'Received a response in a non-JSON format: '}

The body of the response from the Slack API was empty.

Later that day, between 2025-05-27 14:16:56 and 14:18:05, I ran the following command seven times.

curl -vH "Authorization: Bearer <token-redacted>" https://slack.com/api/team.info; echo

The seventh run produced the following output.

*   Trying 54.92.199.186:443...
* TCP_NODELAY set
* Connected to slack.com (54.92.199.186) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=slack.com
*  start date: Mar 28 10:53:37 2025 GMT
*  expire date: Jun 26 10:53:36 2025 GMT
*  subjectAltName: host "slack.com" matched cert's "slack.com"
*  issuer: C=US; O=Let's Encrypt; CN=R10
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x5632a8e5e0d0)
> GET /api/team.info HTTP/2
> Host: slack.com
> user-agent: curl/7.68.0
> accept: */*
> authorization: Bearer <token-redacted>
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 429
< x-edge-backend: envoy-www
< via: envoy-edge-iad-swgvnnyy
< x-envoy-ratelimited: true
< x-slack-edge-shared-secret-outcome: no-match
< date: Tue, 27 May 2025 18:18:05 GMT
< server: envoy
<
* Connection #0 to host slack.com left intact

Note in particular:

  • the empty response body
  • the x-envoy-ratelimited: true header

I described the problem to Slack support and in their prompt and helpful response they indicated the x-envoy-ratelimited: true header suggests this was handled via Slack's CDN or API gateway, which in rare cases may result in a 429 with an empty body instead of a JSON-formatted error. Requests are being throttled before reaching the Slack application layer.
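A minimal sketch of how a client could recognize this kind of response, based only on the curl output above (status 429, an `x-envoy-ratelimited: true` header, and an empty body). The function name `is_edge_rate_limited` is hypothetical and not part of slack-sdk.

```python
from typing import Mapping


def is_edge_rate_limited(status: int, headers: Mapping[str, str], body: bytes) -> bool:
    """Hypothetical helper: True when a response matches the throttled curl
    output above (429, x-envoy-ratelimited: true, empty body)."""
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    flagged = lowered.get("x-envoy-ratelimited", "").strip().lower() == "true"
    return status == 429 and flagged and not body.strip()


# The response captured in the seventh curl run:
print(is_edge_rate_limited(429, {"x-envoy-ratelimited": "true"}, b""))
# An ordinary 429 from the API itself, with a JSON body and a Retry-After header:
print(is_edge_rate_limited(429, {"Retry-After": "30"}, b'{"ok": false, "error": "ratelimited"}'))
```

The first call should print `True` and the second `False`, since only the first lacks a body and carries the Envoy header.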

The Python Slack SDK clearly doesn't handle this scenario well, e.g., the SDK's code is unable to honor any rate-limiting error handlers that are configured. Can the handling be improved?

Thanks.

kkerce avatar May 29 '25 15:05 kkerce

Hi @kkerce 👋🏻

Thanks for the detailed issue and for including the response from Slack support! 🙇🏻

@kkerce Could you confirm that my understanding is correct: the Slack API response status code was 429 but the body was empty. This is because the CDN is handling the rate limit instead of the Slack API. Since the status code was 429, the SDK continually retried and always received an empty body. The correct approach would be for the SDK to error when the status code is 429 and there is an x-envoy-ratelimited: true header.

@WilliamBergamin What do you think about extending the SDK's rate limit handler to support the CDN use-case that returns an x-envoy-ratelimited: true header?

mwbrooks avatar May 29 '25 20:05 mwbrooks

Hi @mwbrooks and thanks for taking a look at the issue!

Confirming that most of what you asked me to confirm is true, except that I did not see any behavior from slack-sdk==3.34.0 indicating it recognized the 429 and retried (even though my client's retry_handlers is configured with RateLimitErrorRetryHandler(max_retry_count=2), and I have historically seen retries from the SDK when the Slack API itself indicated 429 [as opposed to the CDN case]).

kkerce avatar May 30 '25 11:05 kkerce

Interesting, the RateLimitErrorRetryHandler should be retrying on every request with a status code 429, regardless of x-envoy-ratelimited: true 🤔

@kkerce could you share how you are configuring the RetryHandler?

WilliamBergamin avatar May 30 '25 16:05 WilliamBergamin

@WilliamBergamin In the following manner.

from slack_sdk.web import WebClient
from slack_sdk.http_retry.builtin_handlers import RateLimitErrorRetryHandler

slackClient = WebClient(token="<redacted>")
slackClient.retry_handlers.append(RateLimitErrorRetryHandler(max_retry_count=2))

Regardless of whether a retry handler was configured via the SDK, wouldn't we expect a different exception than the one I mentioned above?

Failed to decode Slack API response: Received a response in a non-JSON format:

As an aside that is likely relevant for the near future: because this circumstance seems relatively obscure and (my guess) doesn't occur often, whoever ends up testing the SDK will need to somehow simulate rate-limiting at the level of Slack's CDN or API gateway, or otherwise convince Slack to configure a test scenario on their side at the CDN / API gateway level.
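One way such a simulation might look without any help from Slack: a local stub server that answers every request the way the edge did in the curl output (429, `x-envoy-ratelimited: true`, empty body, no `Retry-After`). This is only a sketch using the Python standard library; `EdgeRateLimitedHandler`, `start_stub_server`, and `fetch` are hypothetical names, and the stub does not reproduce anything else about Slack's infrastructure.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EdgeRateLimitedHandler(BaseHTTPRequestHandler):
    """Hypothetical stub: replies to every request like the throttled edge did."""

    def _reply(self):
        # Drain any request body so the client is not left mid-upload.
        length = int(self.headers.get("Content-Length") or 0)
        if length:
            self.rfile.read(length)
        self.send_response(429)
        self.send_header("x-envoy-ratelimited", "true")
        self.send_header("Content-Length", "0")  # empty body, no Retry-After
        self.end_headers()

    do_GET = _reply
    do_POST = _reply

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_stub_server():
    """Start the stub on a free local port; return (server, base_url)."""
    server = HTTPServer(("127.0.0.1", 0), EdgeRateLimitedHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}/api/"


def fetch(url):
    """Return (status, headers, body), including for HTTP error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status, dict(resp.headers), resp.read()
    except urllib.error.HTTPError as err:
        return err.code, dict(err.headers), err.read()
```

If I am reading the SDK correctly, `WebClient` accepts a `base_url` argument, so a test could presumably point a client at the stub's URL and observe how the retry handlers and the JSON decoding behave against this response.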

Thanks for your help.

kkerce avatar May 30 '25 16:05 kkerce

Just my two cents: RateLimitErrorRetryHandler expects a "retry-after" response header, which tells you how long to wait before making the next request. However, in this case, the Envoy rate limiting situation does not provide that information, so I don't think the current retry logic works well. Additionally, I am not sure if retrying will help here, since it seems like a situation where Slack's backend infra (specifically Envoy proxy? or its backend?) is unable to handle a large number of requests.

seratch avatar Jun 02 '25 16:06 seratch

We could attempt to implement a special case for this situation where we set a long retry-after value, such as 2 minutes, when the response has a status code 429 and there is no "retry-after" response header 🤔

But setting a long timeout like this as a default fallback could have unexpected behavior on tasks that are time-sensitive. This value could also be surfaced as a configurable field.
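To make the idea concrete, here is a rough sketch of how the wait time could be chosen: honor "retry-after" when it is present, and otherwise fall back to a configurable value (2 minutes by default, as floated above). The names `retry_delay_seconds`, `fallback_seconds`, and `max_seconds` are hypothetical and do not exist in the SDK; the optional cap is an extra assumption meant to address the time-sensitive case.

```python
from typing import Mapping, Optional

DEFAULT_FALLBACK_SECONDS = 120.0  # the "2 minutes" suggested above; hypothetical default


def retry_delay_seconds(
    headers: Mapping[str, str],
    fallback_seconds: float = DEFAULT_FALLBACK_SECONDS,
    max_seconds: Optional[float] = None,
) -> float:
    """Pick how long to wait before retrying a 429 response."""
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("retry-after")
    try:
        delay = float(raw) if raw is not None else fallback_seconds
    except ValueError:
        # Non-numeric values (such as the HTTP-date form) are not parsed in this sketch.
        delay = fallback_seconds
    delay = max(delay, 0.0)
    if max_seconds is not None:
        # Lets time-sensitive callers bound the wait.
        delay = min(delay, max_seconds)
    return delay


print(retry_delay_seconds({"Retry-After": "30"}))  # header present: use it
print(retry_delay_seconds({}))                     # header absent: use the fallback
print(retry_delay_seconds({}, max_seconds=10.0))   # fallback bounded by the caller's cap
```

A caller that cannot afford a long pause could pass a small `fallback_seconds` or `max_seconds`, while batch jobs could keep the longer default.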

WilliamBergamin avatar Jun 02 '25 20:06 WilliamBergamin

👋 It looks like this issue has been open for 30 days with no activity. We'll mark this as stale for now, and wait 10 days for an update or for further comment before closing this issue out. If you think this issue needs to be prioritized, please comment to get the thread going again! Maintainers also review issues marked as stale on a regular basis and comment or adjust status if the issue needs to be reprioritized.

github-actions[bot] avatar Jul 07 '25 00:07 github-actions[bot]

In the x-envoy-ratelimited: true header situation, can the Python Slack SDK at least simply recognize that it's a rate-limiting situation and not throw a Failed to decode Slack API response: Received a response in a non-JSON format: exception? The response in a non-JSON format exception will tend to lead developers down a path that's ultimately not helpful, e.g., the developer will need to investigate down to the "view all response headers" level (like I indicated here), only to find out it was a rate-limiting problem.
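Roughly what I have in mind, as a standalone sketch rather than a patch against the SDK's actual code: check the status code before attempting to decode the body, and raise a rate-limit-specific error when a 429 arrives with nothing to decode. `RateLimitedError` and `parse_api_response` are hypothetical names used only for illustration.

```python
import json
from typing import Any, Dict, Mapping


class RateLimitedError(Exception):
    """Hypothetical error for a 429 response that has no JSON body to decode."""

    def __init__(self, via_edge: bool):
        self.via_edge = via_edge
        source = "the edge proxy (x-envoy-ratelimited: true)" if via_edge else "the server"
        super().__init__(f"Rate limited by {source}: HTTP 429 with an empty body")


def parse_api_response(status: int, headers: Mapping[str, str], body: str) -> Dict[str, Any]:
    """Decode a JSON response, reporting empty-bodied 429s as rate limiting."""
    if status == 429 and not body.strip():
        lowered = {name.lower(): value for name, value in headers.items()}
        via_edge = lowered.get("x-envoy-ratelimited", "").strip().lower() == "true"
        raise RateLimitedError(via_edge)
    try:
        return json.loads(body)
    except json.JSONDecodeError as err:
        # Any other undecodable body keeps the existing style of message.
        raise ValueError(f"Received a response in a non-JSON format: {body[:100]}") from err
```

With something like this, the log line would name rate limiting directly, and a 429 that does carry a JSON body would still be decoded as before.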

kkerce avatar Jul 07 '25 11:07 kkerce

👋 It looks like this issue has been open for 30 days with no activity. We'll mark this as stale for now, and wait 10 days for an update or for further comment before closing this issue out. If you think this issue needs to be prioritized, please comment to get the thread going again! Maintainers also review issues marked as stale on a regular basis and comment or adjust status if the issue needs to be reprioritized.

github-actions[bot] avatar Aug 18 '25 00:08 github-actions[bot]

See this.

kkerce avatar Aug 18 '25 11:08 kkerce