databricks-sdk-py icon indicating copy to clipboard operation
databricks-sdk-py copied to clipboard

Additional retries on `DatabricksError: Workspace X exceeded the concurrent limit of Y requests`

Open nfx opened this issue 1 year ago • 5 comments

Changes

Platform doesn't seem to send Retry-After header correctly with this response, otherwise, the error would have been TimeoutError. This PR adds retries for error responses with this message.

Fixes https://github.com/databrickslabs/ucx/issues/401

Tests

  • [ ] make test run locally
  • [ ] make fmt applied
  • [ ] relevant integration tests applied

nfx avatar Oct 07 '23 20:10 nfx

Codecov Report

All modified lines are covered by tests :white_check_mark:

Files Coverage Δ
databricks/sdk/core.py 80.36% <ø> (ø)

:loudspeaker: Thoughts on this report? Let us know!.

codecov-commenter avatar Oct 07 '23 20:10 codecov-commenter

@pietern yes, with Retry-After from platform we'd go through https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/retries.py#L35-L39

but now I want it to go through https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/retries.py#L40-L41

All of transient errors eventually have to go away - https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/core.py#L1096-L1102, but there'll always be cases when something is not mapped out correctly.

It's a fix for our SEV1 issue - e.g. https://github.com/databrickslabs/ucx/issues/401 happens sometimes 6 hours into running a workflow, which could be frustrating.

nfx avatar Oct 10 '23 18:10 nfx

Why not retry on all 429s? I get the other bits, but for this issue, if we simply retry on all 429s, we fix this particular error, as well as the others that already produce the right status code but don't include the Retry-After header.

pietern avatar Oct 10 '23 18:10 pietern

Why not retry on all 429s?

@pietern we can, sure, preferably after https://github.com/databricks/databricks-sdk-py/pull/376 is merged. it'll be straightforward to add except TooManyRequests as err: if err.retry_after_seconds is not None: sleep_seconds = err.retry_after_seconds. E.g. as a follow-up to merging #376, this added line can go away.

without #376 it's going to be more hairy than adding a string match 🤷

and it's a fix for our SEV1.

nfx avatar Oct 10 '23 18:10 nfx

#402 ensures that the SDK will retry when the Retry-After header is missing, and I'm filing ES tickets internally with teams whose APIs are not compliant by responding with other status codes when rate limits are being reached. Would that be sufficient? I'd prefer not to add more edge cases into the SDK if possible.

mgyucht avatar Oct 24 '23 08:10 mgyucht