databricks-sdk-py
databricks-sdk-py copied to clipboard
Additional retries on `DatabricksError: Workspace X exceeded the concurrent limit of Y requests`
Changes
Platform doesn't seem to send Retry-After
header correctly with this response, otherwise, the error would have been TimeoutError
. This PR adds retries for error responses with this message.
Fixes https://github.com/databrickslabs/ucx/issues/401
Tests
- [ ]
make test
run locally - [ ]
make fmt
applied - [ ] relevant integration tests applied
Codecov Report
All modified lines are covered by tests :white_check_mark:
Files | Coverage Δ | |
---|---|---|
databricks/sdk/core.py | 80.36% <ø> (ø) |
:loudspeaker: Thoughts on this report? Let us know!.
@pietern yes, with Retry-After
from platform we'd go through https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/retries.py#L35-L39
but now I want it to go through https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/retries.py#L40-L41
All of transient errors eventually have to go away - https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/core.py#L1096-L1102, but there'll always be cases when something is not mapped out correctly.
It's a fix for our SEV1 issue - e.g. https://github.com/databrickslabs/ucx/issues/401 happens sometimes 6 hours into running a workflow, which could be frustrating.
Why not retry on all 429s? I get the other bits, but for this issue, if we simply retry on all 429s, we fix this particular error, as well as the others that already produce the right status code but don't include the Retry-After
header.
Why not retry on all 429s?
@pietern we can, sure, preferably after https://github.com/databricks/databricks-sdk-py/pull/376 is merged. it'll be straightforward to add except TooManyRequests as err: if err.retry_after_seconds is not None: sleep_seconds = err.retry_after_seconds
. E.g. as a follow-up to merging #376, this added line can go away.
without #376 it's going to be more hairy than adding a string match 🤷
and it's a fix for our SEV1.
#402 ensures that the SDK will retry when the Retry-After header is missing, and I'm filing ES tickets internally with teams whose APIs are not compliant by responding with other status codes when rate limits are being reached. Would that be sufficient? I'd prefer not to add more edge cases into the SDK if possible.