[BUG] HTTP 403 response is categorized as Downstream Service Failure and being retried for an hour
Describe the bug
Hi Team,
We run Spark jobs on Databricks using Flyte.
Came across an interesting issue. When we use an expired Databricks API token, our Databricks tasks run for an hour and then fail with an error like this:
I did a quick investigation and found that:
- Flyte categorizes the HTTP 403 Forbidden error as a Downstream Service Failure and applies the Downstream System Retry here
- The task is retried 30 times. I guess this is a default value, since we did not set MaxNodeRetriesOnSystemFailures in our FlytePropeller config
This results in a pretty bad user experience because, in some cases, users override the global Databricks API token with their own token in TaskConfig for testing purposes. When their token expires, the task runs for an hour and then fails with the above error.
Is there a way to make the user experience smoother for this use case? I guess lowering MaxNodeRetriesOnSystemFailures to a reasonable number (e.g. 3) would reduce the response time, but it's a global setting for all kinds of downstream system failures. Can we control which downstream errors are considered service failures? I'm thinking of excluding some HTTP response codes from system failures (e.g. 403).
Thank you, Robert
Expected behavior
Users get quick feedback on the issue - Downstream service (Databricks) returned HTTP 403 when the job was submitted.
Additional context to reproduce
No response
Screenshots
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes
🤖
It's great to see you've identified the issue and thought about potential solutions. Your insights into the impact on user experience and the suggestion to adjust the MaxNodeRetriesOnSystemFailures setting are valuable. Your proposed solution to control which HTTP response codes are considered as service failures is also worth considering. Your contributions to addressing this issue would be highly appreciated. Thank you for your thoughtful input!
Hey @rambrus,
The fix should go here: https://github.com/flyteorg/flyte/blob/master/flyteplugins/go/tasks/plugins/webapi/databricks/plugin.go#L249
Add a new case for http.StatusForbidden (HTTP 403) and return a failure (just like the http.StatusUnauthorized case)...
In the same vein, I think http.StatusUnauthorized should actually be retryable... typically it means the client needs to authenticate again... but I guess that's a problem for another day :-)
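For illustration, the suggested change could look roughly like the sketch below. It is not the plugin's actual code: the `phase` type, its constants, and `phaseForFailureStatus` are hypothetical stand-ins, and only the `net/http` status constants are real. The idea is that 403 is grouped with 401 as a terminal failure, while other codes keep going through the retry path.

```go
package main

import "net/http"

// phase is a hypothetical stand-in for the plugin's phase result;
// it is not the Flyte PhaseInfo type.
type phase string

const (
	// phaseFailed: terminal failure, surfaced to the user without retries.
	phaseFailed phase = "failed"
	// phaseRetryable: treated as a downstream system failure and retried.
	phaseRetryable phase = "retryable"
)

// phaseForFailureStatus maps a failing downstream HTTP status code to a phase.
// Assumption: 401 and 403 cannot be fixed by retrying with the same token,
// so both fail fast; every other code stays on the retry path.
func phaseForFailureStatus(statusCode int) phase {
	switch statusCode {
	case http.StatusUnauthorized, http.StatusForbidden:
		return phaseFailed
	default:
		return phaseRetryable
	}
}
```

With this shape, an expired token (`phaseForFailureStatus(403)`) yields `phaseFailed` on the first attempt, while a 500 from the downstream service still yields `phaseRetryable`.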
@EngHabu thanks for the hint! looks good to me!