cdap icon indicating copy to clipboard operation
cdap copied to clipboard

[CDAP-19455] Transition runs throttled by RuntimeJobManager to "rejected" state instead of "failed"

Open arjan-bal opened this issue 3 years ago • 1 comments

CDAP 19455

When Dataproc submit job API is called, it returns a 429 HTTP code or RESOURCE_EXHAUSTED grpc code when their master agent isn't running on Dataproc master node. This transitions the pipeline to a failed state. Rejected state better represents rate limiting errors, so this PR aims to transition such failures to rejected state instead.

A new exception TooManyRequestsException is added to the runtime SPI. When runtime job managers throw this exception, pipeline is transitioned to rejected state by RemoteExecutionTwillRunnerService.

Added a function call to handle pipeline completion when pipeline transitions to rejected state. Previously transitions to rejected happened before reaching the starting state. In such cases call to handle pipeline completion is avoided.

Manually tested to ensure pipelines with unhealthy dataproc master node are rejected: Screenshot 2022-07-29 at 5 43 54 PM

Manually tested to ensure flow control still works as it used to. Screenshot 2022-07-29 at 5 45 34 PM

arjan-bal avatar Jul 29 '22 12:07 arjan-bal

gitpod-io[bot] avatar Jul 29 '22 12:07 gitpod-io[bot]