kyuubi
Fix AppState when Engine connection is terminated
:mag: Description
Issue References
This issue was observed several times: the batch state was set to ERROR, but the appState kept a non-terminal state (e.g. RUNNING) indefinitely, even after the application had finished (in this case, a YARN application).
{
"id": "********",
"user": "****",
"batchType": "SPARK",
"name": "*********",
"appStartTime": 0,
"appId": "********",
"appUrl": "********",
"appState": "RUNNING",
"appDiagnostic": "",
"kyuubiInstance": "*********",
"state": "ERROR",
"createTime": 1725343207318,
"endTime": 1725343300986,
"batchInfo": {}
}
This appears to happen when an intermittent failure occurs during the monitoring step: the batch ends with ERROR, but the application metadata is never updated. This can mislead users into thinking the application is still running. To avoid this, the appState should be set to UNKNOWN.
Describe Your Solution
This is a simple fix: during the batch metadata update, if the batch state is ERROR and the appState is not terminal, the appState is changed to UNKNOWN.
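A minimal Scala sketch of the check described above, assuming a simplified state model. The `AppState`/`BatchState` types, the `terminalAppStates` set and `reconcileAppState` are hypothetical names for illustration only; they are not the actual code in `BatchJobSubmission.scala`, and the exact set of terminal states is an assumption.

```scala
// Hypothetical, simplified model for illustration only; these are not Kyuubi's real classes.
object AppStateFix {
  sealed trait AppState
  case object Pending extends AppState
  case object Running extends AppState
  case object Finished extends AppState
  case object Killed extends AppState
  case object Failed extends AppState
  case object Unknown extends AppState

  // Assumption: the application states treated as terminal in this sketch.
  val terminalAppStates: Set[AppState] = Set(Finished, Killed, Failed)

  sealed trait BatchState
  case object BatchRunning extends BatchState
  case object BatchFinished extends BatchState
  case object BatchError extends BatchState

  /** Returns the appState to persist when the batch metadata is updated. */
  def reconcileAppState(batchState: BatchState, appState: AppState): AppState =
    // Only an ERROR batch with a non-terminal app state is rewritten to Unknown.
    if (batchState == BatchError && !terminalAppStates.contains(appState)) Unknown
    else appState

  def main(args: Array[String]): Unit = {
    println(reconcileAppState(BatchError, Running))   // Unknown
    println(reconcileAppState(BatchError, Finished))  // Finished
    println(reconcileAppState(BatchRunning, Running)) // Running
  }
}
```

Keeping the rule as a pure function of (batch state, app state) means it can be unit tested in isolation, independently of how the monitoring failure is triggered.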
Types of changes :bookmark:
- [x] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
Test Plan
Behavior Without This Pull Request :coffin:
If an error occurs while Kyuubi is requesting the application state (e.g. through the YARN client), the batch finishes in the ERROR state and the application keeps its last known state (e.g. RUNNING).
Behavior With This Pull Request :tada:
If an error occurs while Kyuubi is requesting the application state (e.g. through the YARN client), the batch finishes in the ERROR state, and if the application is still in a non-terminal state, its state is forced to UNKNOWN.
Related Unit Tests
I tried to implement a unit test that replicates this behavior but did not succeed. It requires forcing an exception in the engine request (e.g. YarnClient.getApplication), but only after the application has reached the RUNNING state; alternatively, the connection between Kyuubi and the engine could be blocked.
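One possible way around the timing problem, sketched in Scala under assumptions: instead of a real YARN client, a scripted stub returns RUNNING for a fixed number of polls and then throws, simulating a dropped connection after the application is already running. `ScriptedAppClient`, `monitor`, `BatchResult` and the `applyFix` flag are hypothetical names; how such a stub would be wired into Kyuubi's actual application-operation layer is not shown.

```scala
import java.io.IOException

// Hypothetical test double and monitor loop; names are illustrative, not Kyuubi's API.
object FlakyMonitor {
  sealed trait AppState
  case object Running extends AppState
  case object Finished extends AppState
  case object Unknown extends AppState

  // Assumption: the only terminal state modelled in this sketch.
  val terminal: Set[AppState] = Set(Finished)

  /** Returns the scripted states in order, then fails as if the connection dropped. */
  final class ScriptedAppClient(responses: Seq[AppState]) {
    private val it = responses.iterator
    def getAppState(appId: String): AppState =
      if (it.hasNext) it.next()
      else throw new IOException(s"simulated connection failure for $appId")
  }

  final case class BatchResult(batchState: String, appState: AppState)

  /** Polls until a terminal app state is seen or the client throws. */
  def monitor(client: ScriptedAppClient, appId: String, applyFix: Boolean): BatchResult = {
    var last: AppState = Unknown
    try {
      while (!terminal.contains(last)) last = client.getAppState(appId)
      BatchResult("FINISHED", last)
    } catch {
      case _: IOException =>
        // Without the fix the stale last-known state (e.g. Running) is kept;
        // with it, a non-terminal state is replaced by Unknown.
        val appState = if (applyFix && !terminal.contains(last)) Unknown else last
        BatchResult("ERROR", appState)
    }
  }

  def main(args: Array[String]): Unit = {
    val script = Seq(Running, Running) // app reaches RUNNING, then the client fails
    println(monitor(new ScriptedAppClient(script), "app-1", applyFix = false)) // BatchResult(ERROR,Running)
    println(monitor(new ScriptedAppClient(script), "app-1", applyFix = true))  // BatchResult(ERROR,Unknown)
  }
}
```

Scripting the responses makes the "RUNNING first, then fail" ordering deterministic, so the test does not need to wait on a real application.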
Checklist
- [ ] This patch was not authored or co-authored using Generative Tooling
Be nice. Be informative.
@joaopamaral please fix the code style (simply run dev/reformat if you are using Linux or macOS).
reopen to retest
Codecov Report
Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.
Project coverage is 0.00%. Comparing base (2d64255) to head (8409eac). Report is 58 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...g/apache/kyuubi/operation/BatchJobSubmission.scala | 0.00% | 4 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #6722 +/- ##
=======================================
Coverage 0.00% 0.00%
=======================================
Files 684 687 +3
Lines 42282 42445 +163
Branches 5767 5793 +26
=======================================
- Misses 42282 42445 +163
thanks, merged to master/1.10.1/1.9.3