[FEATURE] Deployment - Session Upgrade
Requirements:
- The EMR-S application associated with a session can be upgraded.
- The upgrade should not impact running or waiting interactive queries.
- The upgrade should not impact streaming queries.
Opt-1: Client manages the EMR-S job during BG upgrade
- The new job starts with the same sessionId as the old job. The new Spark job updates its state to RUNNING, records the new jobId and new appId, and keeps updating the heartbeat. It finds its statements with the query:
  select * from request_index where sessionId="sessionId" and statementState = "waiting" and appId = "newAppId" and jobId = "newJobId" order by submitTime
- The old job keeps running until it finishes all of its tasks. It finds its statements with the query:
  select * from request_index where sessionId="sessionId" and statementState = "waiting" and appId = "oldAppId" and jobId = "oldJobId" order by submitTime
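The two queries above partition the waiting statements of one session between the old and new jobs by appId and jobId. A minimal sketch of that filter, assuming an in-memory list of dicts stands in for request_index; the `state` key is a stand-in for the statement-state field in the queries, and the sample rows are made up for illustration:

```python
from typing import Dict, List


def waiting_statements(request_index: List[Dict], session_id: str,
                       app_id: str, job_id: str) -> List[Dict]:
    """Return the waiting statements owned by one job, oldest first."""
    matches = [
        row for row in request_index
        if row["sessionId"] == session_id
        and row["state"] == "waiting"
        and row["appId"] == app_id
        and row["jobId"] == job_id
    ]
    # Equivalent of "order by submitTime".
    return sorted(matches, key=lambda row: row["submitTime"])


# Hypothetical rows: two waiting statements on the old job, one on the new
# job, and one that is already running and must be skipped.
request_index = [
    {"id": "s1", "sessionId": "sess", "state": "waiting",
     "appId": "oldApp", "jobId": "oldJob", "submitTime": 2},
    {"id": "s2", "sessionId": "sess", "state": "waiting",
     "appId": "newApp", "jobId": "newJob", "submitTime": 3},
    {"id": "s3", "sessionId": "sess", "state": "waiting",
     "appId": "oldApp", "jobId": "oldJob", "submitTime": 1},
    {"id": "s4", "sessionId": "sess", "state": "running",
     "appId": "oldApp", "jobId": "oldJob", "submitTime": 0},
]

old_ids = [r["id"] for r in waiting_statements(request_index, "sess", "oldApp", "oldJob")]
new_ids = [r["id"] for r in waiting_statements(request_index, "sess", "newApp", "newJob")]
```

Under this scheme the client has to stamp each statement with the appId and jobId of the job that should run it, which is why Opt-1 puts the job management on the client side.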
Opt-2: Plugin manages the EMR-S job during BG upgrade
- Blue-green (BG) deployment: create newApp.
- Update the setting with newAppId and BG=true.
- The DP-Deployment monitor is scheduled to run every 30 mins and detects BG=true and newAppId.
For each session in Running-Sessions:
If it is a streaming session:
- The old job stops, and the Spark job updates its state to DEAD.
- The SQL in the old job is
  CREATE SKIPPING INDEX on mys3.default.http_logs
- The new job starts, and the Spark job updates its state to RUNNING and keeps updating the heartbeat.
- The SQL in the new job is
  RECOVER INDEX JOB flintJobName
If it is an interactive session:
- Start newJob with the same sessionId as oldJob. The new job checks whether session.BG=true and myJobId == newJobId; if so, it continues running.
- Set session.BG=true and session.jobId = newJobId.
- When the old job finishes its existing work, it checks whether session.BG=true and session.jobId != myJobId; if so, oldJob exits.
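The Opt-2 flow above can be sketched as one monitor pass plus the exit check that each interactive job runs. This is only an illustration under assumptions: sessions and settings are plain dicts, the `kind` key, the `start_job` callback, and the action log are hypothetical, and no real EMR-S or plugin API is called. What a new job should do before the session record is updated is not specified above; the sketch simply lets it keep running.

```python
from typing import Callable, Dict, List


def should_exit(session: Dict, my_job_id: str) -> bool:
    """Interactive job check: exit once the session belongs to another job."""
    return bool(session.get("BG")) and session.get("jobId") != my_job_id


def monitor_pass(settings: Dict, sessions: List[Dict],
                 start_job: Callable[[str, str], str]) -> List[str]:
    """One scheduled monitor run; returns a log of the actions taken."""
    log: List[str] = []
    if not settings.get("BG"):
        return log  # no upgrade in progress
    new_app_id = settings["newAppId"]
    for session in sessions:
        if session["state"] != "RUNNING":
            continue
        old_job_id = session["jobId"]
        # The new job reuses the session's sessionId.
        new_job_id = start_job(new_app_id, session["sessionId"])
        if session["kind"] == "streaming":
            # The old job is stopped; the new job recovers the index job.
            log.append(f"stop {old_job_id}")
            log.append(f"{new_job_id}: RECOVER INDEX JOB {session['flintJobName']}")
        else:
            # The old job keeps draining and exits later via should_exit().
            session["BG"] = True
            log.append(f"handoff {old_job_id} -> {new_job_id}")
        session["appId"] = new_app_id
        session["jobId"] = new_job_id
    return log


settings = {"BG": True, "newAppId": "newApp"}
sessions = [
    {"sessionId": "s-stream", "kind": "streaming", "state": "RUNNING",
     "appId": "oldApp", "jobId": "job-1", "flintJobName": "flintJobName"},
    {"sessionId": "s-inter", "kind": "interactive", "state": "RUNNING",
     "appId": "oldApp", "jobId": "job-2"},
]
# Hypothetical job starter that derives a job id from its arguments.
log = monitor_pass(settings, sessions, lambda app_id, sid: f"{app_id}/{sid}")
```

After this pass, `should_exit(sessions[1], "job-2")` is true for the old interactive job and false for the new one, so only the old job leaves once its existing work is done.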
Use cases
- case-1: the old job fails while processing a query.
  - opt-1: SessionStateMonitor should not retry the failed job.
  - opt-2: oldJob detects BG=true and does not update sessionState.
- case-2: the old job fails when CP force-closes the oldApp.
  - opt-1: SessionStateMonitor should not retry the failed job.
  - opt-2: oldJob detects BG=true and does not update sessionState.
- case-3: too many tasks on the old job.
  - opt-1: How long should the client wait for oldJob to finish?
  - opt-2: At most 10 mins, after which the session times out.
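For case-3, one possible answer to the opt-1 question is a bounded wait. The sketch below assumes the 10-minute limit from opt-2 is reused as the bound; the `is_finished` callback, the poll interval, and the injectable clock are hypothetical and exist only to make the loop testable.

```python
import time
from typing import Callable


def wait_for_old_job(is_finished: Callable[[], bool],
                     timeout_s: float = 600.0,
                     poll_s: float = 5.0,
                     now: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the old job finishes or the timeout elapses.

    Returns True if the job finished in time, False on timeout.
    """
    deadline = now() + timeout_s
    while True:
        if is_finished():
            return True
        if now() >= deadline:
            return False
        sleep(poll_s)
```

A caller that gets False back would treat the old job as timed out, which matches the opt-2 behavior where the session times out after at most 10 mins.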