flink-on-k8s-operator
Multiple job-submitters spawn after a redeploy of flink job
When deploying an update to a Flink job, I am seeing the operator spawn multiple job-submitter pods, as shown below:
flink-fastlane-streaming-job-submitter--1-4pgnx 0/2 Terminating 0 4s
flink-fastlane-streaming-job-submitter--1-6d896 0/2 Terminating 0 7s
flink-fastlane-streaming-job-submitter--1-78c75 0/2 Pending 0 1s
flink-fastlane-streaming-job-submitter--1-8hcwd 0/2 Terminating 0 2s
flink-fastlane-streaming-job-submitter--1-8qpqb 0/2 Terminating 0 4s
flink-fastlane-streaming-job-submitter--1-bzn7c 0/2 Terminating 0 3s
flink-fastlane-streaming-job-submitter--1-fsgqf 0/2 Terminating 0 5s
flink-fastlane-streaming-job-submitter--1-g4gh9 0/2 Terminating 0 4s
flink-fastlane-streaming-job-submitter--1-r627n 0/2 Terminating 0 5s
flink-fastlane-streaming-job-submitter--1-tkpng 0/2 Terminating 0 3s
flink-fastlane-streaming-job-submitter--1-tmfpp 0/2 Terminating 0 2s
flink-fastlane-streaming-job-submitter--1-v6km5 0/2 Terminating 0 4s
flink-fastlane-streaming-job-submitter--1-v8q8j 0/2 Terminating 0 15s
flink-fastlane-streaming-job-submitter--1-vc9vl 0/2 Terminating 0 2s
flink-fastlane-streaming-job-submitter--1-wmdvd 0/2 Terminating 0 2s
flink-fastlane-streaming-job-submitter--1-zt9nx 0/2 Terminating 0 3s
flink-fastlane-streaming-jobmanager-0 2/2 Running 0 20m
flink-fastlane-streaming-taskmanager-0 2/2 Running 0 20m
flink-fastlane-streaming-taskmanager-1 2/2 Running 0 20m
The operator has the following log messages repeated:
"level":"info","ts":1652458163.5198195,"logger":"controllers.FlinkCluster","msg":"Observed Flink job savepoint status","cluster":"rad-dev/flink-fastlane-streaming","status":null}
{"level":"info","ts":1652458163.5198462,"logger":"controllers.FlinkCluster","msg":"Observed persistent volume claim list","cluster":"rad-dev/flink-fastlane-streaming","state":0}
{"level":"info","ts":1652458164.2241309,"logger":"controllers.FlinkCluster","msg":"Failed to get Flink job exceptions.","cluster":"rad-dev/flink-fastlane-streaming","error":"Get \"http://flink-fastlane-streaming-jobmanager.rad-dev.svc.cluster.local:8081/jobs//exceptions\": 400 Bad Request"}
{"level":"info","ts":1652458164.2248478,"logger":"controllers.FlinkCluster","msg":"---------- 2. Update cluster status ----------","cluster":"rad-dev/flink-fastlane-streaming"}
{"level":"info","ts":1652458164.2248688,"logger":"controllers.FlinkCluster","msg":"No status change","cluster":"rad-dev/flink-fastlane-streaming","state":"Updating"}
{"level":"info","ts":1652458164.224872,"logger":"controllers.FlinkCluster","msg":"---------- 3. Compute the desired state ----------","cluster":"rad-dev/flink-fastlane-streaming"}
{"level":"info","ts":1652457717.1193507,"logger":"controllers.FlinkCluster","msg":"---------- 4. Take actions ----------","cluster":"rad-dev/flink-fastlane-streaming"}
{"level":"info","ts":1652457717.1193545,"logger":"controllers.FlinkCluster","msg":"The cluster update is in progress","cluster":"rad-dev/flink-fastlane-streaming"}
{"level":"info","ts":1652457717.119358,"logger":"controllers.FlinkCluster","msg":"ConfigMap already exists, no action","cluster":"rad-dev/flink-fastlane-streaming"}
{"level":"info","ts":1652457717.1193655,"logger":"controllers.FlinkCluster","msg":"StatefulSet already exists, no action","cluster":"rad-dev/flink-fastlane-streaming","component":"JobManager"}
{"level":"info","ts":1652457717.1193697,"logger":"controllers.FlinkCluster","msg":"JobManager service already exists, no action","cluster":"rad-dev/flink-fastlane-streaming"}
{"level":"info","ts":1652457717.1193748,"logger":"controllers.FlinkCluster","msg":"StatefulSet already exists, no action","cluster":"rad-dev/flink-fastlane-streaming","component":"TaskManager"}
{"level":"info","ts":1652457717.1193848,"logger":"controllers.FlinkCluster","msg":"Deploying Flink job","cluster":"rad-dev/flink-fastlane-streaming"}
{"level":"info","ts":1652457717.1193864,"logger":"controllers.FlinkCluster","msg":"Updating job status to proceed creating new job submitter","cluster":"rad-dev/flink-fastlane-streaming"}
{"level":"info","ts":1652457717.1356893,"logger":"controllers.FlinkCluster","msg":"Succeeded to update job status for new job submitter.","cluster":"rad-dev/flink-fastlane-streaming","job status":{"name":"com.wn.rad.fastlane.streaming.job.fastlanestreamingjob","submitterName":"flink-fastlane-streaming-job-submitter","state":"Updating","fromSavepoint":"s3://my-flink-state-bucket/flink/savepoints/rad-dev/flink-fastlane-streaming/savepoint-xxxxxx-yyyyyyyyyyyy","savepointGeneration":1,"savepointLocation":"s3://my-flink-state-bucket/flink/savepoints/rad-dev/flink-fastlane-streaming/savepoint-xxxxxx-yyyyyyyyyyyy","savepointTime":"2022-05-13T15:32:41Z","finalSavepoint":true,"deployTime":"2022-05-13T16:01:57Z"}}
{"level":"info","ts":1652457717.13571,"logger":"controllers.FlinkCluster","msg":"Found old job submitter","cluster":"rad-dev/flink-fastlane-streaming"}
{"level":"info","ts":1652457717.1357136,"logger":"controllers.FlinkCluster","msg":"Deleting job submitter","cluster":"rad-dev/flink-fastlane-streaming","job":{"apiVersion":"batch/v1","kind":"Job","namespace":"rad-dev","name":"flink-fastlane-streaming-job-submitter"}}
{"level":"info","ts":1652457717.1423402,"logger":"controllers.FlinkCluster","msg":"Job submitter deleted","cluster":"rad-dev/flink-fastlane-streaming"}
{"level":"info","ts":1652457717.1423507,"logger":"controllers.FlinkCluster","msg":"Creating job submitter","cluster":"rad-dev/flink-fastlane-streaming","resource":{"metadata":{"name":"flink-fastlane-streaming-job-submitter","namespace":"rad-dev","creationTimestamp":null,"labels":{"app":"flink","cluster":"flink-fastlane-streaming","flinkoperator.k8s.io/revision-name":"flink-fastlane-streaming-7df6448bf8"},"ownerReferences":[{"apiVersion":"flinkoperator.k8s.io/v1beta1","kind":"FlinkCluster","name":"flink-fastlane-streaming","uid":"f5d9774b-a238-41d5-aa82-81953a4251a5","controller":true,"blockOwnerDeletion":false}]},"spec":{"backoffLimit":0,"template":{"metadata":{"creationTimestamp":null,"labels":{"app":"flink","cluster":"flink-fastlane-streaming","flinkoperator.k8s.io/revision-name":"flink-fastlane-streaming-7df6448bf8"},"annotations":{...}},"spec":{...},"status":{}}}
{"level":"info","ts":1652457717.1604686,"logger":"controllers.FlinkCluster","msg":"Job submitter created","cluster":"rad-dev/flink-fastlane-streaming"}
{"level":"info","ts":1652457717.1604812,"logger":"controllers.FlinkCluster","msg":"Requeue reconcile request","cluster":"rad-dev/flink-fastlane-streaming","after":10}
Here's the status of the FlinkCluster in question:
status:
  components:
    configMap:
      name: flink-fastlane-streaming-configmap
      state: Ready
    job:
      deployTime: "2022-05-13T16:04:56Z"
      finalSavepoint: true
      fromSavepoint: s3://my-flink-state-bucket/flink/savepoints/rad-dev/flink-fastlane-streaming/savepoint-xxxxxx-yyyyyyyyyyyy
      name: com.wn.rad.fastlane.streaming.job.fastlanestreamingjob
      savepointGeneration: 1
      savepointLocation: s3://my-flink-state-bucket/flink/savepoints/rad-dev/flink-fastlane-streaming/savepoint-xxxxxx-yyyyyyyyyyyy
      savepointTime: "2022-05-13T15:32:41Z"
      state: Updating
      submitterName: flink-fastlane-streaming-job-submitter
    jobManagerService:
      name: flink-fastlane-streaming-jobmanager
      state: Ready
    jobManagerStatefulSet:
      name: flink-fastlane-streaming-jobmanager
      state: Ready
    taskManagerStatefulSet:
      name: flink-fastlane-streaming-taskmanager
      state: Ready
  lastUpdateTime: "2022-05-13T16:04:56Z"
  revision:
    currentRevision: flink-fastlane-streaming-5554ff8749-1
    nextRevision: flink-fastlane-streaming-7df6448bf8-2
  savepoint:
    jobID: b72f5c1fde28b6d95e0507f694c87ca3
    requestTime: "2022-05-13T15:32:36Z"
    state: Succeeded
    triggerID: 2274c53b816302e16702324e5d1557c0
    triggerReason: update
    triggerTime: "2022-05-13T15:32:36Z"
  state: Updating
It would appear that this message is indicative of the problem; I'm currently experimenting around that area:
{"level":"info","ts":1652458164.2241309,"logger":"controllers.FlinkCluster","msg":"Failed to get Flink job exceptions.","cluster":"rad-dev/flink-fastlane-streaming","error":"Get \"http://flink-fastlane-streaming-jobmanager.rad-dev.svc.cluster.local:8081/jobs//exceptions\": 400 Bad Request"}
Another interesting note: I see messages like "requeueAfter":5, with subsequent log messages arriving only milliseconds later:
{"level":"info","ts":1652460380.8619297,"logger":"controllers.FlinkCluster","msg":"Wait status to be stable before taking further actions.","cluster":"rad-dev/flink-fastlane-streaming","requeueAfter":5}
{"level":"info","ts":1652460380.8626342,"logger":"controllers.FlinkCluster","msg":"============================================================","cluster":"rad-dev/flink-fastlane-streaming"}
{"level":"info","ts":1652460380.8626447,"logger":"controllers.FlinkCluster","msg":"---------- 1. Observe the current state ----------","cluster":"rad-dev/flink-fastlane-streaming"}
{"level":"info","ts":1652460380.8626738,"logger":"controllers.FlinkCluster","msg":"Observed cluster","cluster":"rad-dev/flink-fastlane-streaming","cluster":{"kind":"FlinkCluster","apiVersion":"flinkoperator.k8s.io/v1beta1","metadata":{"na
This might have been addressed by #401
Unfortunately, it seems that the log message was a red herring; I'm still experiencing the same issue with the #401 patch applied.
We still see this behavior after the fix in v4.0.1-beta1. @jaredstehler, do you see the same?
Yes, I still see the same behavior. Additionally, I am having issues using job control annotations on my FlinkCluster instances; they aren't taking effect.