flink-on-k8s-operator
Flink Operator Loses Job Manager Contact during EKS upgrade
We have the Spotify operator v0.4.2 deployed. We have also made our Flink pipelines rack-aware, meaning each Flink cluster is deployed in only one AZ, mainly to reduce inter-AZ data-transfer costs. Although this has helped us reduce data-transfer costs, HA doesn't seem to work at all!
When an EKS node group upgrade is in progress and a node that hosted one or more job managers (for different clusters) goes down, the Flink operator is not even aware of it. The job manager goes down and the entire cluster goes down with it.
Can someone help us understand this? I'm unable to provide any logs because the operator seems to think the job manager is running normally, and no error is logged anywhere. All the job manager and task manager logs are gone too.
Hi @guruguha, are you sure you have set all the high-availability-related Flink properties?
Also, the serviceAccount that the FlinkCluster runs with should have permission to create and edit ConfigMaps in the corresponding namespace. You can verify that you have all the right properties set and that the service account has all the roles it needs by checking whether a ConfigMap named `<YOUR_FLINK_CLUSTER>-cluster-config-map` exists in the same namespace as the FlinkCluster when it starts.
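For reference, here is a minimal RBAC sketch for those ConfigMap permissions. The namespace `flink`, the role name, and the service account `flink-service-account` are placeholders for illustration; adjust them to your setup.

```yaml
# Placeholder names throughout; grants the FlinkCluster's service account
# the ConfigMap permissions the Kubernetes HA services need.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: flink-ha-configmaps
  namespace: flink
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: flink-ha-configmaps
  namespace: flink
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: flink-ha-configmaps
subjects:
  - kind: ServiceAccount
    name: flink-service-account
    namespace: flink
```

A quick way to verify the wiring is `kubectl -n <namespace> get configmap <YOUR_FLINK_CLUSTER>-cluster-config-map` once the cluster is up.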
@live-wire thanks for responding! Yes, we have enabled HA for all our Flink clusters. All of them are Application clusters that require a job submitter to run/get deployed. We also see the ConfigMaps created in the k8s namespace.
@guruguha would you be able to try this version https://github.com/spotify/flink-on-k8s-operator/releases/tag/v0.5.1-alpha.2? We recently addressed issues related to HA that might be impacting you.
@regadas We recently upgraded our operator to the v0.5.0 release. The issue still persists. Let me check with the v0.5.1-alpha.2 tag.
@regadas We tried with the above release - our application-specific ConfigMap got deleted and all the pods were terminated during a node roll on EKS.
Hey @guruguha, the ConfigMap is what allows the job to recover; it should not be deleted. Can you try deleting the job manager pod of a running job and see if it recovers? Also, please share your FlinkCluster YAML so we can try to reproduce this scenario for you.
@live-wire Thanks for responding. We have a brief write-up of this issue here: https://github.com/spotify/flink-on-k8s-operator/discussions/690
Also, it would be great if you could share the ideal / must-have HA configs for a Flink cluster so we can compare them with what we have right now.
At a high level, these are our HA configs:
```yaml
flinkProperties:
  high-availability.storageDir: s3://PATH_TO_S3_DIR/
  s3.iam-role: ROLE_TO_ACCESS_S3
```
Other configs:
```yaml
job:
  autoSavepointSeconds: 3600
  savepointsDir: dummy_path
  restartPolicy: FromSavepointOnFailure
  maxStateAgeToRestoreSeconds: 21600
  takeSavepointOnUpdate: true
```
Hey @guruguha These are the required HA flinkProperties for Flink 1.16: link (the exact set depends on the Flink version):
```yaml
kubernetes.cluster-id: <cluster-id>
high-availability: kubernetes
high-availability.storageDir: hdfs:///flink/recovery
```
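In a FlinkCluster spec these go under `flinkProperties`. A sketch adapted to the S3 setup from earlier in this thread (the cluster-id and the S3 placeholders are illustrative, not tested values):

```yaml
flinkProperties:
  # Kubernetes HA services, as in the Flink 1.16 docs referenced above
  kubernetes.cluster-id: <cluster-id>
  high-availability: kubernetes
  # S3 recovery dir from this thread instead of the HDFS example
  high-availability.storageDir: s3://PATH_TO_S3_DIR/
  s3.iam-role: ROLE_TO_ACCESS_S3
```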
Also, there should be no job-submitter pod when you use Application mode. I noticed you mentioned you need a job submitter? Can you share your FlinkCluster YAML?
@live-wire I might have confused Application mode with per-job mode. I'll share the FlinkCluster YAML shortly.
We do have this in our HA settings:
```yaml
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
kubernetes.cluster-id: flink-ingestion-std-ha
```
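For comparison, here is a rough sketch of how the pieces discussed in this thread could fit together in one FlinkCluster manifest. The field layout follows the operator's FlinkCluster CRD as I understand it and should be checked against the operator's sample manifests for your version; the image, paths, role, and replica values are placeholders, and application-mode specifics are omitted.

```yaml
apiVersion: flinkoperator.k8s.io/v1beta1
kind: FlinkCluster
metadata:
  name: flink-ingestion-std-ha
  namespace: flink                            # placeholder namespace
spec:
  image:
    name: flink:1.15                          # placeholder Flink image
  serviceAccountName: flink-service-account   # must be allowed to manage ConfigMaps
  jobManager:
    accessScope: Cluster
  taskManager:
    replicas: 2
  job:
    jarFile: local:///opt/flink/usrlib/pipeline.jar   # placeholder job jar
    autoSavepointSeconds: 3600
    savepointsDir: s3://PATH_TO_S3_DIR/savepoints/    # placeholder path
    restartPolicy: FromSavepointOnFailure
    maxStateAgeToRestoreSeconds: 21600
    takeSavepointOnUpdate: true
  flinkProperties:
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    kubernetes.cluster-id: flink-ingestion-std-ha
    high-availability.storageDir: s3://PATH_TO_S3_DIR/
    s3.iam-role: ROLE_TO_ACCESS_S3
```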