
Flink Operator Loses Job Manager Contact during EKS upgrade

Open guruguha opened this issue 1 year ago • 10 comments

We have the Spotify operator v0.4.2 deployed. Our Flink pipelines are also rack-aware, meaning each Flink cluster is deployed in only one AZ, mainly to reduce inter-AZ data-transfer costs. Although this helped us reduce data-transfer costs, HA doesn't seem to work at all!

When an EKS node group upgrade is in progress and a node that hosts one or more job managers (for different clusters) goes down, the Flink operator is not even aware of it. The job manager goes down and the entire cluster goes down with it.

Can someone help us understand this? I'm unable to provide any logs, as the operator seems to think the job manager is running normally and no error is logged anywhere. All the job manager and task manager logs are gone too.

guruguha avatar May 11 '23 02:05 guruguha

Hi @guruguha, are you sure you set all the high-availability-related Flink properties?

Also, the serviceAccount the FlinkCluster runs with should have permissions to create and edit ConfigMaps in the corresponding namespace. You can verify that you have all the right properties set and that the service account has all the roles it needs by checking whether a ConfigMap named <YOUR_FLINK_CLUSTER>-cluster-config-map exists in the same namespace as the FlinkCluster when it starts.
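For example, something along these lines should show whether the ConfigMap exists and whether the service account has the rights it needs (namespace, cluster, and service account names below are placeholders for your setup):

# check that the HA ConfigMap was created next to the FlinkCluster
kubectl get configmap <YOUR_FLINK_CLUSTER>-cluster-config-map -n <NAMESPACE>

# check that the service account is allowed to create and edit ConfigMaps there
kubectl auth can-i create configmaps -n <NAMESPACE> --as=system:serviceaccount:<NAMESPACE>:<SERVICE_ACCOUNT>
kubectl auth can-i update configmaps -n <NAMESPACE> --as=system:serviceaccount:<NAMESPACE>:<SERVICE_ACCOUNT>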

live-wire avatar May 23 '23 18:05 live-wire

@live-wire thanks for responding! Yes, we have enabled HA for all our Flink clusters. All of them are application clusters that require a job submitter to run/get deployed. We also see the ConfigMaps created in the k8s namespace.

guruguha avatar May 23 '23 21:05 guruguha

@guruguha would you be able to try this version https://github.com/spotify/flink-on-k8s-operator/releases/tag/v0.5.1-alpha.2? We recently addressed issues related to HA that might be impacting you.

regadas avatar May 23 '23 23:05 regadas

@regadas We recently upgraded our operator to the v0.5.0 release. The issue still persists. Let me check with the v0.5.1-alpha.2 tag.

guruguha avatar May 23 '23 23:05 guruguha

@regadas We tried with the above release: our application-specific ConfigMap got deleted and all the pods were terminated during a node roll on EKS.

guruguha avatar May 24 '23 22:05 guruguha

Hey @guruguha The ConfigMap is what enables the job to recover, so it shouldn't be deleted. Can you try deleting the job manager pod for a running job and see if it recovers? Also, please share your FlinkCluster yaml so we can try to reproduce this scenario for you.
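Something like this should simulate the failure (the label selector and names are a guess at your setup; adjust as needed):

# find the job manager pod of a running cluster and delete it
kubectl get pods -n <NAMESPACE> -l component=jobmanager
kubectl delete pod <JOBMANAGER_POD> -n <NAMESPACE>

# watch whether the operator recreates it and the job resumes from the HA metadata
kubectl get pods -n <NAMESPACE> -w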

live-wire avatar May 30 '23 11:05 live-wire

@live-wire Thanks for responding. We have a brief write-up on this issue here: https://github.com/spotify/flink-on-k8s-operator/discussions/690 Also, it would be great if you could share the ideal / must-have HA configs for a Flink cluster so we can compare them with what we have right now.

At a high level, these are our HA configs:

flinkProperties:
    high-availability.storageDir: s3://PATH_TO_S3_DIR/
    s3.iam-role: ROLE_TO_ACCESS_S3

Other configs:

  job:
    autoSavepointSeconds: 3600
    savepointsDir: dummy_path
    restartPolicy: FromSavepointOnFailure
    maxStateAgeToRestoreSeconds: 21600
    takeSavepointOnUpdate: true

guruguha avatar May 30 '23 21:05 guruguha

Hey @guruguha These are the required HA flinkProperties for 1.16 (the exact set depends on the Flink version): link

kubernetes.cluster-id: <cluster-id>
high-availability: kubernetes
high-availability.storageDir: hdfs:///flink/recovery
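
Combined with what you shared above, your flinkProperties would look roughly like this (values are placeholders, with the storageDir pointing at your durable S3 path instead of HDFS):

flinkProperties:
    kubernetes.cluster-id: <YOUR_CLUSTER_ID>
    high-availability: kubernetes
    high-availability.storageDir: s3://PATH_TO_S3_DIR/
    s3.iam-role: ROLE_TO_ACCESS_S3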

live-wire avatar May 31 '23 15:05 live-wire

Also, there should be no job-submitter pod when you use Application mode. I noticed you mentioned you need a jobSubmitter? Can you share your FlinkCluster yaml?

live-wire avatar May 31 '23 15:05 live-wire

@live-wire I might have confused application mode with per-job mode. I'll share the FlinkCluster yaml shortly.

We do have this in our HA settings:

    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    kubernetes.cluster-id: flink-ingestion-std-ha

guruguha avatar May 31 '23 15:05 guruguha