flink-kubernetes-operator
flink-kubernetes-operator copied to clipboard
[FLINK-29110] Use StatefulSet instead of Deployment to deploy JM and …
What is the purpose of the change
This pull request make an improvement for standalone mode to use StatefulSet instead of Deployment to deploy JM and TM to support mount a dynamically-created PersistentVolumeClaim.
Brief change log
- Use StatefulSet to instead of Deployment to depoly JM and TM
- Introduce StandaloneKubernetesJobManagerSpecification to encapsulate JM specification info
- Add volumeClaimTemplates for JobManagerSpec and TaskManagerSpec
- Add JOB_MANAGER_PVC_TEMPLATE and TASK_MANAGER_PVC_TEMPLATE ConfigOption to set jobmanager and taskmanager pvc template file separately.
- Modify obersving logical for JM and TM StatefulSet.
- Add statefulset to rbac.
- Modify docs/content/docs/custom-resource/reference.md to describe volumeClaimTemplates settings.
- Add standalone cluster example with pvc settings.
Verifying this change
This change is already covered by existing tests, such as .
- Modified Fabric8FlinkStandaloneKubeClientTest to adapt StatefulSet creating test.
- Modified KubernetesStandaloneClusterDescriptorTest to adapt StatefulSet to test standalone cluster deploying
- Modified StandaloneFlinkServiceTest to adapt StatefulSet to test StandaloneFlinkService to deploy standalone cluster.
- Modified StandaloneKubernetesJobManagerFactoryTest to adapt StatefulSet to test jobmanager spec.
- Modified StandaloneKubernetesTaskManagerFactoryTest to adapt StatefulSet to test taskmanager spec
This change added tests and can be verified as follows:
- Added TestUtils.buildStandaloneSessionCluster() and TestUtils.buildStandaloneApplicationCluster() for standalone cluster tests.
- Added SessionObserverTest.observeStandaloneSessionCluster to verify standalone session cluster observing logical
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): ( no)
- The public API, i.e., is any changes to the
CustomResourceDescriptors
: (yes ) - Core observer or reconciler logic that is regularly executed: (yes )
Documentation
- Does this pull request introduce a new feature? (yes )
- If yes, how is the feature documented? (docs )
@gyfora @wangyang0918 Please help me review this PR.
Thank you for the PR @Grypse we will review this with @usamj
My main concern with the Jobmanager StatefulSet change is that it introduces considerable complexity and branching in the core observer logic with very little benefit. I don't think PVCs are very important for JMs.
I would be in favor of keeping the Deployments for the JM component to be aligned with the Native implementation.
It would be great to have input from the others also as this is an important decision with major code implications :) : @wangyang0918 @tweise @morhidi @usamj
I agree with @gyfora. and aligns with what I originally thought when I saw the change. I don't think there is enough benefit to the JM being a StatefulSet
My other concern is whether we should only use StatefulSets for the TM's or whether using StatefulSets/Deployment for them should be a configurable option. The reason for this is that recovery time for a TM failing would be longer if it used StatefulSets. It could also result in issues and user action to fix if a TM pod was stuck in a Terminating state which could be unnecessary pain for users if they aren't using PVC's with their TM's
@usamj Maybe we can set three mode for flinkdeployment CR:
- native
- standalone_deployment
- standalone_statefulset
two of these modes are available as standalone:
- standalone_deployment
- standalone_statefulset
And volumeClaimTemplates work only in standalone_statefulset
mode.
@gyfora I believe that configuring PVC for JM is necessary in some scenarios. As @usamj said,we can set a configurable option for standalone to choose Deployment or StatefulSet.
@usamj Maybe we can set three mode for flinkdeployment CR:
- native
- standalone_deployment
- standalone_statefulset
two of these modes are available as standalone:
- standalone_deployment
- standalone_statefulset
And volumeClaimTemplates work only in
standalone_statefulset
mode.
I think we should not complicate things even more :) If we want to make this configurable we could use a config option for it but maybe we dont need it.
@usamj would statefulset for TMs work for you? Or do you see any reasons why you would need to have this as a configurable option?
@Grypse I understand your concern regarding the JM. Let me try to address some of your points. Your biggest argument for using StatefulSets for JM is improved recovery time with a PVC.
In my experience JM recovery time is almost never an issue, and having PVC vs remote storage have almost no difference in speed for recovering checkpoint/ha metadata as they are not that large. Furthermore I think even with the current Deployment + podTemplate architecture you can configure a PVC if you really need to.
The Flink community also decided to use Deployment instead of StatefulSet for implementing the native integration and I don't remember people hitting any issues that (but of course I could be wrong here). Based on this I don't think we need to implement this right away.
I think we shouldn't downplay the cost of increased complexity in the operator. Instead of maintaining 2 parallel observer paths we could instead focus on making the Deployment observer logic more robust and better.
In any case if we cannot agree on this we should discuss further on the dev mailing list to include the community.
Hi Folks,
It would be really great to see what faster
recovery means here, through an example. In general this sounds a fair benefit but before choosing a more complicated path here it would be great to see some obvious gains.
Would this solution benefit reactive scaling for example? Would this mean that in case of a TM failure it could recover before the job is being restarted with a lower parallelism?
@Grypse do you have any comparison that would help us decide to embrace this feature?
Thanks, Matyas
@morhidi I think this would have no impact on reactive scaling. I think the actual benefit is the possibility to configure PVCs for TMs which could help with checkpointing/local recovery in case of TM failures.
But as @usamj pointed out, it might actually negatively impact the TM startup time itself.
@Grypse this seems like a fairly large architecture decision that we underestimated at first. We would like to kindly ask you to prepare your proposal in a FLIP format and post it on the DEV mailing list.
Please describe the following points in the FLIP:
-
What is the benefit of using StatefulSet for TaskManagers?
- Does this only affect local recovery? What is the performance gain compared to Deployments?
- What benefit do we have compared to the PVC available in the podTemplate? Seems like ReadWriteMany PVCs already provide the same benefits for Deployments
- What is the implication of TaskManager startup time? Deployments start pods faster as they dont need to wait for termination
- Should we make Deployment / StatefulSet configurable?
-
What is the benefit of using StatefulSet for JobManagers?
- What data would the jobmanager store in PVC? What is the performance gain?
- Does this only affect state handles? That is very small data.
- Why can't we do this already with the podTemplate?
Having a FLIP to address the @gyfora's comments will make the discussion more efficient. Look forward to it with some testing data. For example, how long does it take to mount a PVC or terminate a stateful pod?
@gyfora @wangyang0918 @morhidi @usamj Thanks for yours review. I will make a FLIP to describe the proposal.
@Grypse great, please post a link here to the FLIP after you posted it to the mailing list and the wiki for reference. Thanks.
@Grypse are you still working on this FLIP?
Hi @Grypse I am closing this PR due to inactivity, please feel free to re-open it once you have the FLIP for this feature up for discussion on the mailing list!