flink-kubernetes-operator icon indicating copy to clipboard operation
flink-kubernetes-operator copied to clipboard

[FLINK-29110] Use StatefulSet instead of Deployment to deploy JM and …

Open Grypse opened this issue 2 years ago • 14 comments

What is the purpose of the change

This pull request make an improvement for standalone mode to use StatefulSet instead of Deployment to deploy JM and TM to support mount a dynamically-created PersistentVolumeClaim.

Brief change log

  • Use StatefulSet to instead of Deployment to depoly JM and TM
  • Introduce StandaloneKubernetesJobManagerSpecification to encapsulate JM specification info
  • Add volumeClaimTemplates for JobManagerSpec and TaskManagerSpec
  • Add JOB_MANAGER_PVC_TEMPLATE and TASK_MANAGER_PVC_TEMPLATE ConfigOption to set jobmanager and taskmanager pvc template file separately.
  • Modify obersving logical for JM and TM StatefulSet.
  • Add statefulset to rbac.
  • Modify docs/content/docs/custom-resource/reference.md to describe volumeClaimTemplates settings.
  • Add standalone cluster example with pvc settings.

Verifying this change

This change is already covered by existing tests, such as .

  • Modified Fabric8FlinkStandaloneKubeClientTest to adapt StatefulSet creating test.
  • Modified KubernetesStandaloneClusterDescriptorTest to adapt StatefulSet to test standalone cluster deploying
  • Modified StandaloneFlinkServiceTest to adapt StatefulSet to test StandaloneFlinkService to deploy standalone cluster.
  • Modified StandaloneKubernetesJobManagerFactoryTest to adapt StatefulSet to test jobmanager spec.
  • Modified StandaloneKubernetesTaskManagerFactoryTest to adapt StatefulSet to test taskmanager spec

This change added tests and can be verified as follows:

  • Added TestUtils.buildStandaloneSessionCluster() and TestUtils.buildStandaloneApplicationCluster() for standalone cluster tests.
  • Added SessionObserverTest.observeStandaloneSessionCluster to verify standalone session cluster observing logical

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): ( no)
  • The public API, i.e., is any changes to the CustomResourceDescriptors: (yes )
  • Core observer or reconciler logic that is regularly executed: (yes )

Documentation

  • Does this pull request introduce a new feature? (yes )
  • If yes, how is the feature documented? (docs )

Grypse avatar Aug 29 '22 16:08 Grypse

@gyfora @wangyang0918 Please help me review this PR.

Grypse avatar Aug 30 '22 02:08 Grypse

Thank you for the PR @Grypse we will review this with @usamj

gyfora avatar Aug 30 '22 08:08 gyfora

My main concern with the Jobmanager StatefulSet change is that it introduces considerable complexity and branching in the core observer logic with very little benefit. I don't think PVCs are very important for JMs.

I would be in favor of keeping the Deployments for the JM component to be aligned with the Native implementation.

It would be great to have input from the others also as this is an important decision with major code implications :) : @wangyang0918 @tweise @morhidi @usamj

gyfora avatar Aug 30 '22 20:08 gyfora

I agree with @gyfora. and aligns with what I originally thought when I saw the change. I don't think there is enough benefit to the JM being a StatefulSet

My other concern is whether we should only use StatefulSets for the TM's or whether using StatefulSets/Deployment for them should be a configurable option. The reason for this is that recovery time for a TM failing would be longer if it used StatefulSets. It could also result in issues and user action to fix if a TM pod was stuck in a Terminating state which could be unnecessary pain for users if they aren't using PVC's with their TM's

usamj avatar Aug 30 '22 23:08 usamj

@usamj Maybe we can set three mode for flinkdeployment CR:

  1. native
  2. standalone_deployment
  3. standalone_statefulset

two of these modes are available as standalone:

  1. standalone_deployment
  2. standalone_statefulset

And volumeClaimTemplates work only in standalone_statefulset mode.

Grypse avatar Aug 31 '22 06:08 Grypse

@gyfora I believe that configuring PVC for JM is necessary in some scenarios. As @usamj said,we can set a configurable option for standalone to choose Deployment or StatefulSet.

Grypse avatar Aug 31 '22 06:08 Grypse

@usamj Maybe we can set three mode for flinkdeployment CR:

  1. native
  2. standalone_deployment
  3. standalone_statefulset

two of these modes are available as standalone:

  1. standalone_deployment
  2. standalone_statefulset

And volumeClaimTemplates work only in standalone_statefulset mode.

I think we should not complicate things even more :) If we want to make this configurable we could use a config option for it but maybe we dont need it.

@usamj would statefulset for TMs work for you? Or do you see any reasons why you would need to have this as a configurable option?

gyfora avatar Aug 31 '22 06:08 gyfora

@Grypse I understand your concern regarding the JM. Let me try to address some of your points. Your biggest argument for using StatefulSets for JM is improved recovery time with a PVC.

In my experience JM recovery time is almost never an issue, and having PVC vs remote storage have almost no difference in speed for recovering checkpoint/ha metadata as they are not that large. Furthermore I think even with the current Deployment + podTemplate architecture you can configure a PVC if you really need to.

The Flink community also decided to use Deployment instead of StatefulSet for implementing the native integration and I don't remember people hitting any issues that (but of course I could be wrong here). Based on this I don't think we need to implement this right away.

I think we shouldn't downplay the cost of increased complexity in the operator. Instead of maintaining 2 parallel observer paths we could instead focus on making the Deployment observer logic more robust and better.

In any case if we cannot agree on this we should discuss further on the dev mailing list to include the community.

gyfora avatar Aug 31 '22 06:08 gyfora

Hi Folks,

It would be really great to see what faster recovery means here, through an example. In general this sounds a fair benefit but before choosing a more complicated path here it would be great to see some obvious gains.

Would this solution benefit reactive scaling for example? Would this mean that in case of a TM failure it could recover before the job is being restarted with a lower parallelism?

@Grypse do you have any comparison that would help us decide to embrace this feature?

Thanks, Matyas

morhidi avatar Aug 31 '22 07:08 morhidi

@morhidi I think this would have no impact on reactive scaling. I think the actual benefit is the possibility to configure PVCs for TMs which could help with checkpointing/local recovery in case of TM failures.

But as @usamj pointed out, it might actually negatively impact the TM startup time itself.

gyfora avatar Aug 31 '22 07:08 gyfora

@Grypse this seems like a fairly large architecture decision that we underestimated at first. We would like to kindly ask you to prepare your proposal in a FLIP format and post it on the DEV mailing list.

Please describe the following points in the FLIP:

  • What is the benefit of using StatefulSet for TaskManagers?

    • Does this only affect local recovery? What is the performance gain compared to Deployments?
    • What benefit do we have compared to the PVC available in the podTemplate? Seems like ReadWriteMany PVCs already provide the same benefits for Deployments
    • What is the implication of TaskManager startup time? Deployments start pods faster as they dont need to wait for termination
    • Should we make Deployment / StatefulSet configurable?
  • What is the benefit of using StatefulSet for JobManagers?

    • What data would the jobmanager store in PVC? What is the performance gain?
    • Does this only affect state handles? That is very small data.
    • Why can't we do this already with the podTemplate?

gyfora avatar Aug 31 '22 08:08 gyfora

Having a FLIP to address the @gyfora's comments will make the discussion more efficient. Look forward to it with some testing data. For example, how long does it take to mount a PVC or terminate a stateful pod?

wangyang0918 avatar Aug 31 '22 12:08 wangyang0918

@gyfora @wangyang0918 @morhidi @usamj Thanks for yours review. I will make a FLIP to describe the proposal.

Grypse avatar Aug 31 '22 13:08 Grypse

@Grypse great, please post a link here to the FLIP after you posted it to the mailing list and the wiki for reference. Thanks.

mbalassi avatar Sep 13 '22 14:09 mbalassi

@Grypse are you still working on this FLIP?

gyfora avatar Oct 26 '22 12:10 gyfora

Hi @Grypse I am closing this PR due to inactivity, please feel free to re-open it once you have the FLIP for this feature up for discussion on the mailing list!

gyfora avatar Nov 09 '22 08:11 gyfora