
Suspend/Resume feature on SparkApplication

everpeace opened this pull request 10 months ago • 10 comments

Purpose of this PR

This PR implements Suspend/Resume operation on SparkApplication.

Proposed changes:

  • Add Suspend: true|false field in SparkApplicationSpec
  • State Transition:
    • Suspend: ** --Suspend: true--> SUSPENDING --> SUSPENDED (**=any non-Terminal Application State)
    • Resume: SUSPENDING | SUSPENDED --Suspend: false--> RESUMING --> SUBMITTED
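The proposed transitions above can be sketched as a small state function. This is an illustrative sketch only — the state names come from the list above, but the types and helper names below are hypothetical, not the actual controller code:

```go
package main

import "fmt"

// appState is an illustrative stand-in for the project's ApplicationStateType.
type appState string

const (
	stateSubmitted  appState = "SUBMITTED"
	stateSuspending appState = "SUSPENDING"
	stateSuspended  appState = "SUSPENDED"
	stateResuming   appState = "RESUMING"
	stateCompleted  appState = "COMPLETED" // terminal
	stateFailed     appState = "FAILED"    // terminal
)

func isTerminal(s appState) bool {
	return s == stateCompleted || s == stateFailed
}

// nextState sketches the proposed transition rules: any non-terminal state
// moves to SUSPENDING when suspend is true; SUSPENDING or SUSPENDED moves to
// RESUMING when suspend is false; everything else is left unchanged.
func nextState(current appState, suspend bool) appState {
	switch {
	case suspend && !isTerminal(current):
		return stateSuspending
	case !suspend && (current == stateSuspending || current == stateSuspended):
		return stateResuming
	default:
		return current
	}
}

func main() {
	fmt.Println(nextState(stateSubmitted, true))  // SUSPENDING
	fmt.Println(nextState(stateSuspended, false)) // RESUMING
	fmt.Println(nextState(stateCompleted, true))  // COMPLETED (terminal, unchanged)
}
```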

Fixes #2290

Change Category

  • [ ] Bugfix (non-breaking change which fixes an issue)
  • [x] Feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that could affect existing functionality)
  • [ ] Documentation update

Rationale

Supporting Suspend/Resume feature could make Kueue integration easier.

Checklist

  • [x] I have conducted a self-review of my own code.
  • [x] I have updated documentation accordingly.
  • [x] I have added tests that prove my changes are effective or that my feature works.
  • [x] Existing unit tests pass locally with my changes.

Additional Notes

everpeace avatar Jan 17 '25 18:01 everpeace

@everpeace Maybe you can resolve this conflict so that this PR can continue to move forward?

We are also thinking about using Kueue with the Spark operator, which depends on this suspend-state feature. It would be great if you could push this forward.

Kevinz857 avatar Apr 03 '25 03:04 Kevinz857

@Kevinz857 Hi, thanks for your comment!! I rebased my PR.

everpeace avatar Apr 09 '25 12:04 everpeace

We are also thinking about using Kueue with the Spark operator, which depends on this suspend-state feature. It would be great if you could push this forward.

My original SparkApplication integration PR in Kueue (https://github.com/kubernetes-sigs/kueue/pull/4032) was actually closed because:

  • Kueue already has a design doc (https://github.com/kubernetes-sigs/kueue/issues/1656) describing how to support Spark workloads (PodGroups), and
  • my original PR did not follow that design doc.

However, I already opened a PR for this which will be needed to support SparkApplication integration in Kueue: https://github.com/kubernetes-sigs/kueue/pull/4102.

everpeace avatar Apr 09 '25 12:04 everpeace

/assign @jacobsalway @vara-bonthu @ImpSy

ChenYi015 avatar Apr 10 '25 02:04 ChenYi015

@Kevinz857 Would you mind helping review the PR?

ChenYi015 avatar Apr 10 '25 02:04 ChenYi015

@Kevinz857 Would you mind helping review the PR?

@ChenYi015 Very happy, I will review this PR together in detail later.

Kevinz857 avatar Apr 25 '25 16:04 Kevinz857

@everpeace Suggestions for Improvement

  1. Controller logic abstraction: Consider extracting the suspension check (if app.Spec.Suspend != nil && *app.Spec.Suspend) into a helper function like isSuspended(app) for clarity and potential reuse. Also, using a named constant for the event string ("SparkApplicationSuspended") would improve consistency and readability.

  2. Pod termination check reusability: The logic that checks for Spark driver/executor pod termination before resuming could be encapsulated in a helper function (e.g. areSparkPodsTerminated(app)) to make the code more modular and easier to test. Optionally, adding retry limits or exponential backoff can help avoid indefinite retries in corner cases.

  3. Field documentation enhancement: In sparkapplication_types.go, the Suspend field comment could be more descriptive, such as:

// Suspend indicates whether the SparkApplication should be suspended.
// When true, the controller skips submitting the Spark job.

  4. Test enhancement ideas:

Add unit tests for the suspension logic to ensure the controller skips job submission when .spec.suspend is true. Consider validating whether the correct Kubernetes events (Suspended/Resumed) are emitted using a fake event recorder.
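The pointer-aware helper suggested in point 1 could be sketched as follows. This is a self-contained illustration assuming a `Suspend *bool` field; the stand-in types below are minimal, not the real ones from api/v1beta2/sparkapplication_types.go:

```go
package main

import "fmt"

// Minimal stand-in for the SparkApplication spec used only for this sketch.
type sparkApplicationSpec struct {
	Suspend *bool
}

type sparkApplication struct {
	Spec sparkApplicationSpec
}

// isSuspended is the suggested helper: a nil (unset) field and an explicit
// false both mean "not suspended".
func isSuspended(app *sparkApplication) bool {
	return app.Spec.Suspend != nil && *app.Spec.Suspend
}

func main() {
	suspend := true
	fmt.Println(isSuspended(&sparkApplication{}))                                        // false (unset)
	fmt.Println(isSuspended(&sparkApplication{Spec: sparkApplicationSpec{Suspend: &suspend}})) // true
}
```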

@ChenYi015 If you have time, we can take a look together to see whether these suggestions are suitable.

@everpeace Or if there are any better suggestions, we can discuss them together

Kevinz857 avatar Apr 25 '25 16:04 Kevinz857

@Kevinz857 Thanks for your review.

Consider extracting the suspension check (if app.Spec.Suspend != nil && *app.Spec.Suspend) into a helper function like isSuspended(app) for clarity and potential reuse.

hmmm. SparkApplicationSpec.Suspend is not a pointer. So, I can't see any motivation for it. https://github.com/kubeflow/spark-operator/blob/4d6dd1d2c6267e54e6518635b96e8919a49d813f/api/v1beta2/sparkapplication_types.go#L44

Perhaps, you saw ScheduledSparkApplication's suspension check: https://github.com/kubeflow/spark-operator/blob/4d6dd1d2c6267e54e6518635b96e8919a49d813f/internal/controller/scheduledsparkapplication/controller.go#L109-L112

Could you elaborate on your suggestion if I missed something?

Also, using a named constant for the event string ("SparkApplicationSuspended") would improve consistency and readability.

hmmm. I think I already did. https://github.com/kubeflow/spark-operator/blob/4d6dd1d2c6267e54e6518635b96e8919a49d813f/pkg/common/event.go#L33-L38

Could you elaborate on your suggestion if I missed something?

Field documentation enhancement In sparkapplication_types.go, the Suspend field comment could be more descriptive, such as:

Thanks. Fixed in b3f8e8c

Test enhancement ideas Add unit tests for the suspension logic to ensure the controller skips job submission when .spec.suspend is true.

I think it's already covered: https://github.com/kubeflow/spark-operator/blob/4d6dd1d2c6267e54e6518635b96e8919a49d813f/internal/controller/sparkapplication/controller_test.go#L579-L584

Could you elaborate on your suggestion if I missed something?

Consider validating whether the correct Kubernetes events (Suspended/Resumed) are emitted using a fake event recorder.

IIUC, the current SparkApplication unit tests (internal/controller/sparkapplication/controller_test.go) do not assert emitted events at all. Although I fully agree with your suggestion (we should assert emitted events in unit tests), I think the enhancement should be done in another PR so that all the unit tests assert emitted events. Or should this PR cover event assertions only for suspend/resume? If so, I would be happy to handle it.
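For reference, asserting emitted events might look roughly like this. client-go's record.FakeRecorder exposes recorded events on a channel; the tiny stand-in below mimics that behavior so the sketch is self-contained, and suspendApplication is a hypothetical controller step, not the actual code:

```go
package main

import "fmt"

// fakeRecorder mimics the channel-based behavior of client-go's
// record.FakeRecorder for this self-contained sketch.
type fakeRecorder struct {
	Events chan string
}

func newFakeRecorder(size int) *fakeRecorder {
	return &fakeRecorder{Events: make(chan string, size)}
}

// Eventf records a formatted event prefixed with its reason, like the
// real recorder does.
func (r *fakeRecorder) Eventf(reason, format string, args ...interface{}) {
	r.Events <- reason + " " + fmt.Sprintf(format, args...)
}

// suspendApplication is a hypothetical controller step that emits the
// suspension event.
func suspendApplication(rec *fakeRecorder, name string) {
	rec.Eventf("SparkApplicationSuspended", "SparkApplication %s was suspended", name)
}

func main() {
	rec := newFakeRecorder(10)
	suspendApplication(rec, "spark-pi")
	// A test would receive from the channel and assert on the event text.
	fmt.Println(<-rec.Events)
}
```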

everpeace avatar Apr 30 '25 07:04 everpeace

@everpeace Thanks for your contribution !

@ChenYi015 I think the current submission can be merged. This commit will not have a destructive impact on the current core functionality.

Kevinz857 avatar May 27 '25 07:05 Kevinz857

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 25 '25 08:08 github-actions[bot]

/remove-lifecycle stale

@vara-bonthu @ChenYi015 @nabuskey I think that would be a great feature in the Spark Operator APIs to natively integrate it with Kueue!

cc @kannon92 @tenzen-y @mimowo

andreyvelich avatar Sep 05 '25 00:09 andreyvelich

@vara-bonthu @ChenYi015 @nabuskey I think that would be a great feature in the Spark Operator APIs to natively integrate it with Kueue!

Awesome effort! We had multiple people asking for the integration, so I'm happy to see it rolling.

cc @kannon92 @tenzen-y @mimowo

Overall LGTM, let me comment here on the API itself, and I may also leave some small-ish comments to the code.

hmmm. SparkApplicationSpec.Suspend is not a pointer. So, I can't see any motivation for it.

I don't remember / wasn't involved with the decision when introducing the field to core k8s as a ptr.

I found this question from @soltysh in the k8s design, but there was no explicit answer: https://github.com/kubernetes/enhancements/pull/2234#discussion_r567014403, just the confirmation of the decision to use ptr in the comment.

Without knowing exact details, I think this is a general practice in k8s APIs to introduce new fields as pointers. I think it is because of the client-go compatibility. So, old API clients may fail to parse the API output after the field is introduced, even if it is not used in the cluster. Users may want to use the old clients because technically the API remains the v1beta2 version.

Maybe @deads2k could validate that statement to keep me honest.

mimowo avatar Sep 05 '25 09:09 mimowo

@everpeace please rebase

mimowo avatar Sep 05 '25 09:09 mimowo

Without knowing exact details, I think this is a general practice in k8s APIs to introduce new fields as pointers. I think it is because of the client-go compatibility. So, old API clients may fail to parse the API output after the field is introduced, even if it is not used in the cluster. Users may want to use the old clients because technically the API remains the v1beta2 version.

It makes sense. I changed suspend field as pointer in https://github.com/kubeflow/spark-operator/pull/2387/commits/43bcea28ad810c32379d772568ed9ef3d77b1d80.

everpeace avatar Sep 05 '25 13:09 everpeace

@mimowo all your comments are addressed. PTAL 🙇

everpeace avatar Sep 05 '25 14:09 everpeace

LGTM, thank you 👍

mimowo avatar Sep 05 '25 14:09 mimowo

I found this question from @soltysh in the k8s design, but there was no explicit answer: kubernetes/enhancements#2234 (comment), just the confirmation of the decision to use ptr in the comment.

Without knowing exact details, I think this is a general practice in k8s APIs to introduce new fields as pointers. I think it is because of the client-go compatibility. So, old API clients may fail to parse the API output after the field is introduced, even if it is not used in the cluster. Users may want to use the old clients because technically the API remains the v1beta2 version.

Maybe @deads2k could validate that statement to keep me honest.

Just to clarify, since my name was called out: the reason for going with *bool is twofold:

  1. The ability to differentiate from the value being set and not.
  2. Most importantly, the k8s API guidelines force all optional fields to be pointers.

Hope that helps.

soltysh avatar Sep 05 '25 17:09 soltysh

LGTM, @andreyvelich what are the next steps for merging / releasing this enhancement?

mimowo avatar Sep 11 '25 06:09 mimowo

/assign @nabuskey @vara-bonthu @ChenYi015 @yuchaoran2011 Please can you help with reviewing this PR?

We see a lot of interest of native integration between Kueue and SparkApplication CRD.

andreyvelich avatar Sep 11 '25 12:09 andreyvelich

Hi folks @andreyvelich @nabuskey @vara-bonthu @ChenYi015 @yuchaoran2011, friendly ping.

mimowo avatar Oct 02 '25 08:10 mimowo

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vara-bonthu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow[bot] avatar Oct 06 '25 05:10 google-oss-prow[bot]

@andreyvelich @everpeace @mimowo thank you for driving this!

tenzen-y avatar Oct 06 '25 05:10 tenzen-y