spark-operator
Suspend/Resume feature on SparkApplication
Purpose of this PR
This PR implements Suspend/Resume operation on SparkApplication.
Proposed changes:
- Add a `suspend: true|false` field in `SparkApplicationSpec`
- State transitions:
  - Suspend: `* --Suspend: true--> SUSPENDING --> SUSPENDED` (`*` = any non-terminal application state)
  - Resume: `SUSPENDING | SUSPENDED --Suspend: false--> RESUMING --> SUBMITTED`
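The transition rules above can be expressed as a small state function. The sketch below is illustrative only: the state names and the `nextState` helper are simplified stand-ins, not the operator's actual state constants or controller code.

```go
package main

import "fmt"

// AppState is a hypothetical stand-in for the operator's application states.
type AppState string

const (
	StateSubmitted  AppState = "SUBMITTED"
	StateRunning    AppState = "RUNNING"
	StateSuspending AppState = "SUSPENDING"
	StateSuspended  AppState = "SUSPENDED"
	StateResuming   AppState = "RESUMING"
	StateCompleted  AppState = "COMPLETED" // terminal
	StateFailed     AppState = "FAILED"    // terminal
)

func isTerminal(s AppState) bool {
	return s == StateCompleted || s == StateFailed
}

// nextState sketches the PR's transition rules: any non-terminal state moves
// to SUSPENDING when suspend is set, and SUSPENDING/SUSPENDED move to
// RESUMING when suspend is cleared. All other inputs leave the state alone.
func nextState(current AppState, suspend bool) AppState {
	switch {
	case suspend && !isTerminal(current) &&
		current != StateSuspending && current != StateSuspended:
		return StateSuspending
	case !suspend && (current == StateSuspending || current == StateSuspended):
		return StateResuming
	default:
		return current
	}
}

func main() {
	fmt.Println(nextState(StateRunning, true))    // SUSPENDING
	fmt.Println(nextState(StateSuspended, false)) // RESUMING
	fmt.Println(nextState(StateCompleted, true))  // COMPLETED (terminal states are not suspended)
}
```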
Fixes #2290
Change Category
- [ ] Bugfix (non-breaking change which fixes an issue)
- [x] Feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that could affect existing functionality)
- [ ] Documentation update
Rationale
Supporting the Suspend/Resume feature could make Kueue integration easier.
Checklist
- [x] I have conducted a self-review of my own code.
- [x] I have updated documentation accordingly.
- [x] I have added tests that prove my changes are effective or that my feature works.
- [x] Existing unit tests pass locally with my changes.
Additional Notes
@everpeace Maybe you can resolve this conflict so that this PR can continue to move forward?
We are also thinking about using Kueue with the Spark operator, which depends on this suspend state feature. It would be great if you could keep pushing this forward.
@Kevinz857 Hi, thanks for your comment!! I rebased my PR.
We are also thinking about using Kueue with the Spark operator, which depends on this suspend state feature. It would be great if you could keep pushing this forward.
My original SparkApplication integration PR in Kueue(https://github.com/kubernetes-sigs/kueue/pull/4032) was actually closed because
- Kueue already has the design doc (https://github.com/kubernetes-sigs/kueue/issues/1656) how to support spark workloads (PodGroups)
- my original PR does not follow the design doc.
However, I already opened a PR for this which will be needed to support SparkApplication integration in Kueue: https://github.com/kubernetes-sigs/kueue/pull/4102.
/assign @jacobsalway @vara-bonthu @ImpSy
@Kevinz857 Would you mind helping review the PR?
@Kevinz857 Would you mind helping review the PR?
@ChenYi015 I'd be very happy to; I will review this PR in detail later.
@everpeace Suggestions for Improvement
- Controller logic abstraction: Consider extracting the suspension check (`if app.Spec.Suspend != nil && *app.Spec.Suspend`) into a helper function like `isSuspended(app)` for clarity and potential reuse. Also, using a named constant for the event string (`"SparkApplicationSuspended"`) would improve consistency and readability.
- Pod termination check reusability: The logic that checks for Spark driver/executor pod termination before resuming could be encapsulated in a helper function (e.g. `areSparkPodsTerminated(app)`) to make the code more modular and easier to test. Optionally, adding retry limits or exponential backoff can help avoid indefinite retries in corner cases.
- Field documentation enhancement: In `sparkapplication_types.go`, the `Suspend` field comment could be more descriptive, such as:
  ```go
  // Suspend indicates whether the SparkApplication should be suspended.
  // When true, the controller skips submitting the Spark job.
  ```
- Test enhancement ideas:
  - Add unit tests for the suspension logic to ensure the controller skips job submission when `.spec.suspend` is true.
  - Consider validating whether the correct Kubernetes events (Suspended/Resumed) are emitted using a fake event recorder.
@ChenYi015 If have time, we can take a look together to see if these suggestions are suitable
@everpeace Or if there are any better suggestions, we can discuss them together
@Kevinz857 Thanks for your review.
Consider extracting the suspension check (if app.Spec.Suspend != nil && *app.Spec.Suspend) into a helper function like isSuspended(app) for clarity and potential reuse.
hmmm. `SparkApplicationSpec.Suspend` is not a pointer, so I can't see the motivation for the nil check.
https://github.com/kubeflow/spark-operator/blob/4d6dd1d2c6267e54e6518635b96e8919a49d813f/api/v1beta2/sparkapplication_types.go#L44
Perhaps, you saw ScheduledSparkApplication's suspension check:
https://github.com/kubeflow/spark-operator/blob/4d6dd1d2c6267e54e6518635b96e8919a49d813f/internal/controller/scheduledsparkapplication/controller.go#L109-L112
Could you elaborate on your suggestion if I missed something?
Also, using a named constant for the event string ("SparkApplicationSuspended") would improve consistency and readability.
hmmm. I think I already did. https://github.com/kubeflow/spark-operator/blob/4d6dd1d2c6267e54e6518635b96e8919a49d813f/pkg/common/event.go#L33-L38
Could you elaborate on your suggestion if I missed something?
Field documentation enhancement In sparkapplication_types.go, the Suspend field comment could be more descriptive, such as:
Thanks. Fixed in b3f8e8c
Test enhancement ideas Add unit tests for the suspension logic to ensure the controller skips job submission when .spec.suspend is true.
I think it's already covered: https://github.com/kubeflow/spark-operator/blob/4d6dd1d2c6267e54e6518635b96e8919a49d813f/internal/controller/sparkapplication/controller_test.go#L579-L584
Could you elaborate on your suggestion if I missed something?
Consider validating whether the correct Kubernetes events (Suspended/Resumed) are emitted using a fake event recorder.
IIUC, the current SparkApplication unit tests (internal/controller/sparkapplication/controller_test.go) do not assert emitted events at all. Although I fully agree with your suggestion (we should assert emitted events in unit tests), I think the enhancement should be done in another PR so that all of the unit tests assert emitted events. Or should this PR cover event assertions only for suspend/resume? If so, I would be happy to handle it.
@everpeace Thanks for your contribution !
@ChenYi015 I think the current submission can be merged. This change will not have a destructive impact on the current core functionality.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
@vara-bonthu @ChenYi015 @nabuskey I think that would be a great feature in the Spark Operator APIs, to natively integrate it with Kueue!
cc @kannon92 @tenzen-y @mimowo
@vara-bonthu @ChenYi015 @nabuskey I think that would be a great feature in the Spark Operator APIs, to natively integrate it with Kueue!
Awesome effort! We had multiple people asking for the integration, so I'm happy to see it rolling.
cc @kannon92 @tenzen-y @mimowo
Overall LGTM, let me comment here on the API itself, and I may also leave some small-ish comments to the code.
hmmm. SparkApplicationSpec.Suspend is not a pointer. So, I can't see any motivation for it.
I don't remember / wasn't involved with the decision when introducing the field to core k8s as a ptr.
I found this question from @soltysh in the k8s design, but there was no explicit answer: https://github.com/kubernetes/enhancements/pull/2234#discussion_r567014403, just the confirmation of the decision to use ptr in the comment.
Without knowing the exact details, I think it is a general practice in k8s APIs to introduce new fields as pointers, likely because of client-go compatibility: old API clients may fail to parse the API output after the field is introduced, even if it is not used in the cluster. Users may want to keep using old clients because technically the API remains at version v1beta2.
Maybe @deads2k could validate that statement to keep me honest.
@everpeace please rebase
Without knowing the exact details, I think it is a general practice in k8s APIs to introduce new fields as pointers, likely because of client-go compatibility: old API clients may fail to parse the API output after the field is introduced, even if it is not used in the cluster. Users may want to keep using old clients because technically the API remains at version v1beta2.
It makes sense. I changed the suspend field to a pointer in https://github.com/kubeflow/spark-operator/pull/2387/commits/43bcea28ad810c32379d772568ed9ef3d77b1d80.
@mimowo all your comments are addressed. PTAL 🙇
LGTM, thank you 👍
I found this question from @soltysh in the k8s design, but there was no explicit answer: kubernetes/enhancements#2234 (comment), just the confirmation of the decision to use ptr in the comment.
Without knowing the exact details, I think it is a general practice in k8s APIs to introduce new fields as pointers, likely because of client-go compatibility: old API clients may fail to parse the API output after the field is introduced, even if it is not used in the cluster. Users may want to keep using old clients because technically the API remains at version v1beta2.
Maybe @deads2k could validate that statement to keep me honest.
Just to clarify, since my name was called out: the reason for going with `*bool` is twofold:
- The ability to differentiate between the value being set and not set.
- Most importantly, the k8s API guidelines require all optional fields to be pointers.
Hope that helps.
LGTM, @andreyvelich what are the next steps for merging / releasing this enhancement?
/assign @nabuskey @vara-bonthu @ChenYi015 @yuchaoran2011 Could you please help review this PR?
We see a lot of interest in native integration between Kueue and the SparkApplication CRD.
Hi folks @andreyvelich @nabuskey @vara-bonthu @ChenYi015 @yuchaoran2011, friendly ping.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: vara-bonthu
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [vara-bonthu]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@andreyvelich @everpeace @mimowo thank you for driving this!