[Feature][Zeta] Implement force stop functionality for jobs
Fixes: https://github.com/apache/seatunnel/issues/9995
Purpose of this pull request
This PR introduces the force stop functionality for jobs in the SeaTunnel Zeta Engine.
Does this PR introduce any user-facing change?
Yes. You can now force-stop jobs through the REST API. Please refer to the documentation.
How was this patch tested?
Check list
- [ ] If any new Jar binary package is added in your PR, please add a License Notice according to the New License Guide
- [ ] If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
- [ ] If you are contributing the connector code, please check that the following files are updated:
  - Update plugin-mapping.properties and add new connector information in it
  - Update the pom file of seatunnel-dist
  - Add ci label in label-scope-conf
  - Add e2e testcase in seatunnel-e2e
  - Update connector plugin_config
Hi @zhangshenghang @davidzollo . PTAL when you have time, thanks!
@dybyte Hi, please check why the CI failed.
CI has passed.
@zhangshenghang Sorry, I think this needs a bit more work, so I'll convert it to draft for now.
I organized my thoughts and submitted an updated change. Sorry for the confusion — this issue is a bit subtle and tricky.
Hi @zhangshenghang. PTAL when you have time. Thanks!
I have a question about the behavior when a job is already stuck in the DOING_SAVEPOINT state. In that case, can stopPipelineWithCheckpointFallback always successfully stop the job and release the slot resources, or are there still situations where the job may remain stuck in DOING_SAVEPOINT?
```
if (jobMaster.getCheckpointManager().isCompletedPipeline(pipelineId)) {
    forcePipelineFinish();
}
```
Conceptually, what we wanted here is a “force pause” of the job. But in the current implementation, the force option seems to force-end the job (e.g., set it to CANCELED) instead of pausing it. From your point of view, does a forced termination really count as a “pause”?
@dybyte
From my understanding, the main purpose of this feature is to forcefully terminate a job that is stuck in a certain state, so that it does not continue holding slot resources indefinitely. For that reason, the implementation focuses on ending the job rather than pausing it.
Regarding the job being stuck in the DOING_SAVEPOINT state, the reporter did not provide detailed logs, so it’s difficult to identify the exact root cause. My assumption is that it may be due to an issue during the savepoint-writing process, or to the termination signal not being received correctly.
Except for extreme cases such as deadlocks, I believe the current logic should be able to successfully terminate the job and release its slot resources.
Please let me know if there is anything I might have overlooked. Thank you!
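To make the intended behavior concrete, here is a minimal, self-contained sketch of the "try savepoint, then fall back to forced termination" idea discussed above. It is not the code in this PR: `ForceStopSketch`, `Pipeline`, `State.SAVEPOINT_DONE`, `releaseSlots`, and `stopWithFallback` are hypothetical names used only for illustration, and the timeout-based fallback is an assumption about how a stuck DOING_SAVEPOINT might be detected.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ForceStopSketch {

    // Hypothetical subset of pipeline states; the real engine defines more.
    enum State { RUNNING, DOING_SAVEPOINT, SAVEPOINT_DONE, CANCELED }

    // Hypothetical stand-in for a pipeline that holds slot resources.
    static final class Pipeline {
        State state = State.RUNNING;
        boolean slotsReleased = false;

        void releaseSlots() {
            slotsReleased = true;
        }
    }

    /**
     * Waits for the savepoint up to timeoutMs. If it does not complete
     * (timeout, failure, or interruption), the pipeline is marked CANCELED.
     * Slots are released in every case so they are never held indefinitely.
     */
    static State stopWithFallback(Pipeline p, CompletableFuture<Void> savepoint, long timeoutMs) {
        p.state = State.DOING_SAVEPOINT;
        try {
            savepoint.get(timeoutMs, TimeUnit.MILLISECONDS);
            p.state = State.SAVEPOINT_DONE;
        } catch (TimeoutException | ExecutionException e) {
            // Savepoint is stuck or failed: give up on it and end the job.
            savepoint.cancel(true);
            p.state = State.CANCELED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            p.state = State.CANCELED;
        } finally {
            p.releaseSlots();
        }
        return p.state;
    }

    public static void main(String[] args) {
        // A future that never completes simulates a job stuck in DOING_SAVEPOINT.
        Pipeline stuck = new Pipeline();
        System.out.println(stopWithFallback(stuck, new CompletableFuture<Void>(), 100));

        // An already-completed future simulates a healthy savepoint.
        Pipeline healthy = new Pipeline();
        System.out.println(stopWithFallback(healthy, CompletableFuture.<Void>completedFuture(null), 100));
    }
}
```

In this sketch the forced path ends the job rather than pausing it, which matches the reasoning above. It also shows the limit mentioned earlier: if the cancellation path itself blocks (for example, on a deadlock), a fallback like this cannot help.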
@dybyte Would it be possible to add shell script support for the force-cancel job operation? The Java client already supports the --force-cancel / -fcancel parameter. Thank you!