[Feature][Zeta] Implement force stop functionality for jobs

Open dybyte opened this issue 1 month ago • 8 comments

Fixes: https://github.com/apache/seatunnel/issues/9995

Purpose of this pull request

This PR introduces the force stop functionality for jobs in the SeaTunnel Zeta Engine.

Does this PR introduce any user-facing change?

Yes. You can now force-stop jobs through the REST API. Please refer to the documentation.

How was this patch tested?

Check list

  • [ ] If any new Jar binary package is added in your PR, please add a License Notice according to the New License Guide
  • [ ] If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
  • [ ] If you are contributing the connector code, please check that the following files are updated:
    1. Update plugin-mapping.properties and add new connector information in it
    2. Update the pom file of seatunnel-dist
    3. Add ci label in label-scope-conf
    4. Add e2e testcase in seatunnel-e2e
    5. Update connector plugin_config

dybyte avatar Nov 17 '25 16:11 dybyte

Hi @zhangshenghang @davidzollo . PTAL when you have time, thanks!

dybyte avatar Nov 25 '25 17:11 dybyte

@dybyte Hi, please check why the CI failed.

zhangshenghang avatar Nov 26 '25 01:11 zhangshenghang

CI has passed.

dybyte avatar Nov 26 '25 06:11 dybyte

@zhangshenghang Sorry, I think this needs a bit more work, so I'll convert it to draft for now.

dybyte avatar Nov 27 '25 08:11 dybyte

I organized my thoughts and submitted an updated change. Sorry for the confusion — this issue is a bit subtle and tricky.

dybyte avatar Nov 28 '25 12:11 dybyte

Hi @zhangshenghang. PTAL when you have time. Thanks!

dybyte avatar Nov 28 '25 23:11 dybyte

I have a question about the behavior when a job is already stuck in the DOING_SAVEPOINT state. In that case, can stopPipelineWithCheckpointFallback always successfully stop the job and release the slot resources, or are there still situations where the job may remain stuck in DOING_SAVEPOINT?

```
if (jobMaster.getCheckpointManager().isCompletedPipeline(pipelineId)) {
    forcePipelineFinish();
}
```

Conceptually, what we wanted here is a “force pause” of the job. But in the current implementation, the force option seems to force-end the job (e.g., set it to CANCELED) instead of pausing it. From your point of view, does a forced termination really count as a “pause”?

@dybyte

corgy-w avatar Dec 10 '25 13:12 corgy-w

From my understanding, the main purpose of this feature is to forcefully terminate a job that is stuck in a certain state, so that it does not continue holding slot resources indefinitely. For that reason, the implementation focuses on ending the job rather than pausing it.

Regarding the job being stuck in the DOING_SAVEPOINT state, the reporter did not provide detailed logs, so it’s difficult to identify the exact root cause. My assumption is that it may be due to an issue during the savepoint-writing process, or not receiving the termination signal correctly. Except for extreme cases such as deadlocks, I believe the current logic should be able to successfully terminate the job and release its slot resources.
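To make the intended behavior concrete, here is a minimal, self-contained sketch of the "try a savepoint stop, fall back to forced cancellation" idea. It is not the code in this PR: `ForceStopSketch`, `Job`, `JobStatus`, `stopWithFallback`, and the timeout parameter are all hypothetical names used for illustration, and the real engine tracks more states than the four shown here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ForceStopSketch {

    /** Simplified, hypothetical job states (not the engine's full set). */
    enum JobStatus { RUNNING, DOING_SAVEPOINT, SAVEPOINT_DONE, CANCELED }

    /** Hypothetical stand-in for a job that holds slot resources. */
    static class Job {
        volatile JobStatus status = JobStatus.RUNNING;
        volatile boolean slotsHeld = true;

        void releaseSlots() {
            slotsHeld = false;
        }
    }

    /**
     * Waits for the savepoint to finish. If it fails or does not finish
     * within the timeout, the job is marked CANCELED instead of staying in
     * DOING_SAVEPOINT. Slots are released on every path.
     */
    static JobStatus stopWithFallback(
            Job job, CompletableFuture<Void> savepoint, long timeoutMillis) {
        job.status = JobStatus.DOING_SAVEPOINT;
        try {
            savepoint.get(timeoutMillis, TimeUnit.MILLISECONDS);
            job.status = JobStatus.SAVEPOINT_DONE;
        } catch (TimeoutException | ExecutionException e) {
            // Savepoint is stuck or failed: give up on it and force the end state.
            savepoint.cancel(true);
            job.status = JobStatus.CANCELED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            job.status = JobStatus.CANCELED;
        } finally {
            job.releaseSlots();
        }
        return job.status;
    }

    public static void main(String[] args) {
        // A savepoint that completes immediately.
        Job healthy = new Job();
        System.out.println(
                stopWithFallback(healthy, CompletableFuture.completedFuture(null), 100));

        // A savepoint that never completes, simulating a stuck job.
        Job stuck = new Job();
        System.out.println(stopWithFallback(stuck, new CompletableFuture<>(), 100));
    }
}
```

In this sketch the stuck job ends up CANCELED with its slots released, which matches the goal described above: the job is terminated, not paused. A deadlock inside the thread that performs the forced stop would still defeat this approach.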

Please let me know if there is anything I might have overlooked. Thank you!

dybyte avatar Dec 10 '25 14:12 dybyte

@dybyte Would it be possible to add shell script support for the force-cancel job operation? The Java client already supports the --force-cancel / -fcancel parameter. Thank you!

corgy-w avatar Dec 16 '25 10:12 corgy-w