flink icon indicating copy to clipboard operation
flink copied to clipboard

[FLINK-27355][runtime] Unregister JobManagerRunner after it's closed

Open kumar-mallikarjuna opened this issue 1 year ago • 7 comments

What is the purpose of the change

Prevents early unregistration of JobManagerRunner before it's closed successfully.

Brief change log

Moves the JobManagerRunner unregistration to another stage in the CompletableFuture in DefaultJobManagerRunnerRegistry.localCleanupAsync() to keep JobManagerRunner from being unregistered before it's closed successfully.

Verifying this change

The existing tests have been modified to check the new behaviour.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

kumar-mallikarjuna avatar Jul 05 '24 16:07 kumar-mallikarjuna

CI report:

  • adf1fa67bea7f5d43e9b21276e7aa7f5f5e53089 Azure: SUCCESS
Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

flinkbot avatar Jul 05 '24 16:07 flinkbot

@flinkbot run azure

kumar-mallikarjuna avatar Jul 05 '24 17:07 kumar-mallikarjuna

@flinkbot run azure

kumar-mallikarjuna avatar Jul 16 '24 15:07 kumar-mallikarjuna

The test testJobSubmissionUnderSameJobId fails with this change because we it expects the unregistration to be completed before the JobManagerRunner is terminated. Need to look into this.

kumar-mallikarjuna avatar Jul 31 '24 16:07 kumar-mallikarjuna

The test testJobSubmissionUnderSameJobId fails with this change because we it expects the unregistration to be completed before the JobManagerRunner is terminated. Need to look into this.

The test is fixed now.

kumar-mallikarjuna avatar Jul 31 '24 18:07 kumar-mallikarjuna

@XComp , could you please take another look? I'll work on figuring out the following meanwhile:

I'm wondering whether you could come up with a test where we verify that the close is called in another thread and the unregister is then executed in the main thread.

kumar-mallikarjuna avatar Jul 31 '24 18:07 kumar-mallikarjuna

@flinkbot run azure

kumar-mallikarjuna avatar Oct 07 '24 10:10 kumar-mallikarjuna

This PR is being marked as stale since it has not had any activity in the last 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out to the community, contact details can be found here: https://flink.apache.org/what-is-flink/community/

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

github-actions[bot] avatar Apr 06 '25 06:04 github-actions[bot]

This PR has been closed since it has not had any activity in 120 days. If you feel like this was a mistake, or you would like to continue working on it, please feel free to re-open the PR and ask for a review.

github-actions[bot] avatar May 07 '25 06:05 github-actions[bot]