[FLINK-27355][runtime] Unregister JobManagerRunner after it's closed
What is the purpose of the change
Prevents early unregistration of JobManagerRunner before it's closed successfully.
Brief change log
Moves the JobManagerRunner unregistration to another stage in the CompletableFuture in DefaultJobManagerRunnerRegistry.localCleanupAsync() to keep JobManagerRunner from being unregistered before it's closed successfully.
Verifying this change
The existing tests have been modified to check the new behaviour.
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with
@Public(Evolving): (no) - The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes)
- The S3 file system connector: (no)
Documentation
- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented? (not applicable)
CI report:
- adf1fa67bea7f5d43e9b21276e7aa7f5f5e53089 Azure: SUCCESS
Bot commands
The @flinkbot bot supports the following commands:@flinkbot run azurere-run the last Azure build
@flinkbot run azure
@flinkbot run azure
The test testJobSubmissionUnderSameJobId fails with this change because we it expects the unregistration to be completed before the JobManagerRunner is terminated. Need to look into this.
The test
testJobSubmissionUnderSameJobIdfails with this change because we it expects the unregistration to be completed before theJobManagerRunneris terminated. Need to look into this.
The test is fixed now.
@XComp , could you please take another look? I'll work on figuring out the following meanwhile:
I'm wondering whether you could come up with a test where we verify that the close is called in another thread and the unregister is then executed in the main thread.
@flinkbot run azure
This PR is being marked as stale since it has not had any activity in the last 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.
If you are having difficulty finding a reviewer, please reach out to the community, contact details can be found here: https://flink.apache.org/what-is-flink/community/
If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
This PR has been closed since it has not had any activity in 120 days. If you feel like this was a mistake, or you would like to continue working on it, please feel free to re-open the PR and ask for a review.