
:sparkles: Fix leader election issue with workspace controller and other KCP Controllers

Open sankar17 opened this issue 10 months ago • 16 comments

This PR addresses the following:

Change the workspace controller's start logic inside the runners to fix a leader election issue. The way we register controllers and define the runner is problematic: the runner only calls start. When leader election is lost, start returns (it was waiting on <-ctx.Done()), which causes the deferred queue.Shutdown() to run. Once a queue is shut down, there is no way to restart it, so the controller cannot process work if the same instance becomes leader again.
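To illustrate the failure mode described above, here is a minimal sketch in Go using only the standard library. It is not kcp or client-go code: `toyQueue`, `controller`, `newController`, and `simulate` are hypothetical names, and the toy queue only assumes the one property stated above, namely that a queue cannot be restarted after shutdown. The "fresh controller per term" variant is one possible way around the problem, shown for contrast, and is not necessarily the approach taken in this PR.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// toyQueue is a hypothetical stand-in for a controller work queue.
// Once shut down, it rejects new items and cannot be restarted.
type toyQueue struct {
	mu       sync.Mutex
	items    []string
	shutDown bool
}

// Add enqueues item and reports whether the queue accepted it.
func (q *toyQueue) Add(item string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.shutDown {
		return false
	}
	q.items = append(q.items, item)
	return true
}

// ShutDown permanently closes the queue.
func (q *toyQueue) ShutDown() {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.shutDown = true
}

// controller owns a queue that is created once, in newController.
type controller struct{ queue *toyQueue }

func newController() *controller { return &controller{queue: &toyQueue{}} }

// Start blocks until ctx is cancelled (leadership lost); the deferred
// ShutDown then closes the queue for good.
func (c *controller) Start(ctx context.Context) {
	defer c.queue.ShutDown()
	<-ctx.Done()
}

// simulate runs two leadership terms and reports, per term, whether the
// queue accepted an item. With freshPerTerm the controller is rebuilt for
// every term; otherwise a single instance is reused across terms.
func simulate(freshPerTerm bool) []bool {
	shared := newController()
	accepted := make([]bool, 0, 2)
	for term := 0; term < 2; term++ {
		c := shared
		if freshPerTerm {
			c = newController()
		}
		ctx, cancel := context.WithCancel(context.Background())
		done := make(chan struct{})
		go func() {
			defer close(done)
			c.Start(ctx)
		}()
		accepted = append(accepted, c.queue.Add(fmt.Sprintf("workspace-%d", term)))
		cancel() // leadership lost
		<-done   // Start has returned, so the queue is now shut down
	}
	return accepted
}

func main() {
	fmt.Println("reused controller:", simulate(false)) // prints: reused controller: [true false]
	fmt.Println("fresh controller:", simulate(true))   // prints: fresh controller: [true true]
}
```

With a reused controller, the item added in the second term is rejected because the queue was shut down when the first term ended. Rebuilding the controller (and its queue) for each term avoids this in the sketch.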

Background: At times we saw workspace creation by the controller get stuck in the scheduling phase and never recover. With leader election, requests/events are queued on both the leader and the other pods as well, which makes the queue depth grow.

sankar17 avatar Apr 04 '24 09:04 sankar17

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kcp-ci-bot avatar Apr 04 '24 09:04 kcp-ci-bot

Hi @sankar17. Thanks for your PR.

I'm waiting for a kcp-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


kcp-ci-bot avatar Apr 04 '24 09:04 kcp-ci-bot

/ok-to-test

palnabarun avatar Apr 05 '24 19:04 palnabarun

/retest

ramramu3433 avatar Apr 11 '24 06:04 ramramu3433

/test pull-kcp-test-e2e

sankar17 avatar Apr 11 '24 07:04 sankar17

/retest

ramramu3433 avatar Apr 11 '24 08:04 ramramu3433

@sankar17 @ramramu3433 these failures across the e2e test board don't seem like flakes to me. Does make test-e2e work locally for you for this branch?

embik avatar Apr 11 '24 08:04 embik

/retest

sankar17 avatar Apr 11 '24 09:04 sankar17

> @sankar17 @ramramu3433 these failures across the e2e test board don't seem like flakes to me. Does make test-e2e work locally for you for this branch?

I will test and update.

sankar17 avatar Apr 11 '24 09:04 sankar17

@sankar17 Please consider not running re-tests when tests are failing consistently, at least not without any code changes pushed. Those tests burn CI cycles without any real reason, we already know that they don't work.

embik avatar Apr 11 '24 09:04 embik

> @sankar17 Please consider not running re-tests when tests are failing consistently, at least not without any code changes pushed. Those tests burn CI cycles without any real reason, we already know that they don't work.

Sure, I will make sure it works locally before retesting.

sankar17 avatar Apr 11 '24 09:04 sankar17

Thanks!

embik avatar Apr 11 '24 09:04 embik

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please ask for approval from embik. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

kcp-ci-bot avatar Apr 16 '24 09:04 kcp-ci-bot

/test pull-kcp-verify

sankar17 avatar Apr 16 '24 12:04 sankar17

/retest-required

sankar17 avatar Apr 16 '24 12:04 sankar17

@sankar17: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kcp-verify | eb6571d636a45a85137ee893c159a2b219cb6ec1 | link | true | /test pull-kcp-verify |
| pull-kcp-verify-codegen | eb6571d636a45a85137ee893c159a2b219cb6ec1 | link | true | /test pull-kcp-verify-codegen |

Full PR test history


kcp-ci-bot avatar Apr 29 '24 13:04 kcp-ci-bot

PR needs rebase.


kcp-ci-bot avatar May 30 '24 13:05 kcp-ci-bot

This is implemented with an alternative approach in https://github.com/kcp-dev/kcp/pull/3132, so this PR is no longer needed.

sankar17 avatar May 30 '24 14:05 sankar17