Fix failure to do cluster health check when member cluster apiserver is configured with --shutdown-delay-duration
Signed-off-by: huangyanfeng [email protected]
What type of PR is this?
/kind bug
What this PR does / why we need it:
When the kube-apiserver is configured with --shutdown-delay-duration, it keeps serving requests normally for that duration after receiving a termination signal. During this window, the /healthz and /livez endpoints continue to return success, but /readyz immediately starts returning a failure.
https://github.com/kubernetes/kubernetes/blob/ab3e83f73424a18f298a0050440af92d2d7c4720/staging/src/k8s.io/apiserver/pkg/server/options/server_run_options.go#L386-L389
kube-apiserver --shutdown-delay-duration duration
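For illustration, the differing endpoint behavior can be observed with a minimal client-go sketch like the one below (an illustration only, not the controller's actual code; the kubeconfig path is a placeholder). While the shutdown delay is in effect, /readyz is expected to return a non-200 status while /healthz keeps returning 200:

```go
package main

import (
	"context"
	"fmt"
	"net/http"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// probe issues a GET against a raw health path ("/healthz" or "/readyz")
// on the target apiserver and returns the HTTP status code.
func probe(client kubernetes.Interface, path string) (int, error) {
	var statusCode int
	result := client.Discovery().RESTClient().Get().AbsPath(path).Do(context.TODO()).StatusCode(&statusCode)
	return statusCode, result.Error()
}

func main() {
	// Placeholder path: point this at a kubeconfig for the member cluster.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/member-kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// During the --shutdown-delay-duration window the apiserver still serves
	// requests, so /healthz returns 200 while /readyz reports a failure.
	for _, path := range []string{"/readyz", "/healthz"} {
		code, probeErr := probe(client, path)
		fmt.Printf("%s -> %d (healthy=%v, err=%v)\n", path, code, code == http.StatusOK, probeErr)
	}
}
```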
The karmada-controller log when the problem occurs:
```
E0408 09:42:37.339849 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding
E0408 09:42:47.345929 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding
...(the following 10 entries of the same format have been omitted)...
E0408 09:43:37.435542 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding
```
Which issue(s) this PR fixes: Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 49.34%. Comparing base (ba5ffba) to head (fb0f5cc). Report is 110 commits behind head on master.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master    #6277      +/-   ##
==========================================
+ Coverage   47.95%   49.34%   +1.39%
==========================================
  Files         676      678       +2
  Lines       55964    55125     -839
==========================================
+ Hits        26837    27203     +366
+ Misses      27355    26153    -1202
+ Partials     1772     1769       -3
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 49.34% <100.00%> (+1.39%) | :arrow_up: |
/lgtm /cc @XiShanYongYe-Chang @RainbowMango
Thanks~ /assign
Hi @yanfeng1992, thanks for your feedback.
I'd like to know more about this subject.
- Which businesses will be affected by this issue, and what will be the specific impact?
- What is the status of kube-apiserver in the member cluster after the shutdown-delay-duration period? Can kube-apiserver still provide services?
I'm wondering whether using the readyz and healthz endpoints to indicate the ready condition of the cluster is sufficient.
> Which businesses will be affected by this issue, and what will be the specific impact?

This causes the cluster to be marked offline even though it is actually healthy.

> What is the status of kube-apiserver in the member cluster after the shutdown-delay-duration period? Can kube-apiserver still provide services?

During this period, it continues to process requests normally. After the shutdown delay period, the kube-apiserver still provides service because it is deployed with multiple replicas and updated via rolling updates.
@XiShanYongYe-Chang
New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from whitewindmills. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/retest
> This causes the cluster to be marked offline even though it is actually healthy.

Then what would happen? Note that before the cluster is set offline, at least 3 consecutive detection failures are required. Do you mean that --shutdown-delay-duration is configured with a larger period?
--shutdown-delay-duration is configured with more than 60s
In our environment, some high-level warnings are generated when the cluster goes offline.
Will changes in cluster status also affect scheduling and cause rescheduling?
> Will changes in cluster status also affect scheduling and cause rescheduling?
No, no rescheduling.
@yanfeng1992
We discussed this PR at today's community meeting with @whitewindmills, and we agreed to move this forward. Given that this patch relies on the deprecated healthz endpoint, which might be removed in a future Kubernetes release, we hope to add a test (probably an E2E test) to protect this case. Once Kubernetes removes the endpoint, we will notice it immediately when Karmada starts testing against that Kubernetes version.
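As a rough idea of what such a guard could look like, here is a Ginkgo-style sketch of a test that would start failing if a member cluster's apiserver ever stopped serving /healthz. The memberClient variable is a hypothetical, pre-configured client for a member cluster; this only illustrates the shape of the test, not the test actually added to Karmada:

```go
package e2e

import (
	"context"
	"net/http"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"
	"k8s.io/client-go/kubernetes"
)

// memberClient is a placeholder for a client pointing at a member cluster's
// apiserver; wiring it up (kubeconfig, suite setup) is out of scope here.
var memberClient kubernetes.Interface

var _ = ginkgo.Describe("member cluster /healthz endpoint", func() {
	ginkgo.It("is still served by the member apiserver", func() {
		var statusCode int
		err := memberClient.Discovery().RESTClient().
			Get().AbsPath("/healthz").
			Do(context.TODO()).
			StatusCode(&statusCode).
			Error()
		gomega.Expect(err).ShouldNot(gomega.HaveOccurred())
		gomega.Expect(statusCode).To(gomega.Equal(http.StatusOK))
	})
})
```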