
Fix failed cluster health check when the member cluster apiserver is configured with --shutdown-delay-duration

Open · yanfeng1992 opened this pull request 7 months ago · 12 comments

Signed-off-by: huangyanfeng [email protected]

What type of PR is this?
/kind bug

What this PR does / why we need it:

When the apiserver is configured with --shutdown-delay-duration, it keeps serving requests normally during that window: the /healthz and /livez endpoints continue to return success, but /readyz immediately starts returning failure.

https://github.com/kubernetes/kubernetes/blob/ab3e83f73424a18f298a0050440af92d2d7c4720/staging/src/k8s.io/apiserver/pkg/server/options/server_run_options.go#L386-L389

```
kube-apiserver --shutdown-delay-duration duration
```

[screenshot: documentation of the kube-apiserver --shutdown-delay-duration flag]

The karmada-controller-manager log when the problem occurs:

```
E0408 09:42:37.339849 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding
E0408 09:42:47.345929 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding
...(the following 10 entries of the same format have been omitted)...
E0408 09:43:37.435542 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding
```
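For reference, a minimal sketch of the direction of the fix, probing the member apiserver through client-go and treating /healthz as the health signal so that an apiserver draining under --shutdown-delay-duration is not reported unhealthy. The package and function names here are illustrative, not the actual Karmada implementation:

```go
// Package health sketches a cluster health probe; illustrative only.
package health

import (
	"context"
	"net/http"

	"k8s.io/client-go/kubernetes"
)

// healthEndpointCheck performs a GET against an absolute path such as
// "/readyz" or "/healthz" on the member apiserver and returns the HTTP
// status code along with any request error.
func healthEndpointCheck(ctx context.Context, client kubernetes.Interface, path string) (int, error) {
	var code int
	result := client.Discovery().RESTClient().Get().AbsPath(path).Do(ctx)
	result.StatusCode(&code)
	return code, result.Error()
}

// clusterHealthy treats the cluster as healthy when /healthz succeeds,
// even while /readyz is already failing because the apiserver is inside
// its --shutdown-delay-duration window.
func clusterHealthy(ctx context.Context, client kubernetes.Interface) bool {
	code, err := healthEndpointCheck(ctx, client, "/healthz")
	return err == nil && code == http.StatusOK
}
```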

Which issue(s) this PR fixes: Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


— yanfeng1992 · Apr 08 '25 11:04


Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 49.34%. Comparing base (ba5ffba) to head (fb0f5cc). Report is 110 commits behind head on master.


Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master    #6277      +/-   ##
==========================================
+ Coverage   47.95%   49.34%   +1.39%
==========================================
  Files         676      678       +2
  Lines       55964    55125     -839
==========================================
+ Hits        26837    27203     +366
+ Misses      27355    26153    -1202
+ Partials     1772     1769       -3
```

| Flag | Coverage Δ |
|------|------------|
| unittests | 49.34% <100.00%> (+1.39%) :arrow_up: |

Flags with carried forward coverage won't be shown.


— codecov-commenter · Apr 08 '25 12:04

/lgtm
/cc @XiShanYongYe-Chang @RainbowMango

— whitewindmills · Apr 09 '25 02:04

Thanks~
/assign

— XiShanYongYe-Chang · Apr 09 '25 02:04

Hi @yanfeng1992, thanks for your feedback.

I'd like to know more about this subject.

  1. Which businesses will be affected by this issue, and what will be the specific impact?
  2. What is the status of kube-apiserver in the member cluster after the shutdown-delay-duration period? Can kube-apiserver still provide services?

I'm wondering whether using the readyz and healthz endpoints to indicate the cluster's ready condition is enough.

— XiShanYongYe-Chang · Apr 09 '25 09:04

> 1. Which businesses will be affected by this issue, and what will be the specific impact?

It causes the cluster to be marked offline, even though the cluster is actually healthy.

> 2. What is the status of kube-apiserver in the member cluster after the shutdown-delay-duration period? Can kube-apiserver still provide services?

During this period, it continues to process requests normally. After the shutdown delay period, the kube-apiserver still provides service because it is deployed as multiple replicas with rolling updates.

[screenshot: kube-apiserver --shutdown-delay-duration documentation]
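To see the divergence directly, here is a small client-go sketch (the kubeconfig path is a placeholder) that probes all three endpoints; during the shutdown delay window, /healthz and /livez keep returning 200 while /readyz fails:

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the member cluster's kubeconfig; the path is a placeholder.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/member-kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Probe each health endpoint and print the HTTP status code.
	for _, path := range []string{"/healthz", "/livez", "/readyz"} {
		result := client.Discovery().RESTClient().Get().AbsPath(path).Do(context.TODO())
		var code int
		result.StatusCode(&code)
		fmt.Printf("%s -> HTTP %d, err=%v\n", path, code, result.Error())
	}
}
```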

@XiShanYongYe-Chang

— yanfeng1992 · Apr 09 '25 11:04

New changes are detected. LGTM label has been removed.

— karmada-bot · Apr 11 '25 06:04

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please ask for approval from whitewindmills. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

— karmada-bot · Apr 11 '25 06:04

/retest

— yanfeng1992 · Apr 11 '25 08:04

> It causes the cluster to be marked offline, even though the cluster is actually healthy.

Then what would happen? Note that before the cluster is set offline, at least 3 consecutive detection failures are required. Do you mean that --shutdown-delay-duration is configured with a larger period?
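For context, a back-of-the-envelope sketch of that timing, assuming a 10s health-check interval and a 3-failure threshold; both constants are illustrative, not necessarily the values in any given deployment:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	probeInterval := 10 * time.Second // assumed cluster health-check interval
	failureThreshold := 3             // assumed consecutive failures before offline
	shutdownDelay := 60 * time.Second // --shutdown-delay-duration on the member apiserver

	// Every probe that lands inside the shutdown delay window hits a
	// failing /readyz, so the failures are consecutive.
	failedProbes := int(shutdownDelay / probeInterval)
	fmt.Printf("consecutive failed probes: %d\n", failedProbes)
	fmt.Printf("cluster taken offline: %v\n", failedProbes >= failureThreshold)
}
```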

— RainbowMango · Apr 15 '25 12:04

> Then what would happen? Note that before the cluster is set offline, at least 3 consecutive detection failures are required. Do you mean that --shutdown-delay-duration is configured with a larger period?

--shutdown-delay-duration is configured with more than 60s.

In our environment, some high-level warnings are generated when the cluster goes offline.

Will changes in cluster status also affect scheduling and cause rescheduling?

— yanfeng1992 · Apr 21 '25 09:04

> Will changes in cluster status also affect scheduling and cause rescheduling?

No, no rescheduling.

— RainbowMango · Apr 21 '25 11:04

@yanfeng1992 We discussed this PR at today's community meeting with @whitewindmills, and we agree to move this forward. Given that this patch relies on the deprecated healthz endpoint, which might be removed in a future Kubernetes release, we hope to add a test (probably an E2E test) to protect this case. Once Kubernetes removes the endpoint and Karmada starts testing against that version of Kubernetes, we would notice immediately.
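As one possible shape for that guard, a hedged sketch of a test that fails loudly once /healthz starts returning 404 (the names and kubeconfig path are illustrative, not the actual Karmada E2E suite):

```go
package e2e

import (
	"context"
	"net/http"
	"testing"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// TestHealthzEndpointStillServed fails as soon as the deprecated /healthz
// endpoint disappears from the apiserver under test, signaling that the
// health-check fallback in this patch needs to be revisited.
func TestHealthzEndpointStillServed(t *testing.T) {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder
	if err != nil {
		t.Fatalf("load kubeconfig: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	var code int
	result := client.Discovery().RESTClient().Get().AbsPath("/healthz").Do(context.TODO())
	result.StatusCode(&code)
	if code == http.StatusNotFound {
		t.Fatal("/healthz returned 404: the deprecated endpoint has been removed")
	}
	if err := result.Error(); err != nil {
		t.Fatalf("probe /healthz: %v", err)
	}
}
```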

— RainbowMango · Apr 22 '25 09:04