
Fix failed cluster health check when the member cluster apiserver is configured with --shutdown-delay-duration

Open · yanfeng1992 opened this pull request 7 months ago · 12 comments

Signed-off-by: huangyanfeng [email protected]

What type of PR is this?
/kind bug

What this PR does / why we need it:

When the apiserver is configured with --shutdown-delay-duration, it keeps serving requests normally during that window: the /healthz and /livez endpoints continue to return success, but /readyz immediately starts returning failure.

https://github.com/kubernetes/kubernetes/blob/ab3e83f73424a18f298a0050440af92d2d7c4720/staging/src/k8s.io/apiserver/pkg/server/options/server_run_options.go#L386-L389

```
kube-apiserver --shutdown-delay-duration duration
```

[screenshot: documentation of the kube-apiserver --shutdown-delay-duration flag]

The karmada-controller-manager log when the problem occurs:

```
E0408 09:42:37.339849 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding
E0408 09:42:47.345929 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding
...(the following 10 entries of the same format have been omitted)...
E0408 09:43:37.435542 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding
```
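For reference, a minimal sketch of the direction of the fix, probing the member apiserver through client-go and treating /healthz as the health signal so that an apiserver draining under --shutdown-delay-duration is not reported unhealthy. The package and function names here are illustrative, not the actual Karmada implementation:

```go
// Package health sketches a cluster health probe; illustrative only.
package health

import (
	"context"
	"net/http"

	"k8s.io/client-go/kubernetes"
)

// healthEndpointCheck performs a GET against an absolute path such as
// "/readyz" or "/healthz" on the member apiserver and returns the HTTP
// status code along with any request error.
func healthEndpointCheck(ctx context.Context, client kubernetes.Interface, path string) (int, error) {
	var code int
	result := client.Discovery().RESTClient().Get().AbsPath(path).Do(ctx)
	result.StatusCode(&code)
	return code, result.Error()
}

// clusterHealthy treats the cluster as healthy when /healthz succeeds,
// even while /readyz is already failing because the apiserver is inside
// its --shutdown-delay-duration window.
func clusterHealthy(ctx context.Context, client kubernetes.Interface) bool {
	code, err := healthEndpointCheck(ctx, client, "/healthz")
	return err == nil && code == http.StatusOK
}
```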

Which issue(s) this PR fixes: Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


— yanfeng1992 · Apr 08 '25 11:04


Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 49.34%. Comparing base (ba5ffba) to head (fb0f5cc). Report is 110 commits behind head on master.


Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master    #6277      +/-   ##
==========================================
+ Coverage   47.95%   49.34%   +1.39%
==========================================
  Files         676      678       +2
  Lines       55964    55125     -839
==========================================
+ Hits        26837    27203     +366
+ Misses      27355    26153    -1202
+ Partials     1772     1769       -3
```

| Flag | Coverage Δ |
|------|------------|
| unittests | 49.34% <100.00%> (+1.39%) :arrow_up: |

Flags with carried forward coverage won't be shown.


— codecov-commenter · Apr 08 '25 12:04

/lgtm
/cc @XiShanYongYe-Chang @RainbowMango

— whitewindmills · Apr 09 '25 02:04

Thanks~
/assign

— XiShanYongYe-Chang · Apr 09 '25 02:04

Hi @yanfeng1992, thanks for your feedback.

I'd like to know more about this subject.

  1. Which businesses will be affected by this issue, and what will be the specific impact?
  2. What is the status of kube-apiserver in the member cluster after the shutdown-delay-duration period? Can kube-apiserver still provide services?

I'm wondering whether using the readyz and healthz endpoints to indicate the cluster's ready condition is enough.

— XiShanYongYe-Chang · Apr 09 '25 09:04

> 1. Which businesses will be affected by this issue, and what will be the specific impact?

It causes the cluster to be marked offline, even though the cluster is actually healthy.

> 2. What is the status of kube-apiserver in the member cluster after the shutdown-delay-duration period? Can kube-apiserver still provide services?

During this period, it continues to process requests normally. After the shutdown delay period, the kube-apiserver still provides service because it is deployed as multiple replicas with rolling updates.

[screenshot: kube-apiserver --shutdown-delay-duration documentation]
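To see the divergence directly, here is a small client-go sketch (the kubeconfig path is a placeholder) that probes all three endpoints; during the shutdown delay window, /healthz and /livez keep returning 200 while /readyz fails:

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the member cluster's kubeconfig; the path is a placeholder.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/member-kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Probe each health endpoint and print the HTTP status code.
	for _, path := range []string{"/healthz", "/livez", "/readyz"} {
		result := client.Discovery().RESTClient().Get().AbsPath(path).Do(context.TODO())
		var code int
		result.StatusCode(&code)
		fmt.Printf("%s -> HTTP %d, err=%v\n", path, code, result.Error())
	}
}
```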

@XiShanYongYe-Chang

— yanfeng1992 · Apr 09 '25 11:04

New changes are detected. LGTM label has been removed.

— karmada-bot · Apr 11 '25 06:04

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please ask for approval from whitewindmills. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

— karmada-bot · Apr 11 '25 06:04

/retest

— yanfeng1992 · Apr 11 '25 08:04

> It causes the cluster to be marked offline, even though the cluster is actually healthy.

Then what would happen? Note that before the cluster is set offline, at least 3 consecutive detection failures are required. Do you mean that --shutdown-delay-duration is configured with a larger period?
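For context, a back-of-the-envelope sketch of that timing, assuming a 10s health-check interval and a 3-failure threshold; both constants are illustrative, not necessarily the values in any given deployment:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	probeInterval := 10 * time.Second // assumed cluster health-check interval
	failureThreshold := 3             // assumed consecutive failures before offline
	shutdownDelay := 60 * time.Second // --shutdown-delay-duration on the member apiserver

	// Every probe that lands inside the shutdown delay window hits a
	// failing /readyz, so the failures are consecutive.
	failedProbes := int(shutdownDelay / probeInterval)
	fmt.Printf("consecutive failed probes: %d\n", failedProbes)
	fmt.Printf("cluster taken offline: %v\n", failedProbes >= failureThreshold)
}
```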

— RainbowMango · Apr 15 '25 12:04

> Then what would happen? Note that before the cluster is set offline, at least 3 consecutive detection failures are required. Do you mean that --shutdown-delay-duration is configured with a larger period?

--shutdown-delay-duration is configured with more than 60s.

In our environment, some high-level warnings are generated when the cluster goes offline.

Will changes in cluster status also affect scheduling and cause rescheduling?

— yanfeng1992 · Apr 21 '25 09:04

> Will changes in cluster status also affect scheduling and cause rescheduling?

No, no rescheduling.

— RainbowMango · Apr 21 '25 11:04

@yanfeng1992 We discussed this PR at today's community meeting with @whitewindmills, and we agree to move this forward. Given that this patch relies on the deprecated healthz endpoint, which might be removed in a future Kubernetes release, we hope to add a test (probably an E2E test) to protect this case. Once Kubernetes removes the endpoint and Karmada starts testing against that version of Kubernetes, we would notice immediately.
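As one possible shape for that guard, a hedged sketch of a test that fails loudly once /healthz starts returning 404 (the names and kubeconfig path are illustrative, not the actual Karmada E2E suite):

```go
package e2e

import (
	"context"
	"net/http"
	"testing"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// TestHealthzEndpointStillServed fails as soon as the deprecated /healthz
// endpoint disappears from the apiserver under test, signaling that the
// health-check fallback in this patch needs to be revisited.
func TestHealthzEndpointStillServed(t *testing.T) {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder
	if err != nil {
		t.Fatalf("load kubeconfig: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	var code int
	result := client.Discovery().RESTClient().Get().AbsPath("/healthz").Do(context.TODO())
	result.StatusCode(&code)
	if code == http.StatusNotFound {
		t.Fatal("/healthz returned 404: the deprecated endpoint has been removed")
	}
	if err := result.Error(); err != nil {
		t.Fatalf("probe /healthz: %v", err)
	}
}
```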

— RainbowMango · Apr 22 '25 09:04