
multi cluster cache list and get skip problem clusters

Open yanfeng1992 opened this issue 1 year ago • 11 comments

Signed-off-by: huangyanfeng [email protected]

What type of PR is this? /kind bug

What this PR does / why we need it:

When some member clusters are in the Ready state but their apiservers have problems, search cannot return results through /apis/search.karmada.io/v1alpha1/proxying/karmada/proxy

Log from karmada-search:

E0625 09:15:01.629270       1 cacher.go:450] cacher (xxxx): unexpected ListAndWatch error: failed to list xxx, Kind=xxxx: Get "https://10.253.48.22:6443/apis/xxx.io/v1alpha1/xxxx?limit=10000": http: server gave HTTP response to HTTPS client; reinitializing...

W0625 09:15:02.635162       1 reflector.go:424] storage/cacher.go:xxx: failed to list xxx, Kind=xxx Get "https://10.253.48.22:6443/apis/xxx.io/v1alpha1/xxx?limit=10000": http: server gave HTTP response to HTTPS client

Which issue(s) this PR fixes: Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

`karmada-search`: multi cluster cache list and get skip problem clusters

yanfeng1992 avatar Jun 26 '24 07:06 yanfeng1992

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign ikaven1024 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

karmada-bot avatar Jun 26 '24 07:06 karmada-bot

Codecov Report

Attention: Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.

Project coverage is 28.22%. Comparing base (1255a08) to head (8ad6730). Report is 4 commits behind head on master.

Files Patch % Lines
pkg/search/proxy/store/multi_cluster_cache.go 0.00% 6 Missing :warning:


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5102      +/-   ##
==========================================
- Coverage   28.22%   28.22%   -0.01%     
==========================================
  Files         632      632              
  Lines       43551    43554       +3     
==========================================
  Hits        12293    12293              
- Misses      30362    30365       +3     
  Partials      896      896              
Flag Coverage Δ
unittests 28.22% <0.00%> (-0.01%) :arrow_down:



codecov-commenter avatar Jun 26 '24 08:06 codecov-commenter

/retest

XiShanYongYe-Chang avatar Jun 26 '24 09:06 XiShanYongYe-Chang

When some member clusters are in the Ready state but their apiservers have problems,

Hi @yanfeng1992, do you mean that the health check of the member cluster kube-apiserver is normal, but other APIs cannot be accessed? When will that happen?

XiShanYongYe-Chang avatar Jun 26 '24 09:06 XiShanYongYe-Chang

When some member clusters are in the Ready state but their apiservers have problems,

Hi @yanfeng1992, do you mean that the health check of the member cluster kube-apiserver is normal, but other APIs cannot be accessed? When will that happen?

In the pull mode with ANP, when ANP has a problem: host cluster (karmada-apiserver) -> member apiserver is not OK, but member karmada-agent -> host karmada-apiserver is OK. At this point the cluster is reported as healthy, yet search has problems.

@XiShanYongYe-Chang

yanfeng1992 avatar Jun 26 '24 09:06 yanfeng1992

What does host mean?

XiShanYongYe-Chang avatar Jun 27 '24 02:06 XiShanYongYe-Chang

The host cluster, i.e. the Karmada control plane cluster.

yanfeng1992 avatar Jun 27 '24 02:06 yanfeng1992

My understanding is that we should not simply skip problematic clusters with `continue`; if a cluster fails, we should inform the users.

cc @ikaven1024

xigang avatar Jul 01 '24 14:07 xigang

My understanding is that we should not simply skip problematic clusters with `continue`; if a cluster fails, we should inform the users.

cc @ikaven1024

I agree with this point. Mostly, listing gets objects from the cache, so an occasional disconnect from a member cluster will not cause the request to fail. And if the disconnect persists, it should be handled by operators.

ikaven1024 avatar Jul 02 '24 02:07 ikaven1024

My understanding is that we should not simply skip problematic clusters with `continue`; if a cluster fails, we should inform the users. cc @ikaven1024

I agree with this point. Mostly, listing gets objects from the cache, so an occasional disconnect from a member cluster will not cause the request to fail. And if the disconnect persists, it should be handled by operators.

Is there a better way that does not affect the list query but can still alert operators? Personally, I think it is more important not to affect the list query than to prompt the user. @xigang @ikaven1024

cc @RainbowMango

yanfeng1992 avatar Jul 02 '24 03:07 yanfeng1992

I have a question: If a member cluster fails, should the pods in the failed cluster still respond to user requests? If they do respond, how can the client be informed that the pods belong to a failed cluster?

If the pods in a failed cluster do not respond, will it affect the orchestration of federated workloads?

xigang avatar Jul 02 '24 04:07 xigang