karmada
multi cluster cache list and get skip problem clusters
Signed-off-by: huangyanfeng [email protected]
What type of PR is this? /kind bug
What this PR does / why we need it:
When some member clusters are in the Ready state but their apiservers have problems, search cannot obtain results through /apis/search.karmada.io/v1alpha1/proxying/karmada/proxy.
Log from karmada-search:
E0625 09:15:01.629270 1 cacher.go:450] cacher (xxxx): unexpected ListAndWatch error: failed to list xxx, Kind=xxxx: Get "https://10.253.48.22:6443/apis/xxx.io/v1alpha1/xxxx?limit=10000": http: server gave HTTP response to HTTPS client; reinitializing...
W0625 09:15:02.635162 1 reflector.go:424] storage/cacher.go:xxx: failed to list xxx, Kind=xxx Get "https://10.253.48.22:6443/apis/xxx.io/v1alpha1/xxx?limit=10000": http: server gave HTTP response to HTTPS client
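The change is in pkg/search/proxy/store/multi_cluster_cache.go. For illustration only, here is a minimal Go sketch of the skip-on-error idea, assuming hypothetical names such as clusterCache and listAcrossClusters rather than the actual karmada code:

```go
// Minimal sketch (not the actual karmada code): aggregate list results from
// member cluster caches and skip clusters whose apiserver is unreachable,
// instead of failing the whole proxied request.
package store

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/klog/v2"
)

// clusterCache is a hypothetical handle to a single member cluster's cache.
type clusterCache interface {
	Name() string
	List(ctx context.Context) ([]runtime.Object, error)
}

// listAcrossClusters collects objects from every reachable member cluster and
// logs (rather than returns) errors from problem clusters.
func listAcrossClusters(ctx context.Context, caches []clusterCache) []runtime.Object {
	var items []runtime.Object
	for _, c := range caches {
		objs, err := c.List(ctx)
		if err != nil {
			// Skip the problem cluster so healthy clusters still serve results.
			klog.Warningf("skip cluster %s when listing: %v", c.Name(), err)
			continue
		}
		items = append(items, objs...)
	}
	return items
}
```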
Which issue(s) this PR fixes: Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
`karmada-search`: the multi-cluster cache now skips problem clusters for list and get requests
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign ikaven1024 for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Codecov Report
Attention: Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
Project coverage is 28.22%. Comparing base (1255a08) to head (8ad6730). Report is 4 commits behind head on master.
| Files | Patch % | Lines |
|---|---|---|
| pkg/search/proxy/store/multi_cluster_cache.go | 0.00% | 6 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #5102 +/- ##
==========================================
- Coverage 28.22% 28.22% -0.01%
==========================================
Files 632 632
Lines 43551 43554 +3
==========================================
Hits 12293 12293
- Misses 30362 30365 +3
Partials 896 896
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 28.22% <0.00%> (-0.01%) | :arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
/retest
When some member clusters are in the ready state but the member cluster apiserver has problems,
Hi @yanfeng1992, do you mean that the health check of the member cluster kube-apiserver is normal, but other APIs cannot be accessed? When will that happen?
In the case of Pull mode with ANP: when ANP has a problem, the connection from the host cluster (karmada-apiserver) to the member apiserver is not OK, while the connection from the member's karmada-agent to the host karmada-apiserver is OK. At this time the cluster is reported healthy, but search will have problems.
@XiShanYongYe-Chang
What does host mean?
Host cluster means the karmada control plane cluster.
I understand that we should not simply handle problematic clusters by using `continue`. If a cluster fails, we should inform the users.
cc @ikaven1024
I agree with this point. Mostly, listing gets objects from the cache, so an occasional disconnect from member clusters will not cause a failure. And if the disconnect is persistent, it should be handled by operators.
Is there a better way that does not affect the list query but can still alert operators? Personally, I think not affecting the list query is more important than prompting the user. @xigang @ikaven1024
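One possible direction, sketched under assumptions (the metric name, recordSkippedCluster, and the registration below are hypothetical and not part of this PR): keep returning partial list results, but expose skipped clusters through a Prometheus counter so operators can alert on persistent failures:

```go
// Hypothetical sketch (not part of this PR): count skipped clusters in a
// Prometheus metric so operators get a signal while list/get keeps working.
package store

import "github.com/prometheus/client_golang/prometheus"

// skippedClusterRequests counts proxied list/get requests in which a member
// cluster was skipped because of errors, labeled by cluster name.
var skippedClusterRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "karmada_search_proxy_skipped_cluster_requests_total",
		Help: "Requests in which a member cluster was skipped due to errors.",
	},
	[]string{"cluster"},
)

func init() {
	// A real change would register against whatever registry karmada-search
	// already exposes; the default registry is used here for simplicity.
	prometheus.MustRegister(skippedClusterRequests)
}

// recordSkippedCluster would be called from the list/get path right where a
// problem cluster is skipped.
func recordSkippedCluster(cluster string) {
	skippedClusterRequests.WithLabelValues(cluster).Inc()
}
```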
cc @RainbowMango
I have a question: If a member cluster fails, should the pods in the failed cluster still respond to user requests? If they do respond, how can the client be informed that the pods belong to a failed cluster?
If the pods in a failed cluster do not respond, will it affect the orchestration of federated workloads?