Watch Request Blocked When Member Cluster Offline
What happened:
When the member cluster goes offline, there is a scenario where the client's Watch request gets blocked and does not receive pod events.
What you expected to happen:
Should we set a reasonable timeout for cache.Watch() calls, using context.WithTimeout or context.WithDeadline, to bound how long the call can block?
https://github.com/karmada-io/karmada/blob/master/pkg/search/proxy/store/multi_cluster_cache.go#L354
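For illustration, here is a minimal, self-contained sketch of the suggested approach. All names (blockingWatch, the 2-second budget) are hypothetical stand-ins, not Karmada's actual API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// blockingWatch simulates cache.Watch against an offline member cluster:
// it returns only when the caller's context is done.
func blockingWatch(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	// Wrap the call with context.WithTimeout so it cannot block forever.
	// The 2-second budget is illustrative, not a recommended value.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	if err := blockingWatch(ctx); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("watch attempt timed out instead of hanging:", err)
	}
}
```

Note that a context deadline also cuts an already-established watch stream, not just the initial call, which is why the discussion below turns to client re-watch behavior.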
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Karmada version:
- kubectl-karmada or karmadactl version (the result of `kubectl-karmada version` or `karmadactl version`):
- Others:
/cc @RainbowMango @XiShanYongYe-Chang @ikaven1024 Let's take a look at this issue together.
> When the member cluster goes offline, there is a scenario where the client's Watch request gets blocked and does not receive pod events.
Are events in other normal clusters affected?
@XiShanYongYe-Chang We are unable to receive events from any member cluster through the aggregated apiserver; I suspect the watch is blocked in cache.Watch().
```go
clusters := c.getClusterNames()
for i := range clusters {
	cluster := clusters[i]
	options.ResourceVersion = resourceVersion.get(cluster)
	cache := c.cacheForClusterResource(cluster, gvr)
	if cache == nil {
		continue
	}
	// cache.Watch() is executed serially, one cluster at a time
	w, err := cache.Watch(ctx, options)
	if err != nil {
		return nil, err
	}
	mux.AddSource(w, func(e watch.Event) {
		setObjectResourceVersionFunc(cluster, e.Object)
		addCacheSourceAnnotation(e.Object, cluster)
	})
}
```
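To make the failure mode concrete, here is a toy reproduction of the serial loop above (fakeWatch and the cluster names are hypothetical stand-ins, not Karmada code): one cluster whose Watch call never returns blocks every cluster after it, so the aggregate watch delivers no events at all.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fakeWatch stands in for cache.Watch: the offline cluster blocks until
// the context is cancelled; healthy clusters return immediately.
func fakeWatch(ctx context.Context, cluster string) error {
	if cluster == "offline-member" {
		<-ctx.Done()
		return ctx.Err()
	}
	fmt.Println("watch established:", cluster)
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Serial loop mirroring the snippet above: member2 is never reached
	// while the Watch call against the offline cluster hangs.
	for _, cluster := range []string{"member1", "offline-member", "member2"} {
		if err := fakeWatch(ctx, cluster); err != nil {
			fmt.Println("aggregate watch blocked on", cluster+":", err)
			return
		}
	}
}
```

Time-bounding or skipping the failing cluster, rather than returning the error, is one way to keep events flowing from healthy clusters; whether that matches the eventual fix is not confirmed in this thread.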
Thank you for your reply.
With your approach, the watch connection will be disconnected after a certain period of time, and the client then needs to initiate a new watch request. Do I understand correctly?
Yes, but my scenario is quite special. The member cluster has gone offline, but since it hasn't been removed from the ResourceRegistry, the rewatch requests still request the offline cluster, causing the watch requests to get stuck. I suspect this is the reason.
I will conduct a test to verify.
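For reference, a sketch of the re-watch pattern under discussion: the client re-establishes the watch from the last observed resourceVersion after each disconnect, but while the offline cluster remains in the ResourceRegistry, each retry blocks again. openWatch and lastRV are illustrative names, not Karmada's API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// openWatch stands in for re-establishing a watch through the proxy.
// While the offline cluster is still listed in the ResourceRegistry,
// every attempt blocks until its per-attempt deadline expires.
func openWatch(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	lastRV := "1000" // resourceVersion observed before the stream broke

	for attempt := 1; attempt <= 3; attempt++ {
		// The client re-watches from lastRV after each disconnect, but
		// each retry hits the offline cluster and times out again.
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		err := openWatch(ctx)
		cancel()
		if errors.Is(err, context.DeadlineExceeded) {
			fmt.Printf("re-watch from rv=%s, attempt %d: %v\n", lastRV, attempt, err)
		}
	}
	fmt.Println("retries keep failing until the cluster is removed or skipped")
}
```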
/close
@xigang: Closing this issue.
In response to this:
/close
Hi @xigang, why close this issue?
/reopen
@xigang: Reopened this issue.
In response to this:
/reopen
> Hi @xigang, why close this issue?
@XiShanYongYe-Chang I will submit a fix PR later.
> Hi @xigang, why close this issue?
@XiShanYongYe-Chang PR submitted. PTAL.
> Yes, but my scenario is quite special. The member cluster has gone offline, but since it hasn't been removed from the ResourceRegistry, the rewatch requests still request the offline cluster, causing the watch requests to get stuck. I suspect this is the reason.
> I will conduct a test to verify.
Has this been confirmed? If so, others can use this case to reproduce the issue.