
Watch Request Blocked When Member Cluster Offline

Open xigang opened this issue 1 year ago • 11 comments

What happened:

When the member cluster goes offline, there is a scenario where the client's Watch request gets blocked and does not receive pod events.

What you expected to happen:

Should we set a timeout? That is, bound each cache.Watch() call with context.WithTimeout or context.WithDeadline so the operation cannot block indefinitely.

https://github.com/karmada-io/karmada/blob/master/pkg/search/proxy/store/multi_cluster_cache.go#L354
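
For illustration, a minimal sketch of that proposal; the 30-second value and the surrounding error handling are assumptions, not part of the linked code:

	// Sketch only: bound the per-cluster Watch call with a deadline so an
	// unreachable member cluster cannot block the caller forever.
	watchCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	w, err := cache.Watch(watchCtx, options)
	if err != nil {
		cancel()
		return nil, err
	}
	// cancel must be called by whatever eventually tears this watch down;
	// once the deadline expires the stream is closed and the client has to
	// re-establish the watch.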

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Karmada version:
  • kubectl-karmada or karmadactl version (the result of kubectl-karmada version or karmadactl version):
  • Others:

xigang avatar Oct 11 '24 07:10 xigang

/cc @RainbowMango @XiShanYongYe-Chang @ikaven1024 Let's take a look at this issue together.

xigang avatar Oct 11 '24 07:10 xigang

When the member cluster goes offline, there is a scenario where the client's Watch request gets blocked and does not receive pod events.

Are events from the other, healthy clusters affected as well?

XiShanYongYe-Chang avatar Oct 11 '24 08:10 XiShanYongYe-Chang

@XiShanYongYe-Chang Events stop arriving from all member clusters through the aggregated apiserver; I suspect the watch is blocked in cache.Watch().

	clusters := c.getClusterNames()
	for i := range clusters {
		cluster := clusters[i]
		options.ResourceVersion = resourceVersion.get(cluster)
		cache := c.cacheForClusterResource(cluster, gvr)
		if cache == nil {
			continue
		}
		// cache.Watch() is executed serially for each cluster
		w, err := cache.Watch(ctx, options)
		if err != nil {
			return nil, err
		}

		mux.AddSource(w, func(e watch.Event) {
			setObjectResourceVersionFunc(cluster, e.Object)
			addCacheSourceAnnotation(e.Object, cluster)
		})
	}
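
Because the loop above issues cache.Watch() serially, a single unreachable cluster stalls the watch setup for every cluster behind it. Purely as an illustration (not necessarily the approach taken in the later fix PR), the per-cluster watches could be started concurrently under a bounded context; the errgroup wiring, the copy of options, and the locking are assumptions layered on top of the snippet above:

	// Sketch: launch per-cluster Watch calls concurrently so a hung cluster
	// does not serialize the remaining clusters behind it. Combine with a
	// bounded context (see the timeout sketch earlier) so g.Wait() cannot
	// block indefinitely either. Cleanup of already-started watchers on
	// error is omitted for brevity.
	g, gctx := errgroup.WithContext(ctx) // golang.org/x/sync/errgroup
	var mu sync.Mutex
	watchers := make(map[string]watch.Interface)

	for i := range clusters {
		cluster := clusters[i]
		cache := c.cacheForClusterResource(cluster, gvr)
		if cache == nil {
			continue
		}
		opts := *options // copy so concurrent goroutines do not share ResourceVersion
		opts.ResourceVersion = resourceVersion.get(cluster)
		g.Go(func() error {
			w, err := cache.Watch(gctx, &opts)
			if err != nil {
				return err
			}
			mu.Lock()
			watchers[cluster] = w
			mu.Unlock()
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	for cluster, w := range watchers {
		cluster, w := cluster, w // keep per-iteration copies for the closure
		mux.AddSource(w, func(e watch.Event) {
			setObjectResourceVersionFunc(cluster, e.Object)
			addCacheSourceAnnotation(e.Object, cluster)
		})
	}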

xigang avatar Oct 11 '24 08:10 xigang

Thank you for your reply.

With your approach, the watch connection will be disconnected after a certain period of time, and the client then has to initiate a new watch request. Is my understanding correct?

XiShanYongYe-Chang avatar Oct 11 '24 10:10 XiShanYongYe-Chang

Thank you for your reply.

With your approach, the watch connection will be disconnected after a certain period of time, and the client then has to initiate a new watch request. Is my understanding correct?

Yes, but my scenario is quite special: the member cluster has gone offline, but because it has not been removed from the ResourceRegistry, the re-watch requests are still sent to the offline cluster, and I suspect that is what causes the watch requests to get stuck.
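
On the client side, the re-watch does not have to be hand-rolled: client-go's RetryWatcher re-establishes a closed watch from the last observed resourceVersion. A minimal sketch, assuming a client-go clientset (client) pointed at the aggregated apiserver and an initial resourceVersion taken from a prior List (both are placeholders here):

	// Sketch: RetryWatcher (k8s.io/client-go/tools/watch) reconnects the
	// watch whenever the server closes it, resuming from the last seen
	// resourceVersion, so periodic disconnects are transparent to the caller.
	rw, err := watchtools.NewRetryWatcher(initialResourceVersion, &cache.ListWatch{
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			return client.CoreV1().Pods(metav1.NamespaceAll).Watch(ctx, options)
		},
	})
	if err != nil {
		return err
	}
	defer rw.Stop()
	for event := range rw.ResultChan() {
		_ = event // handle pod events; the channel stays open across reconnects
	}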

I will conduct a test to verify.

xigang avatar Oct 11 '24 11:10 xigang

/close

xigang avatar Oct 12 '24 02:10 xigang

@xigang: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

karmada-bot avatar Oct 12 '24 02:10 karmada-bot

Hi @xigang, why close this issue?

XiShanYongYe-Chang avatar Oct 12 '24 02:10 XiShanYongYe-Chang

/reopen

xigang avatar Oct 14 '24 07:10 xigang

@xigang: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

karmada-bot avatar Oct 14 '24 07:10 karmada-bot

Hi @xigang, why close this issue?

@XiShanYongYe-Chang I will submit a fix PR later.

xigang avatar Oct 14 '24 07:10 xigang

Hi @xigang, why close this issue?

@XiShanYongYe-Chang PR submitted. PTAL.

xigang avatar Oct 23 '24 14:10 xigang

Yes, but my scenario is quite special: the member cluster has gone offline, but because it has not been removed from the ResourceRegistry, the re-watch requests are still sent to the offline cluster, and I suspect that is what causes the watch requests to get stuck.

I will conduct a test to verify.

Has this been confirmed? If so, others can use this case to reproduce the issue.

RainbowMango avatar Oct 24 '24 09:10 RainbowMango