Sotw v3 server causing deadlock in LinearCache

Open rueian opened this issue 4 years ago • 8 comments

The LinearCache clears the watches registered under the names of the changed resources in its notifyAll call (L147): https://github.com/envoyproxy/go-control-plane/blob/996a28b416c6313efc2411e63329b0c2dc5fe24b/pkg/cache/v3/linear.go#L140-L148

However, clearing only the watches registered under those names is not enough: all watches bound to the same chan Response need to be cleared as well, because the sotw v3 server creates the chan Response with a buffer of only 1 (L369): https://github.com/envoyproxy/go-control-plane/blob/7e211bd678b510c27fb440921158274297b2009d/pkg/server/sotw/v3/server.go#L362-L371
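To make the buffering concrete, here is a minimal self-contained Go sketch (not the actual server code; `Response` is a placeholder type) showing that a channel with a buffer of 1 accepts exactly one pending notification, and any further send blocks until the consumer reads:

```go
package main

import "fmt"

// Response stands in for the cache.Response interface; the exact type does
// not matter for illustrating the blocking behaviour.
type Response interface{}

func main() {
	// The sotw server hands the cache a channel with a buffer of one,
	// roughly equivalent to: value := make(chan cache.Response, 1)
	value := make(chan Response, 1)

	value <- "first notification" // fills the single buffer slot
	fmt.Println("first send accepted")

	// A second send would block here until the server goroutine drains
	// the channel. If that goroutine is itself waiting on the cache
	// mutex held by the sender, neither side can make progress.
	select {
	case value <- "second notification":
		fmt.Println("second send accepted")
	default:
		fmt.Println("second send would block: buffer already full")
	}
}
```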

Consider the following sequence:

  1. The sotw server receives a DiscoveryRequest with 2 resource names and calls cache.CreateWatch on the LinearCache.
  2. The LinearCache registers the chan Response provided by the sotw server with 2 watch entries corresponding to the requested resources.
  3. The LinearCache's UpdateResource is called with the first resource name.
  4. The LinearCache's UpdateResource is called with the second resource name; the send on the chan Response blocks because its single buffer slot is already full, and the cache stays locked.
  5. The sotw server receives another DiscoveryRequest, but the LinearCache is still locked by the blocked UpdateResource call, so they are deadlocked: the cache cannot finish notifying until the server drains the chan Response, and the server cannot drain it until the cache releases the lock.
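The shape of this deadlock can be reproduced with a small self-contained Go program. This is only a model of the interaction described above, using placeholder names rather than go-control-plane code: one goroutine plays the UpdateResource side (it blocks on the full channel while holding the mutex), the other plays the single sotw server goroutine (it blocks on the mutex before it ever drains the channel).

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex            // stands in for the LinearCache mutex
	resp := make(chan string, 1) // stands in for the 1-buffered chan Response

	go func() { // "UpdateResource" path
		mu.Lock()
		defer mu.Unlock()
		resp <- "resource A" // fills the buffer (step 3)
		resp <- "resource B" // blocks: buffer full, lock still held (step 4)
	}()

	time.Sleep(100 * time.Millisecond) // let the updater reach the blocked send

	done := make(chan struct{})
	go func() { // "sotw server" handling the next DiscoveryRequest (step 5)
		mu.Lock() // blocks: the updater still holds the mutex
		defer mu.Unlock()
		fmt.Println("server:", <-resp) // never reached
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("no deadlock")
	case <-time.After(time.Second):
		fmt.Println("deadlock: updater waits on the channel, server waits on the mutex")
	}
}
```

Running this prints the deadlock branch after the one-second timeout; in the real server there is no timeout, so both goroutines stay blocked forever.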

The LinearCache could maintain an additional chan Response -> resource names map for fast cleanup in the notifyAll call. But I think the root cause is that the sotw server uses a single goroutine to handle both directions of the bidirectional gRPC stream.
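As a sketch of that first idea (the type and field names below are hypothetical, not those in linear.go), a reverse index from channel to watched resource names would let notifyAll drop every watch bound to a channel in one pass, so the 1-buffered channel is never sent to twice:

```go
package main

import "fmt"

// Response is a placeholder for go-control-plane's cache.Response interface.
type Response interface{}

type watchChan = chan Response

// linearCacheSketch models only the bookkeeping relevant to this issue;
// the field names are hypothetical, not those of the real LinearCache.
type linearCacheSketch struct {
	// existing index: resource name -> channels watching that resource
	watches map[string]map[watchChan]struct{}
	// proposed reverse index: channel -> resource names it watches
	watchedNames map[watchChan]map[string]struct{}
}

// removeAllWatches drops every watch entry registered for ch, not just the
// entries under the names that changed, so the 1-buffered channel can never
// be sent to a second time.
func (c *linearCacheSketch) removeAllWatches(ch watchChan) {
	for name := range c.watchedNames[ch] {
		delete(c.watches[name], ch)
		if len(c.watches[name]) == 0 {
			delete(c.watches, name)
		}
	}
	delete(c.watchedNames, ch)
}

func main() {
	ch := make(watchChan, 1)
	c := &linearCacheSketch{
		watches: map[string]map[watchChan]struct{}{
			"resource A": {ch: {}},
			"resource B": {ch: {}},
		},
		watchedNames: map[watchChan]map[string]struct{}{
			ch: {"resource A": {}, "resource B": {}},
		},
	}
	c.removeAllWatches(ch) // after responding once on ch, forget both watches
	fmt.Println("remaining watch entries:", len(c.watches)) // 0
}
```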

If it handled them in separate goroutines, there would be no such deadlock, and there might even be no need to recreate a new chan Response on each DiscoveryRequest or to unregister watches on each notifyAll call in the LinearCache.
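A minimal sketch of that split, with placeholder types standing in for *discovery.DiscoveryRequest and cache.Response: one goroutine only drains incoming requests while the other forwards responses, so a slow or blocked response path can never stop the next DiscoveryRequest from being read.

```go
package main

import (
	"fmt"
	"sync"
)

// request/response are placeholders for the real protobuf and cache types.
type request string
type response string

// handleStream sketches handling the two directions of the bidirectional
// stream in separate goroutines: requests are drained independently of the
// response path, so a full response channel cannot stall request handling.
func handleStream(requests <-chan request, responses <-chan response) {
	var wg sync.WaitGroup
	wg.Add(1)

	go func() { // receive loop: creates/updates watches, never sends responses
		defer wg.Done()
		for req := range requests {
			fmt.Println("received request for:", req) // would call CreateWatch here
		}
	}()

	for resp := range responses { // send loop: would call stream.Send here
		fmt.Println("sending response:", resp)
	}
	wg.Wait()
}

func main() {
	requests := make(chan request, 2)
	responses := make(chan response, 2)
	requests <- "resource A,resource B"
	responses <- "resource A"
	responses <- "resource B"
	close(requests)
	close(responses)
	handleStream(requests, responses)
}
```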

rueian avatar Dec 21 '21 07:12 rueian

I have opened https://github.com/envoyproxy/go-control-plane/pull/531 to address this.

rueian avatar Dec 22 '21 00:12 rueian

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jan 21 '22 04:01 github-actions[bot]

no stalebot

rueian avatar Jan 21 '22 04:01 rueian

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Feb 20 '22 12:02 github-actions[bot]

no stalebot

rueian avatar Feb 20 '22 17:02 rueian

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Mar 23 '22 00:03 github-actions[bot]

no stalebot

rueian avatar Mar 23 '22 00:03 rueian

I believe this issue has been solved by recent reworks that changed the channel model of the server and cache. Can you confirm whether this issue is still present in recent versions?

valerian-roche avatar Mar 27 '24 22:03 valerian-roche