Sotw v3 server causing deadlock in LinearCache
The LinearCache clears the watches registered under the names of the changed resources in its notifyAll call (L147):
https://github.com/envoyproxy/go-control-plane/blob/996a28b416c6313efc2411e63329b0c2dc5fe24b/pkg/cache/v3/linear.go#L140-L148
However, cleaning only the watches under that name is not enough: it needs to clean all the watches pointing to the same chan Response, because the Sotw v3 server creates the chan Response with a buffer of only 1 (L369):
https://github.com/envoyproxy/go-control-plane/blob/7e211bd678b510c27fb440921158274297b2009d/pkg/server/sotw/v3/server.go#L362-L371
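To illustrate the effect of that single buffer slot, here is a minimal sketch (simplified types, not the actual go-control-plane code): the first notification fits in the buffer, and any second send blocks until the server drains the channel.

```go
// Minimal sketch of the 1-slot watch channel behavior; Response here is a
// simplified stand-in, not the real cache.Response type.
package main

import "fmt"

type Response struct{ Resource string }

func main() {
	// The sotw server creates the watch channel with a buffer of 1.
	watch := make(chan Response, 1)

	// The first notification fits in the buffer and returns immediately.
	watch <- Response{Resource: "resource-a"}
	fmt.Println("first send: ok")

	// A second send blocks until the server goroutine drains the channel;
	// select/default is used here only to show the blocking without hanging.
	select {
	case watch <- Response{Resource: "resource-b"}:
		fmt.Println("second send: ok")
	default:
		fmt.Println("second send: would block, buffer already full")
	}
}
```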
Consider the following sequence:
- The sotw server receives a `DiscoveryRequest` with 2 resource names and calls `cache.CreateWatch` on the LinearCache.
- The LinearCache registers the `chan Response` provided by the sotw server with 2 watch entries corresponding to the requested resources.
- The LinearCache's `UpdateResource` is called with the first resource name.
- The LinearCache's `UpdateResource` is called with the second resource name and the send to the `chan Response` blocks.
- The sotw server receives another `DiscoveryRequest`, but the LinearCache is still locked, so they are deadlocked.
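Here is a rough, self-contained reproduction of that sequence with a deliberately simplified cache (hypothetical types and methods, not the real LinearCache): the second `UpdateResource` blocks on the full channel while holding the mutex, so any later `CreateWatch` can never acquire the lock.

```go
// Simplified model of the deadlock described above. The types and method
// bodies are hypothetical; the real LinearCache differs in detail.
package main

import (
	"sync"
	"time"
)

type Response struct{ Name string }

type linearCacheModel struct {
	mu      sync.Mutex
	watches map[string]chan Response // resource name -> watch channel
}

// CreateWatch registers the same channel under each requested resource name.
func (c *linearCacheModel) CreateWatch(names []string, ch chan Response) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, n := range names {
		c.watches[n] = ch
	}
}

// UpdateResource notifies the watch for one resource while holding the lock,
// and cleans up only the watch registered under that one name.
func (c *linearCacheModel) UpdateResource(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if ch, ok := c.watches[name]; ok {
		ch <- Response{Name: name} // blocks once the 1-slot buffer is full
		delete(c.watches, name)
	}
}

func main() {
	c := &linearCacheModel{watches: map[string]chan Response{}}

	// One DiscoveryRequest asks for two resources over a single 1-slot channel.
	ch := make(chan Response, 1)
	c.CreateWatch([]string{"a", "b"}, ch)

	c.UpdateResource("a")    // fills the channel buffer
	go c.UpdateResource("b") // blocks on the send while still holding c.mu
	time.Sleep(100 * time.Millisecond)

	// The next DiscoveryRequest calls CreateWatch again and waits on c.mu
	// forever; the Go runtime reports "all goroutines are asleep - deadlock!".
	c.CreateWatch([]string{"c"}, make(chan Response, 1))
}
```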
The LinearCache could just maintain an additional chan Response -> resource names map for fast cleanup in the notifyAll call. But I think the root cause is that the sotw server uses a single goroutine to handle both directions of the bi-di gRPC stream.
If it handled them in separate goroutines, there would be no such deadlock, and there might even be no need to recreate a new chan Response on each DiscoveryRequest and unregister watches on each notifyAll call in the LinearCache.
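For concreteness, the reverse-map idea could look roughly like this (a sketch building on the simplified model above; the map name and layout are assumptions, not the code from the linked PR):

```go
// Sketch of keeping a reverse map so notifyAll can clean every watch entry
// that shares a channel, not just the one under the changed name. Hypothetical.
type cacheWithReverseMap struct {
	mu           sync.Mutex
	watches      map[string]chan Response   // resource name -> watch channel
	watchedNames map[chan Response][]string // watch channel -> all names it watches
}

func (c *cacheWithReverseMap) CreateWatch(names []string, ch chan Response) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, n := range names {
		c.watches[n] = ch
	}
	c.watchedNames[ch] = names
}

func (c *cacheWithReverseMap) UpdateResource(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	ch, ok := c.watches[name]
	if !ok {
		return
	}
	ch <- Response{Name: name}
	// Unregister every name watched through this channel, so no later update
	// tries a second send into its single buffer slot.
	for _, n := range c.watchedNames[ch] {
		delete(c.watches, n)
	}
	delete(c.watchedNames, ch)
}
```

And the separate-goroutine idea, sketched against a generic bi-di stream (the stream interface, message types, and handler below are assumptions for illustration, not the sotw server's actual structure): one goroutine only receives requests, while the main loop multiplexes requests and cache responses, so a response waiting to be sent never stalls the handling of the next request.

```go
// Sketch of splitting receive and send handling; the stream interface and
// message types are simplified stand-ins for the generated gRPC types.
type DiscoveryRequest struct{ ResourceNames []string }
type DiscoveryResponse struct{ Resource string }

type bidiStream interface {
	Recv() (*DiscoveryRequest, error)
	Send(*DiscoveryResponse) error
}

func handleStream(stream bidiStream, watch chan Response) error {
	reqCh := make(chan *DiscoveryRequest)
	errCh := make(chan error, 1)

	// Dedicated goroutine: do nothing but read requests off the stream.
	go func() {
		for {
			req, err := stream.Recv()
			if err != nil {
				errCh <- err
				return
			}
			reqCh <- req
		}
	}()

	// Main loop: multiplex requests and cache responses, so a blocked cache
	// notification never prevents the next request from being processed.
	for {
		select {
		case req := <-reqCh:
			_ = req // create or refresh watches in the cache here
		case resp := <-watch:
			if err := stream.Send(&DiscoveryResponse{Resource: resp.Name}); err != nil {
				return err
			}
		case err := <-errCh:
			return err
		}
	}
}
```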
I have opened https://github.com/envoyproxy/go-control-plane/pull/531 to address this.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
no stalebot
I believe this issue has been solved by recent reworks that changed the channel model of the server and cache. Can you confirm whether this issue is still present in recent versions?