Use go-control-plane's NewServer() and SnapshotCache system to improve ADS
Leverage go-control-plane's SnapshotCache.
The snapshot cache allows us to set a single-versioned snapshot of all xDS responses (per proxy) in a single location, while the snapshot cache implementation maintains all the state and machinery for responding to the actual gRPC requests. It frees us from worrying about the implementation details of delta vs. aggregate updates, streams vs. fetch, etc.
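To illustrate the idea, here is a minimal self-contained sketch of what a per-proxy versioned snapshot cache looks like. This is not go-control-plane's actual API (the real `cache.NewSnapshotCache` / `SetSnapshot` signatures differ and the real server layer is `server.NewServer`); all names here are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// Snapshot bundles every xDS resource type for one proxy under a single version.
type Snapshot struct {
	Version   string
	Resources map[string][]string // resource type URL -> serialized resources
}

// SnapshotCache stores the latest snapshot per node ID. In go-control-plane,
// the xDS server layer reads from the cache when answering gRPC stream/fetch
// requests, so the control plane only ever writes snapshots.
type SnapshotCache struct {
	mu        sync.RWMutex
	snapshots map[string]Snapshot
}

func NewSnapshotCache() *SnapshotCache {
	return &SnapshotCache{snapshots: make(map[string]Snapshot)}
}

// SetSnapshot atomically replaces the entire config for one proxy.
func (c *SnapshotCache) SetSnapshot(nodeID string, snap Snapshot) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.snapshots[nodeID] = snap
}

// GetSnapshot is what the server layer consults when a proxy asks for config.
func (c *SnapshotCache) GetSnapshot(nodeID string) (Snapshot, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok := c.snapshots[nodeID]
	return s, ok
}

func main() {
	cache := NewSnapshotCache()
	cache.SetSnapshot("proxy-1", Snapshot{
		Version: "v1",
		Resources: map[string][]string{
			"type.googleapis.com/envoy.config.cluster.v3.Cluster": {"cluster-a"},
		},
	})
	if snap, ok := cache.GetSnapshot("proxy-1"); ok {
		fmt.Println(snap.Version) // prints "v1"
	}
}
```

The key property is the write-and-forget interface: the controller generates config, calls `SetSnapshot`, and the cache/server layer handles all request/response state.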
From @shashankram: we have had issues in the past with the state logic for responding to proxy Stream Resource requests, and it is very difficult to debug. Offloading this to an upstream implementation should lighten our load substantially
Incoming WIP PR this afternoon!
btw a working branch for this has been available since time immemorial (https://github.com/eduser25/osm/commit/f4de97bae1d91445978e6fc9d3939eca703606ee); development and exploration have been put on hold because there are some shortcomings in go-control-plane's ADS cache implementation that have to be properly addressed first.
@eduser25 could you update this issue with details on the shortcomings in go-control-plane's ADS cache implementation?
context: https://github.com/envoyproxy/go-control-plane/issues/399
Update: initial SnapshotCache implementation support is merged: https://github.com/openservicemesh/osm/pull/3530. The issue with go-control-plane has been acknowledged and is being worked on: https://github.com/envoyproxy/go-control-plane/issues/431
I think we should look to start this up again. I found some issues in the current xDS eventing system that can lead to significant delays in updates to all proxies due to a single slow proxy (head-of-line blocking).
The pubsub channel is a 0-capacity queue. The first message gets pulled off an internal queue immediately and is sent out to each subscriber's queue (also 0-capacity) in turn. The second publish may block if a single subscriber queue is slow, which backs up all subsequent update messages across the entire system.
Added default label kind/needed. Please consider re-labeling this issue appropriately.
@steeling Can we use this issue as the root for the snapshot cache subtasks?
@shashankram moving the conversation about the snapshot cache here. Looking at the GitHub description (https://github.com/envoyproxy/go-control-plane), I'm not entirely sure the snapshot cache is smart enough to do internal diffs to determine when a change is needed. Instead, we set a new version on each update (even if configs haven't changed), and I believe the snapshot cache will push out the updated resources regardless.
So in that sense the snapshot cache will not reduce the load on the osm-controller. Happy to be wrong about this, but that's my understanding based on some brief reading.
Yes, I believe that's fine because the Envoy instance will compute the diff anyway. In that regard, the value of using the snapshot cache is to avoid using our controller implementation of the XDS state machine and let the cache deal with it. We've seen quite a few bugs in the past with the XDS state machine implementation in the controller, and using the snapshot cache will simplify what we need to worry about after generating the config. Moreover, I see value in simply writing the config to the cache and forgetting about what happens after that, which isn't the case currently.
Agreed! Curious about your opinion: once we migrate to the snapshot cache, should we keep the messaging.Broker so we know when to trigger config generation on changes, or rely purely on a continuous loop that constantly triggers config generation?
I don't think we should have a continuous loop to trigger config generation, but rather a combination of an event-driven approach with coalescing, plus a periodic reconciler (such as a Ticker) if necessary to recover from transient inconsistencies in the system. Constantly generating config in a tight loop when there's no change in the system has the downside of both the controller and Envoy running at 100% CPU all the time. This is especially evident at scale, in environments with 1000s of pods.
The broker is really meant to provide a streamlined mechanism for event pub-sub, with additional metrics and checks in place regarding which events are broadcast, multicast, or unicast. The broker is still necessary in that regard, given the snapshot cache isn't smart enough to drop events it doesn't need.
@steeling Does #5056 close this issue?
Yes it does!
Fixed in #5056