grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

clusterimpl: update picker synchronously upon receipt of configuration update

Open easwars opened this issue 9 months ago • 3 comments

https://github.com/grpc/grpc-go/issues/5469 recommends an audit of existing LB policies to ensure that they update their pickers synchronously upon receipt of a configuration update.

xds_cluster_impl_experimental does not update its picker synchronously.

  • Configuration updates are handled here: https://github.com/grpc/grpc-go/blob/0561c78c9dcb5876a140ff7be97f0f37abc36ec8/xds/internal/balancer/clusterimpl/clusterimpl.go#L208
  • In the above function, the LB policy
    • initially checks for some error conditions
    • applies new load reporting configuration, if any
    • switches to a new child policy, if the child policy name has changed
    • pushes the newly received load balancer configuration onto a channel here: https://github.com/grpc/grpc-go/blob/0561c78c9dcb5876a140ff7be97f0f37abc36ec8/xds/internal/balancer/clusterimpl/clusterimpl.go#L253
    • and returns from UpdateClientConnState
  • The newly received configuration update is processed here asynchronously: https://github.com/grpc/grpc-go/blob/0561c78c9dcb5876a140ff7be97f0f37abc36ec8/xds/internal/balancer/clusterimpl/clusterimpl.go#L475
  • And a new picker is pushed upwards here: https://github.com/grpc/grpc-go/blob/0561c78c9dcb5876a140ff7be97f0f37abc36ec8/xds/internal/balancer/clusterimpl/clusterimpl.go#L479

This needs to be changed to return a picker synchronously upon receipt of a configuration update.

easwars avatar May 07 '24 18:05 easwars

@easwars I'd be working on this issue.

aranjans avatar May 09 '24 07:05 aranjans

After some thought, I've come to the understanding that I don't necessarily like the approach taken in the PR linked above because it still causes multiple picker updates and testing is very tied to the implementation rather than behavior. Instead I suggest we take the following approach. We can do this step-by-step:

  • Stop using deprecated APIs in the gracefulswitch.Balancer
    • Currently clusterimpl has a single child and uses a gracefulswitch.Balancer to support the case where the child policy name changes. And this is currently handled by calling the SwitchTo method on the gracefulswitch.Balancer. Instead we should have clusterimpl build JSON config in the format expected by the gracefulswitch.Balancer and call UpdateClientConnState instead.
    • Add an e2e style test for this as part of this change
      • As part of looking into these changes, I realized that we have no test for this scenario. Currently clusterimpl
  • Use a grpcsync.CallbackSerializer instead of a run goroutine to synchronize updates from the channel and updates from the child policy.
    • This is greatly simplify the implementation since we will not longer require a mutex, or a closed event, or a done event etc.
  • I feel that testing coverage is quite thin for the clusterimpl LB policy as a whole.
    • We should audit C-core and/or Java and see what tests they have and add them here. Mostly e2e style tests.
  • Once we have the above changes, updating the picker synchronously would be simple:
    • As part of handling a config update from the channel, the clusterimpl would do the following:
      • inhibit picker updates from the child policy
      • process the config update from the channel
      • resume picker updates, and as part of this send a new picker to the channel
      • At the end of handing a config update, whether or not an update from a child policy was received, the clusterimpl LB policy should send a single picker update to the channel.

easwars avatar Aug 12 '24 20:08 easwars