Canary Deployments with Gloo Federation

Open guydc opened this issue 3 years ago • 7 comments

Version

1.11.*

Is your feature request related to a problem? Please describe.

Gloo Edge supports in-place canary deployment: multiple control planes can reconcile the same CRs and produce xDS for two distinct data planes.

With Gloo Federation, it should be possible to perform a blue-green deployment that does not create any upgrade risk to existing clusters. Furthermore, Gloo Federation itself should support a blue-green deployment model, where a new federation version can be tested before it assumes control over existing clusters.

Describe the solution you'd like

This can be achieved by deploying an additional gloo-fed instance and creating new edge clusters with the latest gloo-edge version. Traffic is then gradually shifted from the old clusters to the new ones. The canary deployment concepts can be applied to Gloo Federation:

  • When upgrading a federated environment, two gloo-fed instances can co-exist in the federation cluster and reconcile the same CRs without collisions
  • Each gloo-fed instance is responsible for federating resources on GlooInstances with a matching gloo-edge version
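
The second point can be sketched as a small ownership check. This is only an illustration: the issue does not define what a "matching" version is, so the sketch assumes two versions match when their major.minor parts are equal. The names `minorOf` and `ownsInstance` are hypothetical and are not part of Gloo Fed.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minorOf returns the "major.minor" part of a version such as "1.11.4" or "v1.11.4".
func minorOf(version string) (string, error) {
	parts := strings.Split(strings.TrimPrefix(version, "v"), ".")
	if len(parts) < 2 {
		return "", fmt.Errorf("malformed version %q", version)
	}
	for _, p := range parts[:2] {
		if _, err := strconv.Atoi(p); err != nil {
			return "", fmt.Errorf("malformed version %q", version)
		}
	}
	return parts[0] + "." + parts[1], nil
}

// ownsInstance reports whether a gloo-fed instance running fedVersion should
// federate resources onto an edge running edgeVersion.
// Assumption: "matching" means the same major.minor.
func ownsInstance(fedVersion, edgeVersion string) bool {
	f, errF := minorOf(fedVersion)
	e, errE := minorOf(edgeVersion)
	return errF == nil && errE == nil && f == e
}

func main() {
	// Hypothetical edge clusters and the gloo-edge version each one runs.
	edges := []struct{ name, version string }{
		{"edge-old", "1.11.4"},
		{"edge-new", "1.12.0"},
	}
	for _, fed := range []string{"1.11.4", "1.12.0"} {
		for _, e := range edges {
			fmt.Printf("fed %s owns %s (%s): %v\n", fed, e.name, e.version, ownsInstance(fed, e.version))
		}
	}
}
```

With a rule like this, each edge is claimed by at most one of the two co-existing gloo-fed instances, which is what keeps them from colliding on the same GlooInstance.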

Describe alternatives you've considered

No response

Additional Context

No response

guydc avatar Mar 21 '22 12:03 guydc

Need estimate or alternatives.

chrisgaun avatar Mar 29 '22 13:03 chrisgaun

Need to understand level of effort on this one @sam-heilbron

chrisgaun avatar Mar 29 '22 13:03 chrisgaun

As an alternative, I tested whether we can run two Gloo Federation instances at once, with the second instance running in the opposite cluster (where, again, all clusters need to be registered and all resources deployed). I didn't like the UX of this alternative, hence it is crossed out.

But what I would like to circle back to is: How important is it to deploy gloo fed using the canary pattern?

Gloo Federation reads the Gloo Edge instances running in the clusters, picks up the configuration applied by the user, and applies it in the clusters so that cross-cluster traffic is possible and failover works.

From then on there aren't ongoing changes that Gloo Federation needs to reconcile. If Gloo Fed is down it merely hinders applying a new configuration, but everything that was already applied keeps on working.

Having a pre-prod environment to test upgrading Gloo Federation should be all that's needed.

rinormaloku avatar Apr 21 '22 14:04 rinormaloku

How important is it to deploy gloo fed using the canary pattern?

Gloo Fed is a privileged component that controls configuration for multiple edge control planes. I think that the blast radius from a malfunctioning new version can be significant. For example, consider a bug in the orphan termination functionality that erases configuration from all federated clusters, leading to a complete system outage.

There are also inherent compatibility risks when following canary deployment practices for the edge control and data planes in a federated environment. Gloo Fed CRDs and clients may be incompatible with edges that are still running an older version. AFAIK, k8s CRD versioning practices are not applied, breaking changes occur from time to time, and downgrading is difficult in Gloo Edge:

  • https://github.com/solo-io/gloo/issues/5499
  • https://github.com/solo-io/gloo/issues/5663
  • https://docs.solo.io/gloo-edge/1.7.23/operations/upgrading/upgrade_steps/

IMHO, the safest way to upgrade a federated environment is:

  • spin up a new federation cluster
  • spin up and register new edge clusters
  • apply federated state
  • gradually steer traffic towards the new environment, while keeping the old environment live and up-to-date.

This scheme is not always feasible, especially when the federation clusters require state synchronization. The next best thing would be to support an in-cluster gloo fed canary deployment.

These solutions would only work if Federated CRDs are properly versioned and deprecated.

If Gloo Fed is down it merely hinders applying a new configuration, but everything that was already applied keeps on working.

If Gloo Fed is down:

  • A canary deployment process that spins up new edge clusters will fail, as new edges are not federated.
  • DR for failed edges is impossible
  • Service degrades as the system enters a "read only" state

Having a pre-prod environment to test upgrading Gloo Federation should be all that's needed.

It's not always possible to have a pre-prod environment that completely simulates production.

guydc avatar Apr 25 '22 07:04 guydc

It's not always possible to have a pre-prod environment that completely simulates production.

That is an issue.

If Gloo Fed is down: new edges and DR are very rare cases with a low likelihood of occurring (unless the feature is used in a way that I haven't seen up to now).

  • Service degrades as the system enters a "read-only" state

The third issue is the most likely to occur, but its impact is completely negligible. The implementation of Gloo Edge is purposefully different from Istio: Gloo Edge doesn't configure the gateway proxy with endpoints (IP addresses for every pod; a luxury that a service mesh cannot afford as it would cause excessive load on the DNS proxy).

Summary: Gloo Fed only makes changes when you apply Gloo Fed CRDs or when you change the LoadBalancer service in one of the Gloo instances. (Those changes are not frequent, and at least shouldn't be made while you are updating Gloo Fed.)

That said, without pre-prod environments there is no alternative but to have some canary deployment approach to reduce the risk.

rinormaloku avatar Apr 25 '22 09:04 rinormaloku

Can limit the scope to having Gloo Fed backwards compatible with GE.

chrisgaun avatar May 17 '22 13:05 chrisgaun

Can limit the scope to having Gloo Fed backwards compatible with GE.

Right. For example, the Gloo Mesh Control Plane is compatible with n-1 version relay agents to support rolling upgrade scenarios. Ideally, Gloo Fed should have similar compatibility with Gloo Edge.

Otherwise, some form of protection is required to ensure that the state of n-1 GEs is not corrupted and that GF doesn't run into global failures due to an unexpected GE version under federation.
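
A minimal sketch of such a protection, assuming the rule is "GF at minor version n may federate GEs at n or n-1 within the same major version, and nothing else". That rule and the names `parseMinor` and `checkSkew` are assumptions for illustration; they are not taken from the Gloo Fed codebase.

```go
package main

import (
	"fmt"
	"strings"
)

// parseMinor extracts the major and minor numbers from a version such as "1.12.3".
func parseMinor(v string) (major, minor int, err error) {
	_, err = fmt.Sscanf(strings.TrimPrefix(v, "v"), "%d.%d", &major, &minor)
	if err != nil {
		return 0, 0, fmt.Errorf("malformed version %q: %w", v, err)
	}
	return major, minor, nil
}

// checkSkew returns nil when a GF at fedVersion may federate a GE at edgeVersion.
// Assumed policy: same major, and the edge is at the same minor or one minor behind.
func checkSkew(fedVersion, edgeVersion string) error {
	fMaj, fMin, err := parseMinor(fedVersion)
	if err != nil {
		return err
	}
	eMaj, eMin, err := parseMinor(edgeVersion)
	if err != nil {
		return err
	}
	if fMaj != eMaj || eMin > fMin || fMin-eMin > 1 {
		return fmt.Errorf("edge %s is outside the supported skew for fed %s", edgeVersion, fedVersion)
	}
	return nil
}

func main() {
	for _, edge := range []string{"1.12.1", "1.11.9", "1.10.0", "1.13.0"} {
		if err := checkSkew("1.12.0", edge); err != nil {
			// Skip this edge instead of writing to it or failing globally.
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println("federate:", edge)
	}
}
```

Skipping an out-of-range edge, rather than returning an error for the whole reconcile loop, is the part that addresses the "global failures" concern: one unexpected GE version should not stop federation for the others.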

guydc avatar Jun 21 '22 07:06 guydc

Breakdown of tasks (not necessarily in order):

  • [ ] make sure Gloo Fed CRDs are backwards-compatible: https://github.com/solo-io/gloo/issues/7234
  • [ ] Gloo Fed controllers should ignore unknown fields when unmarshalling: https://github.com/solo-io/gloo/issues/7235
  • [ ] Gloo Fed helm chart should install separate RBAC resources for each Gloo Fed install namespace: https://github.com/solo-io/gloo/issues/7236
  • [ ] Gloo Fed's federated resource statuses should be namespaced by Gloo Fed install namespace: https://github.com/solo-io/gloo/issues/7237
  • [ ] make sure status marshalling/unmarshalling is both forwards and backwards compatible: https://github.com/solo-io/gloo/issues/7238
  • [ ] push any Gloo Fed API updates to solo-apis: https://github.com/solo-io/gloo/issues/7239
  • [ ] Docs: add a page about Gloo Fed canary upgrades: https://github.com/solo-io/gloo/issues/7240
  • [ ] glooctl updates (if needed): https://github.com/solo-io/gloo/issues/7241
  • [ ] UI updates (if needed): https://github.com/solo-io/gloo/issues/7242
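
To illustrate the "ignore unknown fields when unmarshalling" task, here is a sketch using only Go's standard `encoding/json`. The `federatedSpec` type and its fields are hypothetical, and Gloo Fed's actual controllers may decode resources through a different library, so treat this as a picture of the behaviour rather than of the real code path.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// federatedSpec is a hypothetical stand-in for an older controller's view of a resource.
type federatedSpec struct {
	Name     string   `json:"name"`
	Clusters []string `json:"clusters"`
}

// decodeLenient ignores fields it does not know about (the encoding/json default).
func decodeLenient(data []byte) (federatedSpec, error) {
	var s federatedSpec
	err := json.Unmarshal(data, &s)
	return s, err
}

// decodeStrict rejects any field that is not declared on federatedSpec.
func decodeStrict(data []byte) (federatedSpec, error) {
	var s federatedSpec
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	err := dec.Decode(&s)
	return s, err
}

func main() {
	// A resource written by a newer version that added "newField".
	newer := []byte(`{"name":"fed-upstream","clusters":["east","west"],"newField":true}`)

	s, err := decodeLenient(newer)
	fmt.Println("lenient:", s.Name, s.Clusters, err)

	_, err = decodeStrict(newer)
	fmt.Println("strict:", err)
}
```

The lenient decoder lets an older controller keep reading resources that a newer one has written, which two co-existing Gloo Fed versions would need. The trade-off is that an older controller that rewrites the object from its own struct would drop the field it did not understand, which is one reason the status compatibility task is listed separately.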

jenshu avatar Sep 26 '22 07:09 jenshu

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.

github-actions[bot] avatar Jun 02 '24 10:06 github-actions[bot]