Support jump upgrades for better supportability
This is a request for us to explore jump upgrades, i.e., supporting upgrades from N-2 or N-3 instead of only supporting upgrades to the next version. Right now, users stuck on an outdated version have to upgrade multiple times to reach the latest version of Contour, which is rather cumbersome. If a customer is on v1.8 and wants to upgrade to v1.15, that's 7 separate upgrades.
Based on https://projectcontour.io/resources/upgrading/, it seems the upgrade procedure is specific to each version, i.e., if something changes in a particular resource such as a ClusterRole, then that resource needs to be re-applied, or a recent change forces the Envoy DaemonSet to be re-applied before Contour instead of vice versa. So it feels like an upgrade from v1.n to v1.n+2 or v1.n+3 is the aggregate of all incremental changes rolled into one. A couple of thoughts on what we're after:
- No matter how we decide to guide the upgrade procedure, we need to validate it fully and provide support. If we start now, this ensures that every version >= 1.14 can upgrade to the latest version through N+2 / N+3 jump upgrades
- We can break this N-2 / N-3 pattern on xDS changes, i.e., as when we previously migrated from xDS v2 to v3. So, for example, using a purely hypothetical scenario, if v1.19 is the last version using xDS v3, v1.20 uses xDS v4, and the user is on v1.18, it’s ok to mandate upgrading to v1.19 first before finally upgrading to v1.20.
- Allow similar exceptions for groundbreaking changes such as huge API overhauls, or large additions or removals of CRDs like the previous IngressRoute to HTTPProxy transition
- I'd prefer we front-load this effort by supporting N-3 over N-2 version upgrades, i.e., when we release v1.15, we should support upgrading from v1.12, v1.13, and v1.14
- The Contour Operator should support this
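As a rough illustration of the policy in the bullets above, here is a minimal Python sketch that plans the hops for a jump upgrade. It is only a sketch under stated assumptions: the function name `plan_upgrade_path`, the `max_skip` parameter, and the `mandatory_stops` list (versions that must not be skipped, such as the last release before an xDS change) are all hypothetical and not part of Contour or its tooling. Versions are represented by their minor number only (15 means v1.15).

```python
from typing import List, Sequence


def plan_upgrade_path(
    current: int,
    target: int,
    max_skip: int = 3,
    mandatory_stops: Sequence[int] = (),
) -> List[int]:
    """Return the minor versions to upgrade through, ending at `target`.

    Hypothetical helper: `max_skip` is the largest supported jump (3 for an
    N-3 policy) and `mandatory_stops` are minor versions that must be
    installed on the way (for example, the last release before an xDS change).
    """
    if target < current:
        raise ValueError("downgrades are not covered by this sketch")
    if max_skip < 1:
        raise ValueError("max_skip must be at least 1")

    # Mandatory stops strictly between current and target, then the target.
    waypoints = sorted(v for v in set(mandatory_stops) if current < v < target)
    waypoints.append(target)

    path: List[int] = []
    position = current
    for waypoint in waypoints:
        # Jump as far as the policy allows without passing the waypoint.
        while position < waypoint:
            position = min(position + max_skip, waypoint)
            path.append(position)
    return path


if __name__ == "__main__":
    # v1.8 to v1.15 under an N-3 policy with no exceptions.
    print(plan_upgrade_path(8, 15))  # [11, 14, 15]
    # The hypothetical xDS scenario: v1.19 must not be skipped.
    print(plan_upgrade_path(18, 20, mandatory_stops=[19]))  # [19, 20]
```

Treating exceptions as an explicit list of versions that cannot be skipped keeps the general N-3 rule simple while still allowing breaks for xDS or CRD transitions.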
Xref https://github.com/projectcontour/contour/issues/3573
thoughts on this? @stevesloka @youngnick
Some off-the-top-of-my-head thoughts:
Practically, lengthening the support period like this and allowing skip upgrades means that deprecations will need to take a multiple (probably 2-3) of the longest skip. That is, we can't deprecate and remove something unless you have to do more than one skip upgrade to do it.
In addition, if we start using a new Envoy feature (per #3573), we may need to keep the code that handles not having the new feature available around for 3 versions (this depends on if we also allow a sliding window of Envoy support).
Additionally, a requirement for skip upgrades in my view is that we test the upgrade process. Doing N-3 will mean that we will need to test:
- install N-3, upgrade to current
- install N-2, upgrade to current
- install N-1, upgrade to current
We will also need to be careful about adding feature gates, as each time we add a feature gate, we add a dimension to testing. For example, with a Foo feature gate, we would need to test six upgrades:
- install N-3, foo disabled, upgrade to current
- install N-3, foo enabled, upgrade to current
- and so on.
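To make the growth of the matrix concrete, here is a small Python sketch that enumerates the upgrade tests for a given release. It is an illustration only: the function `upgrade_test_matrix` and its parameters are hypothetical, versions are represented by their minor number, and each feature gate is assumed to be a simple on/off switch.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

TestCase = Tuple[str, str, Dict[str, bool]]


def upgrade_test_matrix(
    current_minor: int,
    window: int = 3,
    feature_gates: Sequence[str] = (),
) -> List[TestCase]:
    """List (source version, target version, gate settings) for each test.

    Hypothetical helper: `window` is the number of older releases supported
    as upgrade sources (3 for an N-3 policy).
    """
    # Source versions N-window .. N-1, oldest first.
    sources = [current_minor - k for k in range(window, 0, -1)]
    # Every on/off combination of the gates; one empty combination if none.
    gate_states = list(product([False, True], repeat=len(feature_gates)))

    cases: List[TestCase] = []
    for source in sources:
        for states in gate_states:
            settings = dict(zip(feature_gates, states))
            cases.append((f"v1.{source}", f"v1.{current_minor}", settings))
    return cases


if __name__ == "__main__":
    for case in upgrade_test_matrix(15, feature_gates=["Foo"]):
        print(case)
```

With no gates this yields three tests for an N-3 window, one gate yields six, and each further on/off gate doubles the count again.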
To bootstrap this process, we will also need to test N-3 to N-2, N-2 to N-1 for a while (at least three versions).
The tl;dr: this is definitely much better for usability, but we really need some upgrade testing, and maybe a way to keep track of the testing matrix. We will also need to make a call about what we do with Envoy support, since that may also blow out the testing matrix significantly.
If this is a hard need/requirement, might be simpler to just release less often. =)
Having "LTS" releases, with documentation that outlines how to upgrade between them as described in #3634, might reduce the cost, since you'd only have to worry about upgrading between "LTS" versions and not intermediate versions.
This came up as a need from a user at Kubecon just now!