Allow downstream projects to catch up with upstream releases

Open markusthoemmes opened this issue 5 years ago • 26 comments

Knative currently releases on a 6-week cadence. That's awesome for getting features out quickly, but it makes it hard for downstream projects on a slower release cadence to keep track of upstream. We're currently facing that in OpenShift Serverless. Kubeflow has fallen quite a bit behind in releases too.

Should we make it possible for such downstream projects to be able to "catch up" with the upstream project? I think so. Others do too :slightly_smiling_face:.

We had an initial discussion about this in Slack, which I'll try to summarize here.

  • (from @rhuss): If we widened our upgrade policy from N-1 -> N to N-2 -> N, we'd allow downstream releases to catch up rather than continuously falling behind.
  • (from @mattmoor, @markusthoemmes, @vaikas): To allow the operator to orchestrate an upgrade through several versions, we need a good signal that an upgrade is actually done. Our reconcilers implicitly do a lot of the upgrade logic today in that they gracefully handle new fields and default them actively for example. That work would need to be "watchable" (i.e. by being externalized into a job) so that the operator knows when version Y has been upgraded successfully, so it can move on to Z. That'd allow an upgrade through an arbitrary number of versions.
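The two policies in the first bullet can be stated as a check on minor-version distance. This is only a minimal sketch: it assumes versions look like `0.16.0`, that only the minor number matters, and that major versions are equal. `is_supported_upgrade` and `max_skew` are hypothetical names for illustration, not part of Knative or its operator.

```python
def minor(version: str) -> int:
    """Return the minor component of a 'major.minor[.patch]' version string."""
    parts = version.lstrip("v").split(".")
    return int(parts[1])


def is_supported_upgrade(current: str, target: str, max_skew: int = 1) -> bool:
    """True if target is ahead of current by 1..max_skew minor versions.

    max_skew=1 models the N-1 -> N policy, max_skew=2 models N-2 -> N.
    Major versions are assumed equal (an assumption of this sketch).
    """
    distance = minor(target) - minor(current)
    return 1 <= distance <= max_skew
```

Under this model, widening the policy only means raising `max_skew`; the cost is that every compatibility path has to survive that many releases.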

markusthoemmes avatar Jul 14 '20 16:07 markusthoemmes

@markusthoemmes: The label(s) kind/proposal cannot be applied, because the repository doesn't have them

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow-robot avatar Jul 14 '20 16:07 knative-prow-robot

We also need to think about upgrade paths that are not just our K8s resources. Our recent work on the Activator -> Autoscaler communication or Queue-Proxy -> Autoscaler communication comes to mind. We've been coding those with our N-1 -> N policy in mind, meaning that we're dropping backward compatibility paths as soon as the new release cuts (for example: Adding JSON and dropping GOB encoding, we kept the GOB path intact for N but dropped it in N+1).

To catch such changes, I think we need a wider upgrade policy (and wider upgrade tests) to ensure we keep backwards-compatibility paths intact for longer than just one release.
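As an illustration of keeping a compatibility path alive for the whole policy window, here is a rough sketch of a receiver that prefers JSON and falls back to an older encoding. Python has no GOB decoder, so a made-up `key=value;key=value` text format stands in for the legacy wire format; `decode_stat`, `decode_legacy`, and `accept_legacy` are hypothetical names and do not reflect the actual Knative code.

```python
import json


def decode_legacy(payload: bytes) -> dict:
    """Stand-in for the old wire format: 'key=value' pairs separated by ';'."""
    text = payload.decode("utf-8")
    return dict(pair.split("=", 1) for pair in text.split(";") if pair)


def decode_stat(payload: bytes, accept_legacy: bool) -> dict:
    """Decode a message, preferring JSON and falling back to the legacy format.

    accept_legacy should stay True for as many releases as the upgrade
    policy spans; only after that window can the fallback be deleted.
    """
    try:
        decoded = json.loads(payload)
        if isinstance(decoded, dict):
            return decoded
    except ValueError:
        pass  # not JSON; maybe a sender that has not been upgraded yet
    if not accept_legacy:
        raise ValueError("legacy encoding is no longer accepted")
    return decode_legacy(payload)
```

With an N-1 -> N policy the fallback can be removed one release after JSON is introduced; with N-2 -> N it would have to stay for two.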

markusthoemmes avatar Jul 14 '20 16:07 markusthoemmes

@mattmoor also suggested in the Slack thread that we possibly lower our release cadence as we near 1.0.0 (K8s and Istio are both on a 3-month schedule).

markusthoemmes avatar Jul 14 '20 16:07 markusthoemmes

How does "If we widened our upgrade policy from N-1 -> N to N-2 -> N, we'd allow downstream releases to catch up rather than continuously falling behind." solve the problem? If the downstream cadence is slower than 6 weeks, this just prolongs the suffering; downstream will still ultimately fall behind.

vagababov avatar Jul 14 '20 17:07 vagababov

I agree with @vagababov. Going from N-1 -> N to N-2 -> N just delays the problem and makes engineering a lot more complex for big changes that require some orchestration (e.g. when Ingress went from being global to namespaced and it took 3 releases to orchestrate the rollout).

The Operator should be able to apply each version sequentially. Regarding getting a signal that an upgrade is successful, how does the Operator do it for a simple N-1 -> N upgrade today?

JRBANCEL avatar Jul 14 '20 18:07 JRBANCEL

It should be possible for the Operator to calculate the safe upgrade path and follow it, yes.

That seems like a good path forward to me.

I don't believe the Operator has this logic today. @houshengbo
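Calculating an upgrade path could look roughly like the sketch below. It assumes minor versions are plain integers, that the list of released versions is known, and that the policy is expressed as a maximum jump per step. `plan_upgrade_path`, `released`, and `max_skew` are hypothetical names and do not describe how the Operator works.

```python
def plan_upgrade_path(current, target, released, max_skew=1):
    """Return the minor versions to install, in order, to get from current to target.

    released: minor version numbers that exist upstream.
    max_skew: largest jump the upgrade policy supports in one step.
    """
    if target < current:
        raise ValueError("downgrades are out of scope for this sketch")
    path = []
    position = current
    while position < target:
        # Greedily take the largest supported jump that does not overshoot.
        limit = min(position + max_skew, target)
        candidates = [v for v in released if position < v <= limit]
        if not candidates:
            raise ValueError(f"no supported step from {position}")
        position = max(candidates)
        path.append(position)
    return path
```

For example, `plan_upgrade_path(14, 17, [14, 15, 16, 17])` yields `[15, 16, 17]`, while `max_skew=2` shortens it to `[16, 17]`. Each step would still need a completion signal before the next one starts.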

Cynocracy avatar Jul 14 '20 19:07 Cynocracy

Our reconcilers implicitly do a lot of the upgrade logic today in that they gracefully handle new fields and default them actively for example. That work would need to be "watchable" (i.e. by being externalized into a job) so that the operator knows when version Y has been upgraded successfully, so it can move on to Z. That'd allow an upgrade through an arbitrary number of versions.

Should we be doing that in the reconciler? Personally, I'd prefer to put them in a Job that reflects the one-off changes in its status. Edit: It sounds like there's some consensus there from the suggestion in the quote, just double checking.

Cynocracy avatar Jul 14 '20 19:07 Cynocracy

That was another thing that came up in the slack convo: Maybe we should stop overloading our reconcilers as the vehicle for upgrades, but it'll require some careful thought about how we author our reconcilers to co-exist with that separate process.

mattmoor avatar Jul 14 '20 19:07 mattmoor

What I mentioned in the slack was roughly this. Have 3 distinct steps:

  • pre
  • upgrade
  • post

If we use jobs, we have an easy signal: the job completing successfully could trigger the next stage. If we point step+1 (say, upgrade in the above example, passing in an objref) at the job created in 'pre', then it could block until that job completes successfully; post in turn would watch upgrade. If we wanted that ref to not only be a job, we could watch for either 'ready' or 'completed', but I think using jobs has other nice properties, like retries, so perhaps we can get away with just using those.

Also, I think we should only support N->N+1. We should consider downgrades as well (so far it seems to be only about upgrades), so we should also support N->N-1. I had been assuming that if you want to skip 2 versions, the path would be N->N+1 and then N+1->N+2.
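The pre / upgrade / post chaining can be sketched in a few lines. This is an in-memory simulation, not Kubernetes code: `StageJob` is a hypothetical stand-in for a Job, `depends_on` plays the role of the object reference, and the retry loop only imitates the retry behaviour a real Job would provide.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StageJob:
    """Hypothetical stand-in for a Kubernetes Job running one upgrade stage."""
    name: str
    work: Callable[[], None]
    depends_on: Optional["StageJob"] = None
    max_attempts: int = 3
    succeeded: bool = False
    attempts: int = 0

    def run(self) -> bool:
        # Block on the referenced stage: do nothing until it has succeeded.
        if self.depends_on is not None and not self.depends_on.succeeded:
            return False
        while self.attempts < self.max_attempts and not self.succeeded:
            self.attempts += 1
            try:
                self.work()
                self.succeeded = True
            except Exception:
                pass  # a real Job would back off before retrying
        return self.succeeded
```

Wiring `upgrade = StageJob("upgrade", work, depends_on=pre)` and `post = StageJob("post", work, depends_on=upgrade)` gives the ordering described above: `upgrade.run()` returns `False` until `pre` has succeeded.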

vaikas avatar Jul 14 '20 20:07 vaikas

How does "If we widened our upgrade policy from N-1 -> N to N-2 -> N, we'd allow downstream releases to catch up rather than continuously falling behind." solve the problem? If the downstream cadence is slower than 6 weeks, this just prolongs the suffering; downstream will still ultimately fall behind.

The point is rather that if you don't offer an N-2 -> N update, the gap can only increase monotonically. If you offer an N-2 -> N update, you can decrease the gap, although that does not guarantee a complete catch-up. I agree that if the downstream cadence is far off Knative's 6 weeks, it might not help very much. But if it is close to 6 weeks and you just have some unplanned 'slips' for some releases, an N-2 -> N update helps to eventually catch up (maybe within 2 or 3 releases, if not directly).

It's all about having the possibility to catch up supported by upstream (as mentioned, downstream could always go through several update steps incrementally if there is a clear signal for when N-2 -> N-1 is finished and N-1 -> N can start, as discussed above).
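The arithmetic behind this argument can be checked with a toy model. It assumes upstream cuts one minor release every 6 weeks, both projects start at the same version in week 0, and each downstream release may move forward by at most `max_skew` minors. The function name and the release schedules below are made up for illustration.

```python
def gaps_after_each_release(release_weeks, max_skew, upstream_cadence=6):
    """Gap (in minor versions) between upstream and downstream after each downstream release.

    release_weeks: weeks at which downstream ships, in increasing order.
    """
    downstream = 0
    gaps = []
    for week in release_weeks:
        upstream = week // upstream_cadence  # upstream minors released so far
        downstream = min(downstream + max_skew, upstream)
        gaps.append(upstream - downstream)
    return gaps


# One big slip (week 6 -> week 30), then back on a 6-week rhythm.
slipped = [6, 30, 36, 42]
print(gaps_after_each_release(slipped, max_skew=1))  # [0, 3, 3, 3]
print(gaps_after_each_release(slipped, max_skew=2))  # [0, 2, 1, 0]

# Downstream permanently on an 18-week cadence.
slow = [18, 36, 54]
print(gaps_after_each_release(slow, max_skew=2))  # [1, 2, 3]
```

In this model N-1 -> N never shrinks the gap after a slip, N-2 -> N closes it within three releases, and a downstream that is permanently three times slower falls behind under either policy, which matches both sides of the discussion.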

rhuss avatar Jul 20 '20 07:07 rhuss

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Oct 19 '20 01:10 github-actions[bot]

/remove-lifecycle stale

rhuss avatar Nov 17 '20 23:11 rhuss

I think we should take a look at migration software like Flyway or Liquibase, at least for inspiration on how this might be organized.

My quick and dirty idea is to:

  • have a knative-migrator component that would orchestrate migrations,
  • have each Knative component (serving, eventing, kafka, rabbitmq ...) listen for the migrator state (UPGRADING, STABLE, DOWNGRADING) and, if necessary, adjust its operation in some Safe :tm: way,
  • have knative-migrator instruct each component to upgrade or downgrade to the desired version,
  • have knative-migrator wait until all components have migrated successfully before going to the next release.
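The idea above could be prototyped along these lines. Everything here is hypothetical: there is no `knative-migrator`, and the `Migrator` and `Component` classes, their methods, and the integer versions are assumptions made only to show the control flow (broadcast state, step one release at a time, wait for every component).

```python
from enum import Enum


class MigratorState(Enum):
    STABLE = "STABLE"
    UPGRADING = "UPGRADING"
    DOWNGRADING = "DOWNGRADING"


class Component:
    """Hypothetical component (e.g. serving, eventing) that follows the migrator."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.safe_mode = False

    def on_state(self, state):
        # Adjust operation while a migration is in flight.
        self.safe_mode = state is not MigratorState.STABLE

    def migrate_to(self, version):
        self.version = version
        return True  # a real component would report completion asynchronously


class Migrator:
    def __init__(self, components):
        self.components = components
        self.state = MigratorState.STABLE

    def _broadcast(self, state):
        self.state = state
        for component in self.components:
            component.on_state(state)

    def migrate(self, target):
        """Move all components to target, one release at a time."""
        current = self.components[0].version  # assumes components are in lockstep
        if target == current:
            return
        step = 1 if target > current else -1
        self._broadcast(MigratorState.UPGRADING if step > 0 else MigratorState.DOWNGRADING)
        for version in range(current + step, target + step, step):
            # Wait for every component before moving on to the next release.
            if not all(c.migrate_to(version) for c in self.components):
                raise RuntimeError(f"migration to {version} failed")
        self._broadcast(MigratorState.STABLE)
```

On failure this sketch deliberately leaves the state as UPGRADING or DOWNGRADING, so components stay in their safe mode until someone intervenes; whether that is the right behaviour is an open design question.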

cardil avatar Feb 05 '21 12:02 cardil

BTW, this issue should be moved to a more general place, as it's obviously not related to Serving only.

cardil avatar Feb 05 '21 12:02 cardil

BTW, this issue should be moved to a more general place, as it's obviously not related to Serving only.

I think it's still a good fit here, since we should pilot this with a single project first. The task at hand is already large enough for Serving alone, so there's no need to promote it as long as we haven't found somebody to work on it even in a Serving-only context.

rhuss avatar Feb 15 '21 08:02 rhuss

@rhuss Right, but maybe it would be faster to do it in Eventing first?

cardil avatar Feb 15 '21 11:02 cardil

Should this be handled in the operations WG?

/kind process /kind enhancement /kind proposal

/triage needs-user-input

evankanderson avatar Mar 22 '21 03:03 evankanderson

It kind of touches multiple areas. I'm not sure if the operator alone can perform everything, but I agree that operations is a good place for driving this discussion. Starting a feature track that defines the problem space and identifies challenges would be helpful, too. Unfortunately, I can't do much to drive this because of time constraints, but I'm happy to help and review things.

rhuss avatar Mar 22 '21 08:03 rhuss

/remove-triage needs-user-input /triage accepted /help-wanted

evankanderson avatar Jun 23 '21 20:06 evankanderson

/help

evankanderson avatar Jun 23 '21 20:06 evankanderson

@evankanderson: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

knative-prow-robot avatar Jun 23 '21 20:06 knative-prow-robot

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Sep 22 '21 01:09 github-actions[bot]

/remove-lifecycle stale

Just out of curiosity, is there any work planned for this feature? Or is it not a thing anymore?

rhuss avatar Sep 23 '21 13:09 rhuss

The Operations WG has a planned feature to support orchestrated upgrades across minor versions in https://github.com/knative/operator/issues/744

markusthoemmes avatar Sep 23 '21 13:09 markusthoemmes

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Dec 23 '21 01:12 github-actions[bot]

/remove-lifecycle stale

rhuss avatar Dec 23 '21 09:12 rhuss

Since #744 was closed, I'm closing this issue as well. Orchestration of the upgrade is implemented in the kn-plugin-operator.

houshengbo avatar Aug 30 '22 02:08 houshengbo