RFC: Semantic versioning based release process

Open oyvindio opened this issue 2 years ago • 9 comments

Background

The current "continuous delivery"/rolling release like workflow for releasing fiaas-deploy-daemon works by automatically having successful CI builds of the master branch update the latest tag, and have the stable tag updated by a separate manual promotion stage in the CI tool. The tags themselves are json files contained within the fiaas/releases repository. Skipper can then be configured to deploy the version of fiaas-deploy-daemon each of these tags point to. Since https://github.com/fiaas/skipper/pull/119, skipper also supports specifying the version of fiaas-deploy-daemon as a configuration flag in its helm chart.

This workflow originates from when FIAAS was still an internal project, and was intended to make it easy to roll out new releases of fiaas-deploy-daemon in several clusters. In an open source context, however, I think there are some drawbacks to this model:

  • The latest tag should be kept functional, but every pull request merge is effectively making a release. Because of this it is difficult to test a new release before creating it, beyond what unit/e2e tests can cover, since a container image is not produced until merging to master.
  • The current release model lacks a clear cut-off point. This makes it more difficult to release backwards incompatible changes, such as removal of deprecated features, or support for old Kubernetes versions. This results in this type of change taking longer to roll out.
  • There is an implicit social contract to check off in the #fiaas Slack channel before promoting a version to the stable tag. This can mean longer lead times before changes are available to roll out in all clusters as changes may require testing by other operators before promoting to the stable tag.

Goals

  • Make it easier to release changes that introduce some level of backward incompatibility, such as removing support for older Kubernetes versions.
  • Reduce lead time for merging pull requests
    • Currently merging to master means updating the latest tag, which can affect other FIAAS operators
  • Reduce lead time from merging pull request to having changes in production
    • If using the stable tag in production (as is recommended), it can take some time before that tag can be promoted, because other operators might need time to test (see below)
  • Reduce need to synchronize/check off with other operators when shipping a new version of fiaas-deploy-daemon.

Suggested Process

The release model for fiaas-deploy-daemon should move towards a more typical release process of creating tagged releases, where semantic versioning is used to indicate backwards incompatible changes. For a consistent developer experience, creating a new release can be done the same way as in k8s.

Release Process

  • CI builds of the master branch push a versioned container image, and also update the development (or something similar indicating this is not intended for production usage) container image tag.
    • This will make it easier for FIAAS operators to test unreleased changes in staging environments.
  • To create a versioned release, tag the commit to release with an annotated git tag where the release version is the name of the tag.
    • The tag CI build will:
      • Push a container image tag for the release version
      • Create a Github release for the tag and include a changelog
  • To patch an older versioned release, when the latest release has backwards incompatible changes not wanted in the patch release, start a branch at the release tag, commit changes or backport commits. Push another tag which increments the patch version to create a patch release.
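
As a rough illustration of the build rules above, the sketch below maps a git ref to the container image tags a CI build might push. The function name `image_tags`, the `master-<short sha>` naming of the versioned master image, and the literal `development` tag are assumptions made for this example; the proposal does not fix any of them.

```python
from typing import List

TAG_PREFIX = "refs/tags/"
MASTER_REF = "refs/heads/master"


def image_tags(git_ref: str, commit_sha: str) -> List[str]:
    """Return the container image tags a CI build would push for a git ref.

    Hypothetical helper: the tag naming scheme is assumed, not taken
    from the fiaas-deploy-daemon CI configuration.
    """
    if git_ref.startswith(TAG_PREFIX):
        # Tag build: the release version is the name of the git tag.
        return [git_ref[len(TAG_PREFIX):]]
    if git_ref == MASTER_REF:
        # Master build: a uniquely versioned image plus the moving
        # "development" tag for testing unreleased changes.
        return ["master-" + commit_sha[:7], "development"]
    # Other branches push nothing in this sketch.
    return []
```

A patch release made from a branch started at an older release tag goes through the same tag build path, since only the pushed tag triggers the release.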

Versioning

fiaas-deploy-daemon will use semantic versioning:

  • Incrementing the major version indicates a backwards incompatible change, such as removing a feature or removing support for a Kubernetes version
  • Incrementing the minor version indicates a backwards compatible change, such as adding a new feature
  • Incrementing the patch version indicates a (backwards compatible) bug fix.
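
The three rules above can be sketched as a small version bump helper. This is a minimal example, assuming plain `MAJOR.MINOR.PATCH` versions with an optional leading `v`; pre-release and build metadata are left out, and the names `Version`, `parse` and `bump` as well as the change kinds `breaking`, `feature` and `fix` are hypothetical.

```python
import re
from typing import NamedTuple

# Simplified pattern: MAJOR.MINOR.PATCH with an optional leading "v".
# Pre-release and build metadata from the full semver spec are ignored here.
_VERSION_RE = re.compile(r"^v?(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"v{self.major}.{self.minor}.{self.patch}"


def parse(text: str) -> Version:
    """Parse a version string such as "v1.2.3" or "1.2.3"."""
    match = _VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return Version(*(int(part) for part in match.groups()))


def bump(version: Version, change: str) -> Version:
    """Return the next version for a change kind (assumed names)."""
    if change == "breaking":
        # Backwards incompatible: bump major, reset minor and patch.
        return Version(version.major + 1, 0, 0)
    if change == "feature":
        # Backwards compatible addition: bump minor, reset patch.
        return Version(version.major, version.minor + 1, 0)
    if change == "fix":
        # Backwards compatible bug fix: bump patch only.
        return Version(version.major, version.minor, version.patch + 1)
    raise ValueError(f"unknown change kind: {change!r}")
```

For example, `bump(parse("v1.2.3"), "breaking")` gives `v2.0.0`, while a `fix` on the same version gives `v1.2.4`.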

Skipper

There are a few approaches for how we can handle Skipper when transitioning to a new release model for fiaas-deploy-daemon.

Update Skipper to support the new release model

Skipper could be updated to support the new release model. It would require some modifications:

  • CI builds could continue to update the fiaas/releases repository with release metadata for each release that skipper could use to know what releases are available.
  • Skipper could use the Github API to discover releases. This wouldn't work for deploying unreleased changes from container images built from master builds, which would be a significant drawback.

Deprecate Skipper

From my perspective we don't get a lot of value from using Skipper to deploy fiaas-deploy-daemon. As such one option to simplify the release process changes, as well as the deployment process itself, could be to deprecate skipper together with the change in release model, and switch to providing a helm chart (or similar) for deploying fiaas-deploy-daemon directly.

This approach would have some benefits:

  • Simplifies workflow for updating fiaas-deploy-daemon provided operators have a mechanism to install helm charts in multiple namespaces.
  • fiaas-deploy-daemon would no longer deploy itself. This makes rolling out new changes safer, and makes it easier to recover from failure modes where the running version of fiaas-deploy-daemon is not able to deploy applications.
  • The fiaas-deploy-daemon-bootstrap entrypoint and associated code paths in fiaas-deploy-daemon could be deprecated and removed.
  • Simplifies implementing the new release model, since skipper itself doesn't have to be modified to support it.

Implementing The Suggested Process

Assuming we move forward with creating a helm chart for deploying fiaas-deploy-daemon and deprecate skipper, the transition to the suggested release process can look like the following.

  • Create a helm chart for installing fiaas-deploy-daemon
  • Implement suggested release process for fiaas-deploy-daemon via tag CI builds
  • Create the first release of fiaas-deploy-daemon via the new model
  • Update the latest (and eventually stable) tags to point to the container image of the new release
    • Latest and stable tags will no longer be updated from this point
    • Operators still using skipper can hard-code the fiaas-deploy-daemon container image reference via the skipper helm chart if updating fiaas-deploy-daemon is necessary before switching to deploying it via the helm chart is possible.
  • Update documentation for new release process, mark skipper repo as deprecated.
  • Deprecate and set a date for removal of the bootstrap endpoint in fiaas-deploy-daemon.
    • Removing this feature will make it impossible to deploy fiaas-deploy-daemon via skipper.
  • Set a date for archiving the skipper repository; archive it on the specified date.

oyvindio avatar Nov 17 '21 08:11 oyvindio

While I see how the suggested process will improve the three points mentioned at the start, I'm not sure this is the best solution. In my view, the suggested process will introduce other problems that aren't properly accounted for, and which I believe might become bigger problems in the long run.

No more continuous deploy of fiaas

This is probably the biggest issue for me, both in principle and in practice.

We built fiaas because we believe that continuous deploy is the best and safest way to deploy software in a fast moving world built on kubernetes and container orchestration. When you believe that, it's hard to see how not doing CD is the right thing for our own software.

In practice, this change will also mean that every operator that uses fiaas needs to do much more work to keep updated. In many cases today, you don't actually have to do anything to keep up to date with the latest changes/features in fiaas. You might need to keep your cluster updated, and get involved when larger issues are discussed, but minor fixes and improvements can be rolled out without you needing to spend any effort. When moving to a model based on strictly versioned helm charts, every bugfix or improvement that is to be deployed to your cluster needs to be manually handled.

Loss of momentum

While the current situation requires coordination across organisations to get stable moved forward, this has a side effect of actually making organisations engage in the "head" development of fiaas. When moving to a model with release branches and back-porting, I think there is a risk of organisations "settling" for using a release branch with the occasional back-port until there is something they really need that can't be back-ported. This means that for every new feature at the "head" of the development tree (master branch), fewer people/organisations will be involved in designing and implementing it, leading to features that might be a bad fit for other organisations/use cases.



I think there is a description of the original idea behind Skipper somewhere, but I can't find it. I'm guessing it's either lost, or hidden in some Schibsted-internal tooling :stuck_out_tongue:. What we have now is only part of that idea, implemented more like an MVP than a fully delivered concept. My gut feeling is that if we had implemented the original concept fully then at least some of the problems mentioned at the start here would have been less prominent.

In short, the idea was that in addition to latest and stable we had additional org-specific latest and stable channels for the "leading" orgs (however you want to define that). When FINN wanted to test latest, they would promote it to finn:latest, and deploy that a suitable place. When they were satisfied they would promote it to finn:stable and deploy that to production. When other leaders promoted to their own stable channels (adevinta:stable and lbc:stable say), the stable channel would be moved forward automatically according to some sensible algorithm (last common build or something like that). That way smaller orgs could "piggyback" on the efforts of the larger orgs and get a stable that kept moving forward, while the leading orgs would be able to move forwards at their own pace.
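
One possible reading of "last common build" is sketched below: the shared stable channel follows the newest build that every leading org has promoted to, or already moved past, on its own stable channel. This is only an illustration of the idea, not how Skipper works; the function name `common_stable`, the build identifiers and the oldest-first ordering are assumptions.

```python
from typing import Dict, List, Optional


def common_stable(builds: List[str], org_stable: Dict[str, str]) -> Optional[str]:
    """Return the newest build all leading orgs have reached or passed.

    builds: build identifiers ordered oldest first (assumed ordering).
    org_stable: the build each leading org's stable channel points to.
    """
    if not org_stable:
        return None
    position = {build: index for index, build in enumerate(builds)}
    # The org that is furthest behind bounds how far the shared
    # stable channel can move forward.
    oldest = min(position[build] for build in org_stable.values())
    return builds[oldest]
```

With builds `b1` to `b4` and the three orgs on `b3`, `b2` and `b4`, the shared channel would point to `b2`, and it moves forward as soon as the org on `b2` promotes a newer build.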

It would require work to improve how we treat and work with channels today, and it wouldn't solve all the problems. In particular, you would not solve the problem of deploying backwards incompatible changes, but I feel that is a property of doing CD you have to find a way to live with. In the rare cases where the change A->B is incompatible, the proper solution might involve finding another path from A to B, where each step is compatible with the previous.

Another point I'd like to make is that it would be possible to improve the test suite to a point where you feel confident that when the tests pass, this version is at least good enough for latest. Over at NAIS we have so much confidence in our tests that we always deploy to all clusters (dev and prod) if the tests are green.

mortenlj avatar Nov 17 '21 10:11 mortenlj

Yes, the suggested model is not perfect and probably has several drawbacks. I think it might still be an improvement on the current model.

No more continuous deploy of fiaas

I think this might already be the situation: I would not say that we deploy fiaas-deploy-daemon continuously, but rather update it manually via skipper. It might be interesting to hear if this is how other operators work too, or if the auto update feature is used extensively.

Continuous deployment is no doubt an excellent way to work assuming one has good monitoring and end to end control of a system, for example within a single organisation. FIAAS is built to support that workflow. I don't think it is always the best model for everything though. For FIAAS itself for example, with multiple operators and where different operators may also use a different set of features, I don't think it is necessarily a good model based on how it is used in practice. Assuming that most people are updating fiaas-deploy-daemon manually already, I think that a model that uses versioned releases would be more suitable, since it among other things makes it clearer what has changed.

Loss of momentum

There is the possibility that operators may run older releases for some time, but I think that in general the upgrade path would be to move on to the most recent release and not to patch an old release. I see the ability to patch an older release as an exception, not the rule. It could for example be an option for cases where it is necessary to support an old version of Kubernetes for some time (i.e. temporarily), when that version is no longer supported in the most recent release.

In general I think that the release model suggested above could increase momentum, because it makes it easier to e.g. remove deprecated features and stop supporting older Kubernetes versions. These are things that simplify the software and make it easier to change.

org-specific latest and stable channels

I think a setup with operator specific channels might be an improvement on the current model in some aspects, but a setup like that could have other drawbacks: It seems to me more complex, and might lead to operators maintaining their own labels and only using those, which might make the stable channel less "stable". I like the simplicity of a versioned release model, and I think improving how backwards incompatible changes are handled is a significant benefit of the suggested model.

oyvindio avatar Nov 17 '21 16:11 oyvindio

I support this proposal.

There are pros and cons, as with any change, and I understand most of Morten's points. But I think it is a shift for the better overall. In short:

  • it aligns better with the k8s community and how other operators are released. By aligning, we also gain onboarding speed to the project, as not many things need to be explained in this respect (today, devs need to understand the specificities of how the project is released and why).
  • it moves the release process (CD) to be a concern of the operator, which I think is better in a multi-operator project like FIAAS.

I think this might already be the situation: I would not say that we deploy fiaas-deploy-daemon continuously, but rather update it manually via skipper. It might be interesting to hear if this is how other operators work too, or if the auto update feature is used extensively.

Adevinta uses this feature extensively as we do believe CD is the best path forward. However, not having skipper is actually a benefit for us as we can leverage our already existing CD process and align with other helm chart deployments.

When you believe that, it's hard to see how not doing CD is the right thing for our own software.

From my view, it's not that we are not doing CD anymore (or that we don't want to), but that CD is moved to an operator concern (instead of a built-in FIAAS feature).

xavileon avatar Nov 18 '21 09:11 xavileon

In general I think that the release model suggested above could increase momentum, because it makes it easier to e.g. remove deprecated features and stop supporting older Kubernetes versions. These are things that simplify the software and make it easier to change.

Wholeheartedly agree with this. I see there's a lot of concern about updating or adding features to the current version out of fear of introducing breaking changes to existing users. I guess a (new) tag (current?) that tracks the latest version should be maintained for users that still want to stick to the bleeding edge.

henrik242 avatar Nov 18 '21 09:11 henrik242

it moves the release process (CD) to be a concern of the operator, which I think is better in a multi-operator project like FIAAS.

This is a good point and I think it summarizes well what I would like to improve with this proposal in terms of process. 👍

oyvindio avatar Nov 22 '21 08:11 oyvindio

Excited about this RFC and the discussion here. Thanks for spending the time on crafting this, Øyvind!

I guess I'm just adding my pebbles to the pond with my comments below. 🙂

In practice, this change will also mean that every operator that uses fiaas needs to do much more work to keep updated.

Yes, and no. Releasing new features/deprecations in FIAAS until now - correct me if I'm wrong - required a lot of work from maintainers and operators because of coordination. That work will now be gone, freeing up resources and making prioritizing easier.

I think there is a risk of organisations "settling" for using a release branch with the occasional back-port until there is something they really need that can't be back-ported.

That is a valid observation Morten, but is it actually something we need to care about?

xamebax avatar Nov 29 '21 13:11 xamebax

That is a valid observation Morten, but is it actually something we need to care about?

I'm not sure tbh. :slightly_smiling_face: I believe it will happen, because that's just how things work (either it's a perceived security/stability thing, or just "haven't got time for this" thing). I'm less certain about it being a problem, but ideally users would be involved in discussions about new features that they might use. If they are staying behind on an old version, that discussion is less likely to take place (because new features aren't interesting for them until they get around to upgrading), which again means we might be designing features that won't match their needs when they get there.

mortenlj avatar Dec 01 '21 11:12 mortenlj

I support this as well, thanks for the writeup Oyvind. 👍

From my view, it's not that we are not doing CD anymore (or that we don't want to), but that CD is moved to an operator concern (instead of a built-in FIAAS feature).

Just to add to this: At a past point in time at Schibsted we were operating FIAAS across multiple multi-tenant clusters separate from cluster operators. Operation of FIAAS on those clusters was effectively delegated to us. We were actively working on several features to support the needs at the time, which required us to shorten the feedback loop and get features to users outside the rhythm of regular cluster maintenance, which we had less to do with. At the time there were also new tenants being added that required quickly bootstrapping FIAAS in their namespaces as part of an automated onboarding process. With hundreds of instances of fiaas-deploy-daemon across multiple clusters, pushing updates was a pain point. Introducing Skipper and release channels for FIAAS helped us achieve what we needed to support the use cases above and be able to operate the instances at scale. Since a couple of years back we have made changes to how we are managing FIAAS across our clusters. We have moved to managing FIAAS as part of normal cluster operations, so the need for Skipper is not there in the same sense. Ideally we want to be specific about versions and be in control when we choose to upgrade.

At this point I think it makes sense to move to the suggested model, so that we are no longer limited by backwards compatibility and support for deprecated Kubernetes versions, can make contributions faster, and no longer need to synchronise with other operators before cutting a release.

birgirst avatar Dec 20 '21 08:12 birgirst

Thanks for the feedback, and thanks for also including the historical perspective, Birgir. 👍

It has been some time now and there have been a few comments on this suggestion. As I read all the feedback, there are some concerns, but it seems to me that most of the feedback supports implementing the proposed release model. Based on that I would like to start the technical implementation of this proposal. Expect one or more pull requests as soon as there is some available capacity to work on this.

oyvindio avatar Jan 14 '22 15:01 oyvindio

I'm closing this as I've merged #180 which implements release tooling to create releases based on semantic versioned git tags, and created release v1.0.0.

If you need to create a release, take a look at the "Creating a release" part of the developer documentation.

oyvindio avatar Sep 26 '22 11:09 oyvindio