vitess RFC: Proposed changes to GA release process

Summary

We currently have a process where we publish one or more release candidates before doing a GA release. We do a code freeze before cutting the release branch, and again before doing the GA release. We have had a couple of GA releases where we had to do patch releases almost immediately because of critical bugs that broke one or more pieces of major functionality. The most recent example of this is #15419. At that time, @L3o-pold pointed out that the GA release was not identical to a previous release candidate. This is an artifact of the current release process. We publish an RC1, and then continue to merge bug fixes onto the release branch whether or not they were reported specifically against RC1. Unless RC1 is significantly broken, we don't typically do an RC2; we go straight to GA. What goes into the GA is just the latest state of the release branch; it is not expected to be identical to a previously published release candidate.

Shortcomings

Bugs can be introduced into the release branch after an RC which are not caught until GA
The volume of bug fixes on the release branch between RC and GA is quite high
There is a scramble on the last few days to get bug fixes "under the wire"

Proposal

Let's assume we have planned a GA release for date T.

T minus 3 weeks: Feature freeze. No new enhancements for this release cycle can be merged after this date, only bug fixes. We will cut a release branch at this date. This is the same as what we do today.
T minus 2 weeks: Publish RC1. This gives ~1 week to push bug fixes to the release branch before doing an RC.
Until T minus 1 week: We accept bug reports against RC1 and evaluate whether they should be fixed. We'll accumulate bug fixes before publishing another release candidate.
T minus 1 week: Publish RC2 if necessary
Until T: Only critical bug reports will be evaluated to determine whether the release should be pushed back.
T: GA Release which will be exactly the same code as the latest RC.

Exceptions: If there is a critical bug report after RC2, we MAY need to push the GA release out by 1 week and do an RC3 instead.
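A minimal sketch of the timeline above, computing the milestone dates from a planned GA date T. The function name, the dictionary keys, and the `slipped` flag are hypothetical, and placing RC3 on the original date T is one reading of "do an RC3 instead", not something the proposal states explicitly.

```python
from datetime import date, timedelta


def release_schedule(ga_date: date, slipped: bool = False) -> dict[str, date]:
    """Return milestone dates for a GA release planned on ga_date (T).

    slipped=True models the exception: a critical bug after RC2 pushes
    GA out by one week, and (assumption) RC3 goes out on the original T.
    """
    schedule = {
        "feature_freeze": ga_date - timedelta(weeks=3),  # T minus 3 weeks
        "rc1": ga_date - timedelta(weeks=2),             # T minus 2 weeks
        "rc2_if_necessary": ga_date - timedelta(weeks=1),  # T minus 1 week
    }
    if slipped:
        schedule["rc3"] = ga_date
        schedule["ga"] = ga_date + timedelta(weeks=1)
    else:
        schedule["ga"] = ga_date
    return schedule


if __name__ == "__main__":
    # Arbitrary example date, not an actual Vitess release date.
    for name, day in release_schedule(date(2024, 1, 22)).items():
        print(f"{name}: {day.isoformat()}")
```

Keeping every milestone relative to T means a slip only changes one input, which matches the goal of having a single planned date to coordinate around.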

Notes:

Absolutely no enhancements should be added to the release branch. This includes performance improvements, except fixes for regressions from the previous release that were found using arewefastyet.

References

Current release process Release schedule, bug fix releases, support lifecycle etc.

EDIT Apr 17, 2023: Incorporate feedback from comments. EDIT2 Apr 18, 2023: Clarified GA Release relationship to RC.

Mar 27 '24 21:03 deepthi

Thanks for this @deepthi, you perfectly summarise the "issue" and your proposal matches what I was thinking.

Bugs can be introduced into the release branch after an RC which are not caught until GA

it's the main issue IMO

If we fix a bug, we publish another release candidate

exactly 👍

Mar 27 '24 21:03 L3o-pold

The latest release candidate commit is used to make the GA release.

Just to be clear here, the exact same commit cannot be used, as we have to push more commits during the GA release process. However, the GA release PR will be based on an RC commit.

Mar 27 '24 23:03 frouioui

This proposal sounds good to me; however, I have concerns about the two-week period before the RC1 release. If we want to be able to do bug fixes, I think we might as well release RC1 early (at the beginning of the two-week period) and leave enough time for everyone in the community to test it. Unless people are running their systems on Vitess' main, we won't get many bug reports from the community.

That would also allow us to not block development on main.

Mar 27 '24 23:03 frouioui

Once RC1 is published, we accept bug reports against that and evaluate whether they should be fixed. If we fix a bug, we publish another release candidate.

Would we do this immediately after every bug fix, or should we accumulate bug fixes for some time before doing the next RC?

Mar 28 '24 15:03 systay

Would we do this immediately after every bug fix, or should we accumulate bug fixes for some time before doing the next RC?

I agree with @systay. However, doing an RC takes some time and some planning; it would be time-consuming for the release team to do a new RC after each bug fix, or even every other day. I think we need some sort of cadence/schedule, for example one or two RCs per week (if needed, i.e. when there are new bug fixes). That way the release team knows what to expect during the period between the first RC and the GA release, and a day before every scheduled RC they can evaluate whether a new RC is really needed.

Mar 28 '24 20:03 frouioui

I would suggest this process:

Three weeks out: Branch off the release branch from main and switch it to bug-fix-only mode. Normal development continues on the main branch.
Two weeks out: Cut RC1 and publish it.
Bug fixes: If bugs serious enough for a new RC are identified, allocate 3-5 days for the team to fix, merge, and cut a new RC.
Stability check: Continue cutting new RCs until no critical bugs are found for a full week.
Release GA: Once stable, release the GA using the same SHA as the last RC.
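The stability check in the steps above can be sketched as a small predicate. This is an illustration only: the function name, the one-week constant, and the idea of tracking critical bug reports as a list of dates are assumptions, not existing release tooling.

```python
from datetime import date, timedelta

# "No critical bugs are found for a full week" after the latest RC.
QUIET_PERIOD = timedelta(weeks=1)


def ga_ready(last_rc: date, critical_bug_reports: list[date], today: date) -> bool:
    """Return True if GA can be released from the latest RC.

    GA is ready only when a full quiet period has elapsed since the
    latest RC and no critical bug was reported on or after that RC's date.
    Bugs reported before the latest RC are assumed fixed in it.
    """
    bugs_since_rc = [d for d in critical_bug_reports if d >= last_rc]
    return not bugs_since_rc and (today - last_rc) >= QUIET_PERIOD
```

A critical report after the latest RC makes this return False until a new RC is cut, at which point `last_rc` moves forward and the week starts over.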

Apr 15 '24 05:04 systay

We should define the minimum time gap between the last RC and the GA release. This might risk postponing the release, but it would keep the GA release more stable.

Apr 15 '24 08:04 harshit-gangal

Continue cutting new RCs until no critical bugs are found for a full week.

@harshit-gangal I think, given what @systay said, we would want to wait a full week before proceeding with the GA. I think it should be fine to have flexible release dates, but that will mean more "work" remembering to notify the different parties so they can cross-post our blog post.

Apr 16 '24 12:04 frouioui

Agree with @frouioui. We coordinate the release blog post with two other parties (CNCF and PlanetScale), so we need to have a planned date for the release. It's also important to have that for the community to make plans around releases.

Apr 17 '24 05:04 deepthi

Agree with @frouioui. We coordinate the release blog post with two other parties (CNCF and PlanetScale), so we need to have a planned date for the release. It's also important to have that for the community to make plans around releases.

I don't quite follow what this means for the suggested process. Are you saying we will do the release even if we find bugs?

I think everyone agrees we want a planned release date, the question is how to handle situations where this is hard to achieve. How do we achieve both no known bad bugs, and hit the release date?

Apr 17 '24 08:04 systay

I think everyone agrees we want a planned release date, the question is how to handle situations where this is hard to achieve. How do we achieve both no known bad bugs, and hit the release date?

We can't. The new description addresses this question, and it's consistent with your suggestion.

Apr 17 '24 08:04 deepthi

T: GA Release which is essentially the same as either RC1 or RC2.

For me it SHOULD be the same as the latest RC

Apr 17 '24 12:04 L3o-pold

T: GA Release which is essentially the same as either RC1 or RC2.

For me it SHOULD be the same as the latest RC

Thank you. That is what I was trying to convey, so I've edited that line to make it clearer. The only reason I initially said "essentially" is because we do a release commit changing the displayed version name from something like 20.0.0-rc1 to just 20.0.0.
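The version-name change described here amounts to dropping the release-candidate suffix. A minimal sketch, assuming version strings of the form `20.0.0-rc1`; the helper name and the regular expression are hypothetical and do not reflect how the actual release commit is produced.

```python
import re

# Matches a trailing "-rc<N>" suffix such as "-rc1" or "-RC2".
_RC_SUFFIX = re.compile(r"-rc\d+$", re.IGNORECASE)


def ga_version(rc_version: str) -> str:
    """Strip a trailing release-candidate suffix from a version string."""
    return _RC_SUFFIX.sub("", rc_version)
```

A version string without an RC suffix is returned unchanged, so applying the helper twice is harmless.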

Apr 18 '24 03:04 deepthi

T minus 3 weeks: Feature freeze. No new enhancements can be merged after this date, only bug fixes. We will cut a release branch at this date. This is the same as what we do today.

It is unclear to me if the feature freeze applies to both branches (main and the new release branch) or only to the release branch.

Apr 18 '24 15:04 frouioui

T minus 3 weeks: Feature freeze. No new enhancements can be merged after this date, only bug fixes. We will cut a release branch at this date. This is the same as what we do today.

It is unclear to me if the feature freeze applies to both branches (main and the new release branch) or only to the release branch.

Added text to make it clearer.

May 02 '24 22:05 deepthi

Talking this over with @frouioui, we realized that this approach would leave the release branch without backports for the entire duration, except for critical ones. Once the release is done and we unblock the release branch, handling the accumulated backports could become a hassle.

What if we cut the release branch (release-20.0 for the next release), and immediately fork an RC branch (release-20.0-RC) from it? This RC branch will be used for V20-RC1, V20-RC2, and V20-GA releases.

The normal release branch (release-20.0) would continue to accept bugfixes. Any critical fixes can be moved from the release branch to the RC branch.

After the release, we can delete the RC branch and use the release branch for backports and patch releases.

May 22 '24 15:05 systay

Here’s an example:

Fix1 is not a critical fix, so it's built on main and then backported to release-20.0, but not to release-20.0-RC since it's not critical.
After a week, we’re ready to cut the release, so we build RC1 from the release-20.0-RC branch.
Fix2 is a critical fix, so it's built on main and then backported to both release-20.0 and release-20.0-RC.
Since we merged something into the RC branch, after some time we will cut a new RC2 from the release-20.0-RC branch.
After waiting enough time, we can cut the GA release from the RC branch.
Fix3, another non-critical bugfix, can be merged to the release branch without hindering the release process.

main          -------Fix1-----------Fix2-------Fix3-----
               \      \              \          \
release-20      \------x--------------x----------x-----
                 \                     \
release-20-RC      \--------------------x--------------
                             \                    \    \
                            RC1                   RC2  GA
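The routing rule in this example can be written down as a small function. This is a sketch under assumptions: the branch names follow the example above, while the function names and the `rc_branch_open` flag are hypothetical.

```python
RELEASE_BRANCH = "release-20.0"
RC_BRANCH = "release-20.0-RC"


def backport_targets(is_critical: bool, rc_branch_open: bool = True) -> list[str]:
    """Return the branches a bug fix lands on, in merge order.

    Every fix is built on main and backported to the release branch.
    Only critical fixes also go to the RC branch, and only while the
    RC branch is still in use (before the GA release is done).
    """
    targets = ["main", RELEASE_BRANCH]
    if is_critical and rc_branch_open:
        targets.append(RC_BRANCH)
    return targets


def needs_new_rc(fixes_are_critical: list[bool]) -> bool:
    """A new RC is needed once anything has been merged into the RC branch."""
    return any(fixes_are_critical)
```

With the fixes from the example, Fix1 and Fix3 map to `main` and `release-20.0` only, while Fix2 also reaches `release-20.0-RC` and therefore triggers RC2.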

May 22 '24 15:05 systay

I think this makes a lot of sense; it will ease the job of both the release team and the other contributors doing backports. The RC release branch can remain fully frozen until the end of the GA release, leaving the release team fully responsible for what gets merged and what does not. Meanwhile, the normal release branch just does business as usual, removing the extra work of merging everything once the code freeze (after GA) is over.

May 22 '24 16:05 frouioui

Only thing I'd ask you to re-consider is whether the RC branch should be completely deleted after the GA release. The tag would have been applied on a commit on that branch, and it will be important to retain that for people attempting to build their own binaries from the release tag.

May 22 '24 16:05 deepthi

It would be fine to keep the branch around. We could keep it until the release reaches EOL, to avoid having dozens of branches.

But as far as I know, people can always build their own binaries from the release tag whether or not the commit belongs to a branch. For instance, v15.0.0 does not belong to any branch, but locally you can still check out the tag (git checkout v15.0.0) and build from there. You can also cherry-pick the tagged commit and all the commits before it.

May 22 '24 16:05 frouioui

@systay That makes sense to me too.

May 22 '24 18:05 shlomi-noach
