
Assess version markers used in release-branch job configs

justaugustus opened this issue • 38 comments

Version markers are text files stored in the root of various GCS buckets:

  • CI: https://gcsweb.k8s.io/gcs/kubernetes-release-dev/ci
  • Release: https://gcsweb.k8s.io/gcs/kubernetes-release/release

They represent the results of different types of Kubernetes build jobs and act as a sort of public API for accessing builds. They are leveraged in extraction strategies for e2e tests, in release engineering tooling, and in user-created scripts.
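
As a rough illustration of how consumers use these markers, here is a minimal Python sketch (not official tooling) that resolves a marker to a build version and constructs an artifact URL from it. The bucket names and paths mirror the links above, but buckets have been renamed over time, so verify them against the current docs before relying on this.

```python
# Minimal sketch: resolve a version marker (a one-line text file in a GCS
# bucket) to a build version. Bucket names/paths are illustrative.
import urllib.request

CI_BUCKET = "kubernetes-release-dev"   # CI builds (see gcsweb link above)
RELEASE_BUCKET = "kubernetes-release"  # official releases

def resolve_marker(bucket: str, marker: str) -> str:
    """Read a version marker from a bucket's public HTTPS endpoint."""
    url = f"https://storage.googleapis.com/{bucket}/{marker}"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode().strip()

if __name__ == "__main__":
    ci_version = resolve_marker(CI_BUCKET, "ci/latest.txt")
    stable = resolve_marker(RELEASE_BUCKET, "release/stable.txt")
    print("CI latest:     ", ci_version)  # e.g. v1.20.0-alpha.0.123+<sha>
    print("Release stable:", stable)      # e.g. v1.19.2
    # CI artifacts are then keyed by the resolved version:
    print(f"https://storage.googleapis.com/{CI_BUCKET}/ci/{ci_version}/")
```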

Unfortunately, the way certain version markers are generated and utilized can at best be confusing, and at worst, disruptive.

There are a variety of problems, some of which are symptoms of the others...

Generic version markers are not explicit

We publish a set of additional generic version markers:

  • k8s-master
  • k8s-beta
  • k8s-stable1
  • k8s-stable2
  • k8s-stable3

Depending on the point in the release cycle, the meaning of these markers can change.

  • k8s-master always points to the version on master.
  • k8s-beta may represent:
    • master's build version (pre-branch cut)
    • a to-be-released build version (post-branch cut)
    • a recently released build version (post-release)

Knowing what these markers mean at any one time presumes knowledge of the build/release process or a correct interpretation of the Kubernetes versions doc, which has frequently been out of date and lives in a low-visibility location.
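
To make the drift concrete, here is a purely illustrative mapping (the versions are invented) of what the generic markers might resolve to at three points in a hypothetical 1.19 cycle:

```python
# Illustration only: the same generic marker name resolves to different things
# depending on where we are in the release cycle. Versions are made up.
GENERIC_MARKER_MEANING = {
    "pre-branch-cut": {
        "k8s-master":  "v1.19.0-alpha.N (master)",
        "k8s-beta":    "v1.19.0-alpha.N (master)",   # same as master
        "k8s-stable1": "v1.18.z",
    },
    "post-branch-cut": {
        "k8s-master":  "v1.20.0-alpha.N (master)",
        "k8s-beta":    "v1.19.0-beta.N (release-1.19)",
        "k8s-stable1": "v1.18.z",
    },
    "post-release": {
        "k8s-master":  "v1.20.0-alpha.N (master)",
        "k8s-beta":    "v1.19.z (already released)",
        "k8s-stable1": "v1.19.z",                     # rolled forward
    },
}

for phase, markers in GENERIC_MARKER_MEANING.items():
    print(f"{phase}: k8s-beta -> {markers['k8s-beta']}")
```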

Manually created jobs using generic version markers can be inaccurate

Non-generated jobs using generic version markers do not get the same level of scrutiny as ones that are generated via releng/test_config.yaml.

This leads to discrepancies between the versions presumed to be under test and the versions displayed in testgrid.

ci-kubernetes-e2e-gce-beta-stable1-gci-kubectl-skew is a great example:

https://github.com/kubernetes/test-infra/blob/96e08f4be2a86189f59c72055785f817ac346d30/config/jobs/kubernetes/sig-cli/sig-cli-config.yaml#L85-L112

All variants of that prowjob have landed on the sig-release-job-config-errors dashboard for various misconfiguration issues that are the result of generic version markers.


I'd like to establish a rough plan of record to continue iteratively fixing some of these issues.

Plan of record

  • [x] (https://github.com/kubernetes/test-infra/pull/15564) Use explicit (latest-x.y) version markers in generated jobs
  • [x] (https://github.com/kubernetes/test-infra/pull/18290, https://github.com/kubernetes/release/pull/1389) Publish fast builds to separate subdirectory to prevent collisions
  • [ ] (https://github.com/kubernetes/test-infra/pull/18169) Add a new stable4 field to the kubernetes version lists in releng/test_config.yaml to remove confusion around jobs with beta in their name
  • [ ] Refactor any non-generated jobs using generic version markers
  • [ ] Refactor any config-forked release jobs using generic version markers
  • [ ] Refactor any jobs using fork-per-release-generic-suffix: "true" annotations
  • [ ] Disable usage of fork-per-release-generic-suffix: "true" annotations
  • [x] (https://github.com/kubernetes/community/pull/5268, https://github.com/kubernetes/test-infra/pull/19686) Rewrite the Kubernetes versions doc and put it in a more visible location
  • [ ] Refactor releng/test_config.yaml to remove references to generic versions (e.g., prefer ci-kubernetes-e2enode-ubuntu1-latest-1-19-gkespec over ci-kubernetes-e2enode-ubuntu1-k8sbeta-gkespec)

Previous Issues

linux/amd64 version markers are colliding with cross builds

(Fixed in https://github.com/kubernetes/test-infra/pull/18290.)

"Fast" (linux/amd64-only) builds run every 5 minutes, while cross builds run every hour. They also write to the same version markers (latest.txt, latest-<major>.txt, latest-<major>.<minor>.txt).

The Kubernetes build jobs have a mechanism for checking if a build already exists and will exit early to save on test cycles.

What this means is if a "fast" build has already happened for a commit, then the corresponding cross build will exit without building.

This has been happening pretty consistently lately, so cross build consumers are using much older versions of Kubernetes than intended.

(Note that this condition only happens on master.)
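
The early-exit check behaves roughly like the sketch below (the real logic lives in the build scripts and differs in detail); because fast and cross builds compare against and publish to the same markers, the more frequent fast build effectively shadows the cross build:

```python
# Rough sketch of the "skip if already built" check, assuming both build
# flavors share the same version marker. Not the actual build scripts.
import subprocess
import urllib.request

def current_version() -> str:
    # Placeholder: the real jobs derive this from the tree's version logic.
    return subprocess.run(
        ["git", "describe", "--tags"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def published_version(bucket: str, marker: str) -> str:
    url = f"https://storage.googleapis.com/{bucket}/{marker}"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode().strip()

def should_build(bucket: str, marker: str) -> bool:
    # Exit early if the marker already points at this commit's version.
    return current_version() != published_version(bucket, marker)

# Fast builds update latest.txt every ~5 minutes, so by the time the hourly
# cross build runs, should_build() is usually False and no cross artifacts
# get published for that version.
```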

Cross builds are stored in a separate GCS bucket

(Fixed in https://github.com/kubernetes/test-infra/pull/14030.)

This makes long-term usage of cross builds a little more difficult, since scripts utilizing version markers tend to consider only the version marker filename, while the GCS bucket name remains unparameterized.

Generated jobs may not represent intention

(Fixed in https://github.com/kubernetes/test-infra/pull/15564.)

As the generic version markers can shift throughout the release cycle, every time we regenerate jobs, they may not represent what we intend to test.

The best examples of this are pretty much every job using the k8s-beta version marker, and more specifically, skew and upgrade jobs.

bazel version markers appear to be unused

(Fixed in https://github.com/kubernetes/test-infra/pull/15612.)

ref: https://github.com/kubernetes/test-infra/pull/15106

/assign /area release-eng /priority important-longterm /milestone v1.17

justaugustus avatar Nov 05 '19 18:11 justaugustus

/cc

tpepper avatar Nov 06 '19 17:11 tpepper

To respond to @spiffxp's comment on the Branch Management issue:

I would suggest that branch management is the role that should handle "what 'channel' (beta/stable1/stable2/stable3) corresponds to which version?"

Agreed. Now codified in the Branch Management handbook.

  • the beta channel should only exist after the release-1.y branch is cut, and be unused after the v1.y.0 release is cut (aka during the period that builds being cut have the word beta in them, and the branch manager is running branchff and handling cherry-picks prior to the .0 release)
  • the stableN versions are moved forward after the v1.y.0 release is cut, so that stable1 refers to v1.y.0, the most recent stable release, stable2 refers to v1.y-1.0, the previous stable release, etc.
  • release teams have forgotten to do this last part since 1.11 (ref: kubernetes/test-infra#13577 (comment)), so we're in a state where the channels don't mean what they should

other ideas include:

I liked this idea and actually mentioned this to @tpepper after bumping into https://github.com/kubernetes/test-infra/issues/15514.

My thought here is that creating the new release branch jobs immediately after the final patch release would mean turning down CI on the oldest supported branch much sooner, while giving us time to watch the stability of the new release branch jobs.

This comes at the cost of more branch fast-forwards and more cherry picks.

Do we think that's worth it?

  • decide we don't like the stableN nomenclature at all, and adjust job configs and tooling appropriately (we liked this idea in https://groups.google.com/d/msg/kubernetes-sig-release/ABLP6LHIL7s/0RHWTUd9AQAJ but it proved to be trickier to implement than we had hoped (ref: kubernetes/test-infra#12516))

I'm not a fan of this nomenclature, especially because it has consistently caused confusion and inconsistency around what's actually under test at any given point in the release cycle.

Are there any glaring things that we'd need to look out for going down this route?

justaugustus avatar Dec 09 '19 06:12 justaugustus

This comes at the cost of more branch fast-forwards and more cherry picks.

I don't think it would cause more cherry-picks? Those don't start happening until after code freeze.

I was envisioning that alphas would get cut off of the release-1.y branch with this approach, and that master's version wouldn't bump until after code freeze. This is different from today, where master's version bumps as soon as the release branch is cut.

Are there any glaring things that we'd need to look out for going down this route?

We'll "lose" historical data for jobs on our dashboards (testgrid, triage, velodrome, etc), since none of them comprehend job renames or moves. Early in the release cycle is probably the best time to induce such a gap.

Outside of that I suspect it's not glaring things, just lots of tiny renames. @Katharine might be able to better explain what prevented us from moving ahead with the rename in https://github.com/kubernetes/test-infra/pull/12516.

spiffxp avatar Dec 09 '19 21:12 spiffxp

I don't think it would cause more cherry-picks? Those don't start happening until after code freeze.

@spiffxp -- Good point. This was mushy brain from triaging other stuff.

We'll "lose" historical data for jobs on our dashboards (testgrid, triage, velodrome, etc), since none of them comprehend job renames or moves. Early in the release cycle is probably the best time to induce such a gap.

I think I'm fine with losing some historical data if it leads to ease of management for the team over time.

@kubernetes/release-engineering -- What are your thoughts on this?

justaugustus avatar Dec 10 '19 20:12 justaugustus

Some discussion in Slack here: https://kubernetes.slack.com/archives/C09QZ4DQB/p1576104279099300

...and I'm poking at the version markers and release job generation here: https://github.com/kubernetes/test-infra/pull/15564

justaugustus avatar Dec 11 '19 23:12 justaugustus

Here's another instance of wrestling with version markers being a general nightmare: https://github.com/kubernetes/test-infra/pull/15875

That PR should've been at most a few commits.

This cycle I'm going to be looking at renaming the release-branch jobs that reference beta and stable{1,2,3}, and removing the generic suffix annotations. That should be an easy-ish way to get started.

From there, we'll need to look at refactoring generate_tests.py and test_config.yaml.

justaugustus avatar Jan 13 '20 04:01 justaugustus

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Apr 12 '20 08:04 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot avatar May 12 '20 08:05 fejta-bot

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

fejta-bot avatar Jun 11 '20 09:06 fejta-bot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 11 '20 09:06 k8s-ci-robot

PR to support uploading multiple additional version markers in push-build.sh (ci-kubernetes-build* jobs): https://github.com/kubernetes/release/pull/1385

justaugustus avatar Jul 01 '20 06:07 justaugustus

@BenTheElder @spiffxp -- Updated the issue description with more details and a plan of record.

/kind bug cleanup /milestone v1.19 /remove-priority important-longterm

/priority critical-urgent (as this is impacting users who rely on the cross build markers)

justaugustus avatar Jul 16 '20 07:07 justaugustus

I'd like to make sure we unblock the cross build stuff first. A checklist of work is a good start. I still think we're lacking a description of the desired state. My guess is the desired state involves no more "generic" version markers. I'll post a more detailed review when I get time later today.

spiffxp avatar Jul 16 '20 17:07 spiffxp

Heya, any updates on this since @spiffxp's last comment? Based on chats this week with key Release Team members this remains high-priority. Anything we could delegate here?

lasomethingsomething avatar Aug 05 '20 10:08 lasomethingsomething

i think we should have something like:

ci/latest
ci/latest-x
ci/latest-x.yy
release/stable
release/stable-x
release/stable-x.yy

and that's all that we need...not sure why the project would need anything else?

markers with a -fast suffix seem fine for k8s maintainers, but for external consumption they do not make much sense. IMO, we should remove all k8s- markers.

skew test jobs for e.g. kubectl could use ci/latest-x.yy explicitly instead of markers that imply N-1, N-2. SIG CLI could upgrade this on each cycle or we could have automation in test-infra to update the jobs.

for the kubeadm jobs today we only use ci/latest* and we update them each cycle.
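
A hypothetical per-cycle automation along those lines (the paths, filenames, and marker format here are assumptions, not existing test-infra tooling) could be as small as:

```python
# Hypothetical sketch: bump explicit ci/latest-x.yy references in job configs
# once per release cycle. Paths and marker format are assumptions.
import pathlib
import re

OLD, NEW = "1.19", "1.20"  # illustrative versions

for cfg in pathlib.Path("config/jobs").rglob("*.yaml"):
    text = cfg.read_text()
    bumped = re.sub(rf"ci/latest-{re.escape(OLD)}\b", f"ci/latest-{NEW}", text)
    if bumped != text:
        cfg.write_text(bumped)
        print(f"updated {cfg}")
```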

neolit123 avatar Aug 05 '20 20:08 neolit123

Most of CI should be consuming quick builds. If I wanted to support CI builds in a tool like kind, I'd want the fast builds when possible. The slow builds are slow.

Conversely, most things don't really need the slow builds. We have little to no CI that actually depends on these.

BenTheElder avatar Aug 06 '20 01:08 BenTheElder

Heya, any updates on this since @spiffxp's last comment?

Yep. Chatted w/ Aaron and the description of the issue reflects his request now.

Based on chats this week with key Release Team members this remains high-priority. Anything we could delegate here?

There's still a bit of cleanup to do here on my part around the generic version markers.

i think we should have something like:

ci/latest
ci/latest-x
ci/latest-x.yy
release/stable
release/stable-x
release/stable-x.yy

and that's all that we need...not sure why the project would need anything else?

There are also the release/latest* markers, which represent pre-release versions. https://github.com/kubernetes/test-infra/blob/master/docs/kubernetes-versions.md was updated recently to discuss the markers currently in use.

markers with -fast suffix seems fine for k8s maintainers, but for external consumption it does not make much sense.

+100

IMO, we should remove all k8s- markers.

Already part of the plan, described under Generic version markers are not explicit.

skew test jobs for e.g. kubectl could use ci/latest-x.yy explicitly instead of markers that imply N-1, N-2. SIG CLI could upgrade this on each cycle or we could have automation in test-infra to update the jobs.

Agreed. Described in Manually created jobs using generic version markers can be inaccurate and part of Refactor any non-generated jobs using generic version markers.

Most of CI should be consuming quick builds. If I wanted to support CI builds in a tool like kind I'd want the fast builds when possible. The slow builds are slow. Oppositely, most things don't really need to slow builds. We have little to no CI that actually depend on these.

Right, but this is really only true for CI. Our production consumers, release tooling, kubeadm, etc. expect a cross build to be available when traversing a marker.
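
For example, a cross build consumer might guard against a fast-build-only version along the lines of this sketch; the bucket name and artifact path are assumptions based on the usual dev-build layout, not a documented contract:

```python
# Hedged sketch: resolve a marker, then verify a non-amd64 artifact exists for
# that version before consuming it. Bucket and paths are illustrative.
import urllib.error
import urllib.request

BUCKET = "k8s-release-dev"  # assumed CI bucket; adjust to the one in use

def resolve(marker: str) -> str:
    url = f"https://storage.googleapis.com/{BUCKET}/{marker}"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode().strip()

def artifact_exists(version: str, path: str) -> bool:
    url = f"https://storage.googleapis.com/{BUCKET}/ci/{version}/{path}"
    try:
        with urllib.request.urlopen(urllib.request.Request(url, method="HEAD")):
            return True
    except urllib.error.HTTPError:
        return False

version = resolve("ci/latest.txt")
# A cross build publishes more than linux/amd64; if an arm64 binary is missing,
# the marker likely points at a fast build.
if not artifact_exists(version, "bin/linux/arm64/kubelet"):
    raise SystemExit(f"{version} has no cross-build artifacts yet")
```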

Email to follow on version marker updates.

justaugustus avatar Aug 06 '20 16:08 justaugustus

it's true for anyone doing CI though, which is sort of the point of CI markers. full blown releases are different.

BenTheElder avatar Aug 07 '20 00:08 BenTheElder

Opened https://github.com/kubernetes/community/pull/5268 and https://github.com/kubernetes/test-infra/pull/19686 to move the version markers doc to k/community.

justaugustus avatar Oct 26 '20 10:10 justaugustus

/remove-priority critical-urgent /priority important-soon

justaugustus avatar Feb 07 '21 01:02 justaugustus

the stableN versions are moved forward after the v1.y.0 release is cut, so that stable1 refers to v1.y.0, the most recent stable release, stable2 refers to v1.y-1.0, the previous stable release, etc.

this still hasn't happened, and it should have by now (ref: https://github.com/kubernetes/test-infra/issues/19922#issuecomment-780789250)

spiffxp avatar Feb 17 '21 19:02 spiffxp

Markers in k8s-release-dev were rolled forward, markers in kubernetes-release-dev were not. PR to address here: https://github.com/kubernetes/test-infra/pull/20887

spiffxp avatar Feb 17 '21 21:02 spiffxp

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar May 18 '21 22:05 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

fejta-bot avatar Jun 17 '21 22:06 fejta-bot

/remove-lifecycle rotten

We got burned by k8s-beta again as part of https://github.com/kubernetes/kubernetes/issues/103697

  • https://github.com/kubernetes/test-infra/pull/22790 removed it from gs://kubernetes-release-dev
  • Nothing currently publishes it to gs://k8s-release-dev
  • https://github.com/kubernetes/test-infra/pull/22840 flipped everything to look at gs://k8s-release-dev
  • upgrade jobs referencing k8s-beta started to fail
  • I manually copied k8s-beta over, and also manually updated it to the value of latest-1.21

I know we're like... right at test freeze. And I haven't paged in all the historical context or job rotation implications etc. But I'm really tempted to say "just slam everything over to latest-1.xy and be done with it". IMO the contributor clarity would far outweigh the toil of updating some jobs every N months.

spiffxp avatar Jul 15 '21 15:07 spiffxp

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 13 '21 16:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 12 '21 17:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Dec 12 '21 17:12 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 12 '21 17:12 k8s-ci-robot

/reopen

My old friend continues in https://github.com/kubernetes/test-infra/pull/28079...

justaugustus avatar Nov 22 '22 01:11 justaugustus