dotnet-buildtools-prereqs-docker icon indicating copy to clipboard operation
dotnet-buildtools-prereqs-docker copied to clipboard

Consider creating floating OS tags for use by repos which aren't dotnet/runtime

Open mconnew opened this issue 11 months ago • 24 comments

In PR 1274 the following statement is made:

We use version specific references in infra to ensure that our CI builds are reliable. One can imagine using floating OS tags (such as debian-oldest and debian-latest), however such an approach would be guaranteed to break our build. We know that since we often see build and test breaks that need addressing in PRs where we update build and test images.

While this statement may be true for the dotnet/runtime repo, it's not for others. The WCF repo hasn't ever had a build break due to incompatibilities from a Linux distro update. With only one exception (and that's Windows only), we have zero native platform calls in our code base. If there ever is an implementation incompatibility with a new OS distro, it will get discovered and fixed in the dotnet implementation in the runtime repo before we ever use that distro.

What we have had happen multiple times is our build break due to a Linux distros being retired in the build system and we weren't informed that was going to happen.

Having someone who is working on retiring older container images for unsupported distros having to reach out to each repo that is still using it, and then having someone on that team make a change, submit a PR, verify everything is still working, then merge it, that's all a manual process with lots of room for mistakes to be made. It also adds extra non-productive work which is a lot more disruptive and costly for smaller dev teams.

Instead of a concept of latest, (eg debian-latest), what about having a tag for the current distro? Eg for debian it would be something like debian-current. The current tag would get moved to a new distro once it's been verified that the dotnet/runtime repo is passing all its tests (or at least any failures aren't due to the new distro) on a new distro version. Maybe even wait a month after that to allow any repos with an infrequent darc update period will pick up a newer build of dotnet if any product fixes were needed. Any repo that's sensitive to distro changes and want to manually and intentionally move forward can stick to the current distro version named tags, but repos that are happy with the latest that dotnet/runtime has resolved any issues for can use the -current suffix? It will cut down on build system maintenance for teams like WCF where developer time is already stretched very thin.

mconnew avatar Dec 04 '24 20:12 mconnew

cc @richlander

mthalman avatar Dec 04 '24 20:12 mthalman

This is good topic. Yes, the content was written with a dotnet/runtime bias.

We could consider this. I certainly believe you that Linux breaks are low. The major point of the bottom part of the product is to insulate most libraries and apps from pain. If we did this, there would be a risk of CI breakage, but it likely isn't that high. It would be very similar to people who use ubuntu-latest for GitHub Actions, which I assume is a lot.

The key question is how many repos would use this model. Assuming we agree to go this direction, is it worth the effort to maintain the model as you propose. I think if we got a half-dozen repos to sign up for that, it would make sense.

richlander avatar Dec 04 '24 22:12 richlander

After a tiny bit more thought, I think we'd want to take a page from our build images: https://github.com/dotnet/runtime/blob/main/docs/workflow/using-docker.md#the-official-runtime-docker-images

I propose a model like this:

  • Offer convenience tags for Alpine and Ubuntu (to start; limit scope).
  • Add a .NET version to the tags to avoid undesirable coupling between release branches.
  • Offer just one tag per distro, likely aligning with latest.

I'm thinking of something like: alpine-net10.0.

Alternatively, is there a reason why Arcade-provided templates are not updating these tags for you? That's another way to solve this problem.

richlander avatar Dec 04 '24 23:12 richlander

If there's a way to get something like darc to update them, that works for me too. I don't know enough about the inner workings of darc, whether you can have a concept of a version of something which isn't a package, but if it could automatically update our versions file with a string which could be used as the basis of a container name, then that would be a workable solution.

We do have a property to specify which .NET version to publish our tests to for sending to Helix, so we could use that value as part of the container tag. Eg we would do something like this:

    <HelixTargetQueue Include="(Debian-$(XUnitPublishTargetFramework))[email protected]/dotnet-buildtools/prereqs:debian-$(XUnitPublishTargetFramework)" />

I'm happy with either option, or some other solution, my main goal is to not have to update things because a distro version is out of support. We really only run on a broad range of distros because in the early days of .NET Core, the WCF tests caught subtle differences in networking/security related behaviors between distros that corefx didn't as our tests covered a wider variety of scenarios (back when different libcurl installations could be compiled with different options). We haven't caught a problem which the runtime missed in a long time now, so it's more of a sanity check in case a customer has a problem with a specific distro we can be confident it's something specific to their setup rather than a fundamental problem.

mconnew avatar Dec 05 '24 00:12 mconnew

@mthalman @mmitche -- Who are the best folks to consider what a good model would look like? This isn't pressing.

I agree with @mconnew that we should consider aligning ease-of-use and reliability concerns. Not every repo is as "concerning" as dotnet/runtime. On the flip-side, old container images are not serving any purpose for us.

richlander avatar Dec 05 '24 04:12 richlander

A PR flow model would be broadly useful. Most product repos would be able to utilize this. This would reduce manual work while keeping more repos on the latest.

MichaelSimons avatar Dec 05 '24 14:12 MichaelSimons

I am guessing that repos split via MSBuild vs yaml, w/how the container images are specified. I am also guessing that the repos to start with would be the MSBuild ones. Is that right?

richlander avatar Dec 05 '24 16:12 richlander

@MichaelSimons When you talk about a PR model, do you mean something that would automatically move from debian-11 to debian-12? or from specific image builds of debian-11?

I generally favor a tag model. OS version specific tags (debian-11), OS family specific tags (debian-latest), and specific image references where needed (e.g. a break occurs) over a PR model. Our existing PR model that uses darc is frequently an invitation to stay on old versions of dependencies, rather than upgrade them. This is typically because of a couple factors:

  • Subscriptions have to be managed properly for many target+repo branch+dependency combinations. This is a challenge already.
  • Stale and ignored PRs are common. Repos that do not get a lot of attention are frequently just left with open PRs. Auto-merge is not an option here. We shouldn't be making changes to repositories without human signoff.

Consider the 'simplest' dependency we have sitting around that flows widely, arcade. There are repositories with ancient versions of arcade sitting around simply because the branch or repo doesn't get a lot of attention. Other non-arcade dependencies tend to be much worse.

I think we'd be inviting less uniformity and up-to-date-ness (important for security!) than if we encouraged folks to just take a latest tag.

The other thing is that a tag model is already implementable in the current system. Maestro changes to update images are non-trivial and is a huge feature ask.

mmitche avatar Dec 05 '24 16:12 mmitche

@MichaelSimons When you talk about a PR model, do you mean something that would automatically move from debian-11 to debian-12? or from specific image builds of debian-11?

I am referring to opening a PR to upgrade between OS versions. This is when there are risks associated with the upgrade. We already have OS version specific tags. The tags get updated frequently today via automation to pick up the latest security fixes. Repos automatically get the updates because of the "floating" nature of the tags. This work well.

MichaelSimons avatar Dec 05 '24 17:12 MichaelSimons

I am referring to opening a PR to upgrade between OS versions. This is when there are risks associated with the upgrade. We already have OS version specific tags. The tags get updated frequently today via automation to pick up the latest security fixes. Repos automatically get the updates because of the "floating" nature of the tags. This work well.

I'm sure this could be done using Renovate. That wouldn't require any infrastructure work on our part beyond a configuration file in the repo. As long as we defined the tags in such a way that Renovate can interpret the version-ness, it might just work.

mthalman avatar Dec 05 '24 17:12 mthalman

We should probably have a meeting about this. I suspect that there is a lot of context that isn't shared across everyone. Isn't there a "CI Council" group. Can we crash your meeting?

We'd like to float this idea with you. I can tell that there are some fixed views on this, but I imagine we can design a better system if we attag it together.

richlander avatar Dec 06 '24 03:12 richlander

Yeah we can talk about this. To me, the questions that are important are who are the consumers who would favor a PR model vs. -current vs. fine as it is today? @mconnew thinks that a -current tag is just fine for WCF and some other repos. I'm skeptical that runtime, sdk, and aspnetcore would subscribe to a PR model that moves between versions. I think they tend to want to be on a range of specific versions today. However, I could be wrong about that.

mmitche avatar Dec 06 '24 15:12 mmitche

dotnet/runtime won't be moving to that model. In fact, the runtime and libraries test legs don't use the same strategies and are also responsible for testing everything. If we enabled this model, it would only be for repos that need minimal coverage. We'd only do that for a small number of distros. I think we should only consider this proposal if we can get a dozen or more repos that want this model.

richlander avatar Dec 06 '24 15:12 richlander

There was a meeting between @MichaelSimons, @mmitche, and @mthalman to discuss this.

We propose the following:

  • Include "current version" tags to satisfy those repos like wcf which are not (or less) susceptible to breaks when moving between distro versions.
  • Continue to provide version-specific tags for those repos which need to intentionally move between versions.
  • Implement an automated PR creation process that repos can opt into to make use of version-specific tags and have them updated by a bot. I'm going to investigate the feasibility of using of Renovate for this.

mthalman avatar Dec 06 '24 16:12 mthalman

That sounds good. Naturally, we'll want a short spec written on the scheme if we go that direction.

I think we'll want a new team created that we can at-mention on issues when we are in process of updating tags. I see a benefit for adopting this model for as few distros as possible. That way, when we at-mention the team, there is a high likelihood that we're updating a distro they care about. I propose we go with Ubuntu and Alpine, only.

richlander avatar Dec 06 '24 16:12 richlander

I talked to @jkotas about this. He was saying that the best model for runtime would be PR-driven digest updates to avoid surprises, for both build and test images.

I think the following might be the best model, starting with the most conservative:

  • Digest references w/PR driven updates
  • Fully-qualified version tag references w/PR or human driven updates
  • Floating tag (per .NET version) w/PR or human driven updates

I like the idea of using this opportunity (if we move forward with it) to improve both efficiency, compliance, and test coverage.

An open question is which model that ASP.NET Core would prefer.

richlander avatar Dec 17 '24 20:12 richlander

Digest references w/PR driven updates

What problem is this solving? I am not in favor of this, it is a step backwards. It will slow the rollout of image rebuilds to pickup CVE fixes, therefore exploding the number of EOL images in use across .NET which is already a problem. Furthermore the readability of digest references is terrible as the digest gives no indication of the distro/version in use.

MichaelSimons avatar Dec 17 '24 20:12 MichaelSimons

An open question is which model that ASP.NET Core would prefer.

I like the "Floating tag (per .NET version) w/PR or human driven updates" model - I'd be hesitant to use a -latest image in a servicing branch, in case some tooling we rely on becomes unavailable when the underlying image changes (e.g. a specific version of node, etc). We tend to update our images in main around once per year, and in servicing when an image goes EOL (which often comes with some pains of spinning up a new image w/ the right tooling installed).

If we have another meeting about this, I can represent aspnetcore

wtgodbe avatar Dec 17 '24 20:12 wtgodbe

Furthermore the readability of digest references is terrible as the digest gives no indication of the distro/version in use.

This isn't a factor. You can specify both the tag and the digest in the image name reference. The tag becomes just for human readability at that point.

mthalman avatar Dec 17 '24 21:12 mthalman

What problem is this solving?

It would be solving surprise build breaks from build images updates we get in dotnet/runtime every once in a while. Here is an example of a recent surprise build break caused by build image updates:

  • https://github.com/dotnet/runtime/issues/110517
  • https://github.com/dotnet/runtime/pull/110514
  • https://github.com/dotnet/runtime/pull/110614
  • https://github.com/dotnet/runtime/pull/110792

The question is: Do we want dotnet/runtime engineering system to be predictable? As in, if the build of given SHA succeeds on the official build CI machine today, does the construction of the engineering system guarantee that the build succeeds tomorrow too?

jkotas avatar Dec 17 '24 22:12 jkotas

The question is: Do we want dotnet/runtime engineering system to be predictable? As in, if the build of given SHA succeeds on the official build CI machine today, does the construction of the engineering system guarantee that the build succeeds tomorrow too?

FWIW if a build happens on directly on a build agent (e.g. windows), we don't have this guarantee. The VM's the agents are based on get patched regularly. That being said I acknowledge this is a potential value add we get from using containers.

My concerns with using digests/shas is that they will be changing frequently across all branches. Keeping everything up-to-date is going to take a non-trivial amount of work and even if kept up, it is going to generate a lot of compliance alerts related to the use of images with vulnerabilities and/or EOL images.

It feels like a balance/ROI decision. As noted in the one issue referenced, there may be things we can do to make our builds more resilient to changes. Perhaps there is a difference between an explicit Dockerfile change versus a triggered rebuild to pick-up CVE updates. By that I mean triggered rebuilds to pick-up CVEs do not require repo source updates while Dockerfile changes would (via auto PRs).

MichaelSimons avatar Dec 17 '24 23:12 MichaelSimons

if a build happens on directly on a build agent (e.g. windows), we don't have this guarantee.

Yes, and we have been complaining about this for years. It was worse when we were building with preview VS versions. It got better recently when we switched to building with stable VS versions.

It is ok to say that we prefer aggressive rollout of image updates (no PRs) and declare the engineering system unpredictability as "by design". We should acknowledge that it is not a cost-free and stress-free option.

jkotas avatar Dec 17 '24 23:12 jkotas

We can uplevel this ... there are two major proposals on the table:

  • Offer floating tags
  • Offer a PR-driven strategy with a new tool (that @mthalman is looking at)

I think the floating tag idea is largely understood at this point. It is pretty cheap to implement and would have specific terms of use for repo owners who want to use it.

There are two approaches we can use for the PR-driven strategy:

  • Upgrade version tags when they are stale, for example from Alpine 3.21 to 3.22.
  • Upgrade from one digest to another (potentially also across version boundaries)

Assuming we fund the tool at all, then the second option is one of the modalities. Most of the concerns on EOL images would (A) be a concern for the repo owners, and (B) could be mitigated by the update cadence. The update cadence hasn't been discussed. We're not required to have 24 hour turnaround on our repos in the general case for vulnerabilities, AFAIK.

richlander avatar Dec 18 '24 00:12 richlander

Offer a PR-driven strategy with a new tool (that @mthalman is looking at)

I've created a new issue for this: https://github.com/dotnet/dotnet-buildtools-prereqs-docker/issues/1321

mthalman avatar Jan 14 '25 16:01 mthalman