we should consider removing build tools from Helix images
I somewhat stumbled on this one while running out of space on my local machine. I noticed that most of the Helix images are pretty large. For example, Ubuntu 18.04 comes in at 1.5 GB:
$ docker image ls
mcr.microsoft.com/dotnet-buildtools/prereqs ubuntu-18.04-helix-amd64-20230216023557-4443d0d 6d7d1241e2c6 7 seconds ago 1.52GB
It seems like many of the images are pulling in build tools directly, or indirectly by relying on the prereq base image.
And in both cases they seem to be geared toward building the runtime, not executing tests. This may be for historical reasons.
I did a little fiddling (https://github.com/dotnet/dotnet-buildtools-prereqs-docker/compare/ubu18?expand=1) and I can get the image down to about a third of its size:
mcr.microsoft.com/dotnet-buildtools/prereqs ubuntu-18.04-helix-amd64-20230216060058-4443d0d 07e33080f83d 5 seconds ago 593MB
and even less with a restricted locale:
mcr.microsoft.com/dotnet-buildtools/prereqs ubuntu-18.04-helix-amd64-20230216061525-4443d0d c2d0fb3b5d9c 52 seconds ago 463MB
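For anyone who wants to see where the bulk comes from, the size can be broken down per layer with stock docker commands. A quick sketch (the tag is one of the examples above; substitute any tag you have locally):

```shell
# One of the image tags quoted in this thread; substitute your own local tag.
IMAGE=mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-amd64-20230216023557-4443d0d

# Per-layer size plus the command that created each layer; the largest layers
# are typically the apt-get install steps pulling in compilers and -dev packages.
docker history --format 'table {{.Size}}\t{{.CreatedBy}}' "$IMAGE"

# Total footprint of all helix-tagged images on this machine.
docker image ls --format '{{.Size}}\t{{.Repository}}:{{.Tag}}' | grep helix
```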
Is the ~1 GB saving per image interesting and worth the trouble @MattGal @mthalman?
I'm reasonably confident that the resulting image would be capable of running runtime tests.
But I don't really have visibility into other repos or any historical reasons.
If we decouple the dependency on the common base image we may see more duplication across the Dockerfiles, but probably not too much.
cc @ChadNedzlek
1 GB of space per image is significant and would help with the overall storage costs associated with the images we produce.
No objection to installing less stuff, other than to point out the obvious that there will be some teething pains in getting the new list correct and some tests will fail.
To my knowledge, most of the "why do the test images look like the build images" is just an artifact of folks being new to Docker and there being only that one list of dependencies available back when the images were created.
That said, why would you even have a ubuntu-18.04-helix-amd64... image? Even with smaller images, it would be significantly more performant to just run on the ubuntu.1804.amd64.* "normal" Helix queues, since there's no docker pull involved; Docker should be for environments we don't already have readily available in Helix. If it's about needing things like msquic installed, we can add artifacts to the non-Docker images to make them the way you want (and indeed, the work going on right now is to unify how this works).
I agree with your point, @MattGal. There will be some pain, so the question is whether it would be worth it (I can only guess).
And if we go for it, it may need coordination beyond the runtime repo.
I picked ubuntu-18.04 as an example, but the other Helix images also look somewhat large:
mcr.microsoft.com/dotnet-buildtools/prereqs alpine-3.17-helix-amd64-20230210201636-609d24f 67b176d4e97d 5 days ago 2.1GB
Runtime tests consume many docker images, and I'm not sure there is a desire (or a possibility) to replace all of them with full queues. I also don't have visibility into the operational and maintenance cost. While Docker is not quite the same (it lacks a matching kernel), I do like the ease of updates and the ability to use docker to investigate test failures.
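To make the failure-investigation workflow concrete, a hedged sketch of what it looks like with docker; the tag is one of the examples above, and the container name and copied path are hypothetical:

```shell
# Investigate a test failure using the exact image the Helix queue used.
# Substitute the tag from your failing run; this one is quoted above.
IMAGE=mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-amd64-20230216023557-4443d0d
docker pull "$IMAGE"

# Drop into an interactive shell in the same environment the test ran in.
docker run --rm -it --name helix-repro "$IMAGE" /bin/bash

# From another terminal, copy files (e.g. logs) out of the running container.
# The in-container path here is a hypothetical example.
docker cp helix-repro:/tmp/testlog.txt .
```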
I would like to reach consensus before jumping into a sweeping cleanup. (We may do it opportunistically as we onboard new OS versions.)
The "ease of updates" is going to go away with some new changes, as the images are going to be managed identically to the VM images Matt's talking about, but investigating failures is an interesting point. The VM images can also be used to spin up an Azure VM, so that path is still possible, and we could potentially work at making it easier.
Making updates more difficult is ... interesting. I personally see the problems with VMs as:
- takes a long time to start
- adds additional cost
- does not allow direct access via VPN
- lacks development support (for example, with Docker I can easily map new artifacts from my dev machine into the container, as well as share tools and scripts)
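The artifact-mapping workflow from the last bullet can be sketched like this; all host paths and the image tag are illustrative examples, not actual repo conventions:

```shell
# Bind-mount freshly built artifacts and personal tools/scripts from the dev
# machine into the test container. Host paths here are hypothetical.
docker run --rm -it \
  -v "$HOME/runtime/artifacts:/artifacts" \
  -v "$HOME/devtools:/devtools:ro" \
  mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-amd64-20230216023557-4443d0d \
  /bin/bash

# Rebuilding on the host updates /artifacts inside the running container
# immediately: no docker build or image push in the loop.
```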
> I personally see the problems with VMs as:
> - takes a long time to start
Taking longer for "plain" Helix VMs to start than docker ones is impossible. Our "docker hosts" are the plain Helix VMs: if you send work to ubuntu.1804.amd64.open@some-ubuntu-1804-dockertag, you are literally first spinning up an Ubuntu 18.04 machine, doing all the same steps as a normal Helix work item, then downloading all the layers of the docker image, then starting the container. There's no way this is faster than the same thing minus the last two steps.
> - adds additional cost
Again I disagree. The docker scenario slows things down for the reasons above, and VMs cost money per hour; so even though the connection to the Microsoft container registry from within the data center is fast and free, the extra time spent on a docker Helix work item definitely costs more than not spending it.
> - does not allow direct access via VPN
I can try to poke more on this with the DDFUN folks. I agree that the repro machines not being available off corpnet is not an inclusive behavior for our remote coworkers.
> - lacks development support (for example, I can easily map new artifacts from my dev machine to docker as well as share tools and scripts)
No argument here, but you do still have to merge a pull request and wait in both cases to change test agent behavior.
I'm talking about the developer experience, @MattGal. I have no visibility into the operational side. I can run a container on my laptop for as long as I want, and start/restart is fast. And no emails about running machines and cost savings.
It is also easy to prototype and test changes.
Ah. Yeah, no disagreement there; prototyping is a Docker strong point.