Add lifecycle annotations to images pushed to MCR
A broader set of services is starting to consume image annotations. Specifically, Azure container scanning honors the lifecycle annotation to treat EOL images specially. As a result, the images produced with the .NET Docker infrastructure should have these lifecycle annotations added.
Related links:
- Threat & Vulnerability Management guidance (MSFT internal link)
> How can I mark an image in my ACR as end of life or dictate how long it should be scanned and reported on?
>
> Attach annotations to the image called "lifecycle annotations", which requires:
>
> - The ACR Metadata Service to be onboarded for your registry (see callout below)
> - The [ORAS CLI](https://oras.land/docs/installation/) must be installed
> - The ORAS CLI must be authenticated to your registry [(See How to Authenticate ORAS CLI)](https://learn.microsoft.com/en-us/azure/container-registry/container-registry-oci-artifacts#sign-in-to-a-registry)
>
> You can use the ORAS CLI to attach lifecycle annotations by running the following command. Variables:
>
> - `$IMAGE` refers to a fully qualified image name or digest (e.g. `myregistry.azurecr.io/myimage:v1` or `myregistry.azurecr.io/myimage@sha256:fdbd1bc674f77e7a3b8129ecac062a9b7c22de74968e8ea3a46f071f61f3e99f`)
> - `lifecycle.end-of-life.date` should be in the expected format of `yyyy-mm-dd`
>
> ```shell
> oras attach --artifact-type application/vnd.microsoft.artifact.lifecycle --annotation "vnd.microsoft.artifact.lifecycle.end-of-life.date=2023-05-12T21:15:10.8343521Z" $IMAGE
> ```
Work Items:
- [x] https://github.com/dotnet/docker-tools/pull/1334
- [x] https://github.com/dotnet/docker-tools/pull/1340
- [x] https://github.com/dotnet/docker-tools/pull/1358
- [x] https://github.com/dotnet/docker-tools/pull/1364
- [x] https://github.com/dotnet/docker-tools/pull/1367
- [x] https://github.com/dotnet/docker-tools/pull/1413
- [ ] https://github.com/dotnet/docker-tools/issues/1417
- [ ] Roll out
- [x] [dotnet-docker/nightly] Annotate and remove historical images
- [x] [dotnet-docker/nightly] Enable annotations in pipeline
- [ ] [buildtools-prereqs] Annotate and remove historical images
- [ ] [buildtools-prereqs] Enable annotations in pipeline
- [ ] [dotnet-docker/main] Annotate and remove historical images
- [ ] [dotnet-docker/main] Enable annotations in pipeline
- [ ] [dotnet-framework-docker] Annotate and remove historical images
- [ ] [dotnet-framework-docker] Enable annotations in pipeline
- [ ] Clean up
  - [ ] `_placeholder` tags that we pushed to MAR for historical images that were imported from MAR
[Triage] This needs some design work before we start. We need a way to figure out which images to tag as EOL, and when, without scanning every image in our ACR.
Principles:
- Images that are currently in support will not have EOL annotations. All other images will.
- EOL annotations should be added after we know an image is already EOL. We do not make the assumption that an image will be EOL in the future because we don't know for sure when that will be (e.g. sometimes we unexpectedly skip servicing for a given month).
- These annotations will apply to both images and multi-arch tags (or, more precisely, to manifests and manifest lists).
- Application of annotations should be automated and included as part of the overall image publishing workflow.
Scenarios
- .NET servicing (e.g. 8.0.1 => 8.0.2)
- Rebuilds outside of .NET servicing (such as base image updates or forced rebuilds for package security updates)
- Distro going out of support (such as when we stop publishing a given Alpine version for the rest of the lifetime of a .NET version)
- .NET major version going out of support
Known Good State
First we need to get to a known good state with some initial setup work. This involves setting EOL date annotation for all out-of-support manifests. We can debate how accurate these dates need to be. For example, all 5.0-based tags could be given an EOL date of 5.0's EOL date. Dangling manifests would probably just be set to an arbitrary date. But the need here is for all of these out-of-support manifests to have an EOL date in the past. Any manifest that is not contained in https://github.com/dotnet/versions/tree/main/build-info/docker is considered to be out-of-support.
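The initial pass can be sketched as a simple set difference. Both inputs are assumed to be gathered separately (enumerating the registry's manifests and parsing the image info files in the versions repo); the function name is hypothetical:

```python
def find_out_of_support_digests(all_registry_digests, supported_digests):
    """Any manifest not tracked in the versions repo's image info files
    (build-info/docker) is considered out of support and gets an EOL
    annotation with a date in the past."""
    return sorted(set(all_registry_digests) - set(supported_digests))
```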
Scenarios: Servicing and Rebuilds
To account for the scenarios of servicing and rebuilds, the logic can be integrated into the publishing pipeline. This logic would work by first getting the Dockerfile paths for all the images that are being published. It would then look up those paths in the image info file (prior to updating that file as part of the publish process) to see if that Dockerfile path had been previously published. If that Dockerfile path existed in the image info file, grab the image digest associated with it. That digest should be annotated with an EOL of today's date.
This also needs to check the multi-arch tags. For example, a tag like 6.0.28-alpine3.18 is EOL when 6.0.29 is published. That's a multi-arch tag so it's associated with multiple Dockerfiles. For each Dockerfile that's being published, check whether it's associated with any multi-arch tags. For each of those tags, find all associated images. Check whether all of those Dockerfiles are included in this current publishing run. If they are then the multi-arch tag (manifest list) should be EOL-annotated.
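The two paragraphs above can be sketched as follows. The real image info file has a richer schema; this sketch reduces it to two hypothetical views: platform digests keyed by Dockerfile path, and manifest-list digests keyed by multi-arch tag along with the Dockerfile paths they cover.

```python
def plan_eol_digests(old_platform_digests, old_manifest_lists, published_paths):
    """Return the old digests to EOL-annotate for this publishing run.

    old_platform_digests: {dockerfile_path: digest} from the prior image info
    old_manifest_lists:   {tag: (digest, {dockerfile_path, ...})}
    published_paths:      set of Dockerfile paths in the current run
    """
    eol = []
    # Previously published platform images being superseded in this run.
    for path, digest in old_platform_digests.items():
        if path in published_paths:
            eol.append(digest)
    # A multi-arch tag (manifest list) is superseded only if every
    # Dockerfile behind it is part of this publishing run.
    for digest, paths in old_manifest_lists.values():
        if paths and paths <= published_paths:
            eol.append(digest)
    return eol
```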
Scenarios: Distro/.NET out of support
To account for distro versions or .NET major versions going out of support, we need to account for Dockerfile removal. If a Dockerfile path is removed from the image info file as part of publishing, the digest associated with that file should be EOL-annotated. Similarly as with servicing, multi-arch tags should be accounted for as well. If a multi-arch tag has been removed from the image info file, the digest for it should be EOL-annotated.
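The removal case reduces to a keyed difference between the old and new image info files, again using a simplified hypothetical `{key: digest}` view where a key is either a Dockerfile path or a multi-arch tag:

```python
def plan_removed_eol_digests(old_digests, new_digests):
    """Digests whose Dockerfile path or multi-arch tag was removed
    from the image info file during publishing."""
    return sorted(digest for key, digest in old_digests.items()
                  if key not in new_digests)
```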
This looks solid.
Perhaps we should delete all -preview tags (before 9.0) at the same time. I don't see value in marking those for EOL (and leaving them).
Looks good @mthalman. I had a few thoughts/questions as I reviewed your proposal:
- One scenario variant that we see in buildtools-prereqs is that support for an image variant is dropped. To me this is equivalent to a .NET major version going out of support in the official .NET Docker repo.
- Regarding Servicing and Rebuilds, you mentioned we would gather the EOL data during publishing. When would the adornment of the actual EOL annotations take place?
- There are likely some design/implementation challenges in regards to dates/timezones and publishing/ingestion occurring across a date boundary.
> When would the adornment of the actual EOL annotations take place?
I was thinking it would either be at the very end of the publishing stage or as a separate post-publishing stage.
We are certain to get the adornment wrong at least once. We need to ensure we can run this stage on its own after quickly altering some data file.
FYI @tmds @baronfel
In addition to the EOL annotation we should consider filling out the other predefined OCI image annotations: https://github.com/opencontainers/image-spec/blob/main/annotations.md
(the EOL annotation is not part of the OCI spec)
> When would the adornment of the actual EOL annotations take place?
>
> I was thinking it would either be at the very end of the publishing stage or as a separate post-publishing stage.
This needs to be expanded a bit. There are several questions to consider:
What time should the EOL value represent? Is it based on when the current images have been successfully ingested by MAR? What happens if the images are technically available but MAR's ingestion times out (this is not uncommon)? Does each image have its own separate time based on its ingestion or do the images being updated share a common time?
> What happens if the images are technically available but MAR's ingestion times out (this is not uncommon)?
Are you referring to situations when MAR's tag listing is out-of-sync with the actual container registry?
Waiting until new images are available on MAR seems like a requirement before setting the annotation on older images and pushing that change. We do not want to have a possible situation where we could have no images available without EOL annotations. I am not sure we can/should do anything additional if the MAR status API reports that artifact and content ingestion are both completed.
> What time should the EOL value represent? Is it based on when the current images have been successfully ingested by MAR? Does each image have its own separate time based on its ingestion or do the images being updated share a common time?
The simplest solution would be to use the current time whenever we are ready to push EOL annotations. That's independent of when MAR image and content ingestion are completed. Alternatively, we could mark the MAR availability time in the image-info file.
> Are you referring to situations when MAR's tag listing is out-of-sync with the actual container registry?
I'm referring to situations where MAR's status API keeps returning in progress for the image because they have some internal issue on their side but the new image is actually available to pull. In those cases, it is typically necessary to rerun the publish with "wait for ingestion" disabled. That would allow any annotation operations after that point to be executed.
Design proposal
Outline
A new Annotation stage will be added to the build pipeline.
The existing publishing stage will be updated to produce the EOL digests list (`EolDigests.json`).
Image Builder would be extended with two new commands: one for generating a list of EOL digests, and the other for applying EOL annotations.
Pipeline changes
- Pass the location of the data file as a parameter to the Annotation stage.
- Pass the current date as a parameter to the Annotation stage.
- Run the Annotation stage after the publishing stage.
- It would be valuable to block this stage with an approval request for the first few runs, to validate the produced data file.
Publishing stage changes
In a separate pipeline step, we will download the current image-info file before updating it with the new one. After the new image-info file has been generated and uploaded, we will run the new command to determine which images should be annotated, using the following logic:
- For each Dockerfile in the old image-info, find the corresponding entry in the new file.
- If the Dockerfile is missing, add the old digest value to the data file.
- If the Dockerfile is present but the digest values differ, add the old digest values to the data file, ensuring that we cover multi-arch scenarios.
Annotation stage:
The Annotation stage can be run independently, if necessary, to fix potential issues (e.g. missed or unwanted annotations). The Annotation stage would do the following:
- Download the annotation tooling.
- Obtain the data file using the value (path/URI) supplied in the `EolDigestList` parameter.
- If the value of the second parameter (`EolDate`) is not provided, it will default to an empty value.
- Run Image Builder with the annotation command.
Data-file processing logic
- For each digest, we execute the annotation tool to update the EOL info for that Docker image.
- If the date is empty/null, we remove the annotations from that Docker image. There is an open question about this.
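As a rough sketch of the processing loop, shelling out to the ORAS CLI with the same command shown earlier in this issue. Annotation removal is deliberately left out here, since it's still an open question whether the tooling supports it; the function names are hypothetical:

```python
import subprocess

LIFECYCLE_TYPE = "application/vnd.microsoft.artifact.lifecycle"
EOL_ANNOTATION = "vnd.microsoft.artifact.lifecycle.end-of-life.date"

def build_annotate_command(image_digest, eol_date):
    """Compose the oras invocation for a single digest."""
    return [
        "oras", "attach",
        "--artifact-type", LIFECYCLE_TYPE,
        "--annotation", f"{EOL_ANNOTATION}={eol_date}",
        image_digest,
    ]

def annotate_all(digests, eol_date, run=subprocess.run):
    for digest in digests:
        if not eol_date:
            # Open question: removing an existing annotation may require a
            # different tool/process; skip rather than guess here.
            continue
        run(build_annotate_command(digest, eol_date), check=True)
```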
Image Builder changes
- New command for generating the EOL digests data file: `GenerateEolDigestsList`
- New command for Docker image annotation: `ApplyEolDigestAnnotations`
Data file
The data file contains a list of old image digests that should be annotated. The digests are obtained from the image info file before it is updated with the new digests.
Open questions
- Should data file also contain a date value for each digest?
- Date value could be supplied as a stage parameter - i.e. current date
- If the supplied date value is null or empty, we would remove the annotations of all digests in the data file, essentially undoing the step if we're rerunning this stage
- Having a date value for each digest would enable fine tuning of the data file, if a rerun of the annotation stage is needed.
- Should the data file be committed to a repo (dotnet-docker or dotnet-buildtools-prereqs) as a uniquely named file, e.g. with the build ID in the name?
- This will allow easy modifications of the file in case tweaks are needed
- Annotation stage would not need a data file parameter
- Does the ORAS CLI tool support annotation removal? If not, we need to understand which other tools/processes can be used.
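To make the first open question concrete, a hypothetical data file with both a shared EOL date and per-digest overrides might look like this (the field names and digest values are illustrative, not a settled schema):

```json
{
  "eolDate": "2024-05-14",
  "digests": [
    { "digest": "myregistry.azurecr.io/dotnet/runtime@sha256:aaa", "eolDate": "2024-02-13" },
    { "digest": "myregistry.azurecr.io/dotnet/runtime@sha256:bbb" }
  ]
}
```

A digest without its own `eolDate` would fall back to the shared value, which keeps the common case small while still allowing fine-tuning on reruns.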
Not mentioned here is having a standalone pipeline which can update annotations independently of the image publishing pipeline.
We should also consider some validation behavior to ensure we're operating with correct state. For example, the normal scenario should never get into a state where it's setting an EOL annotation for a digest which already has an EOL annotation. It should produce an error in that case. But in cases where the standalone pipeline is used, this overwriting of an existing annotation should be allowed.
> Download the annotation tooling.
The ORAS tool should be included in the Image Builder image. It can be installed by the Dockerfile: https://github.com/dotnet/docker-tools/blob/main/src/Microsoft.DotNet.ImageBuilder/Dockerfile.linux. We only need it in the Linux Dockerfile, not Windows, because we can just execute the annotation logic on a Linux agent.
> Should data file be committed to a repo (dotnet-docker or dotnet-buildtools-prereq) as a uniquely named file, i.e. name contains build ID?
As long as it's stored as a pipeline artifact, that should be sufficient.
@NikolaMilosavljevic - We'll also need to account for the existing EOL images which need annotations. That's just a one-time operation to get that state set. I chatted with @MichaelSimons about this and we had some ideas on a strategy here.
We have historic Kusto data for all the images we've published going back to .NET Core 3.1 I believe. We can use that data to determine all the digests that have ever been published for each .NET version. So, for example, we could find all digests associated with .NET Core 3.1 and set their EOL annotation to the 3.1 EOL date.
For .NET versions that are currently in support, we'd have to supplement with the image info file data. Any digest that exists in the image file is currently in support and should be excluded from this EOL annotation operation.
For any other images (e.g. < 3.1) that are unaccounted for, they could just be set to an arbitrary date in the past.
> @NikolaMilosavljevic - We'll also need to account for the existing EOL images which need annotations. That's just a one-time operation to get that state set. I chatted with @MichaelSimons about this and we had some ideas on a strategy here.
>
> We have historic Kusto data for all the images we've published going back to .NET Core 3.1 I believe. We can use that data to determine all the digests that have ever been published for each .NET version. So, for example, we could find all digests associated with .NET Core 3.1 and set their EOL annotation to the 3.1 EOL date.
>
> For .NET versions that are currently in support, we'd have to supplement with the image info file data. Any digest that exists in the image file is currently in support and should be excluded from this EOL annotation operation.
>
> For any other images (e.g. < 3.1) that are unaccounted for, they could just be set to an arbitrary date in the past.
For this one-time operation we could use the standalone pipeline and supply the list of digests collected from Kusto.
From my post further above:
> To account for distro versions or .NET major versions going out of support, we need to account for Dockerfile removal. If a Dockerfile path is removed from the image info file as part of publishing, the digest associated with that file should be EOL-annotated. Similarly as with servicing, multi-arch tags should be accounted for as well. If a multi-arch tag has been removed from the image info file, the digest for it should be EOL-annotated.
This needs to be amended a bit. We need to consider our lifecycle policy for a .NET version becoming EOL:
> When a .NET version reaches EOL, its tags will continue to be maintained (rebuilt for base image updates) until the next .NET servicing date (typically on "Patch Tuesday", the 2nd Tuesday of the month).
Based on the algorithm I described above, we wouldn't actually set an EOL annotation on the Dockerfiles during this one-month period. But that's not the behavior we want. We want the EOL annotation to be set on the actual EOL date, not wait until we finally remove the Dockerfiles from the repo. To account for this scenario, we would need to know, programmatically, that the images currently being published are EOL. This can be done by reading the EOL date from the Release data using the Microsoft.Deployment.DotNet.Releases API.
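The EOL lookup could be sketched as below. This works over the already-parsed .NET releases index data (published as `releases-index.json` in the dotnet/core repo, the same information the Microsoft.Deployment.DotNet.Releases API surfaces), so no network access appears here; how the data is fetched is left out:

```python
from datetime import date

def channel_eol_date(releases_index, channel_version):
    """Return the declared EOL date for a .NET channel, or None."""
    for entry in releases_index["releases-index"]:
        if entry["channel-version"] == channel_version:
            eol = entry.get("eol-date")
            return date.fromisoformat(eol) if eol else None
    return None

def is_eol(releases_index, channel_version, today=None):
    """True if the channel's EOL date has passed."""
    eol = channel_eol_date(releases_index, channel_version)
    return eol is not None and eol <= (today or date.today())
```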
In addition to storing the annotated digest list as a pipeline artifact, we should also include this list of digests in the publish notification step. See https://github.com/dotnet/dotnet-docker-internal/issues/5295 (internal link) as an example of such a notification. The status of the annotation update step should also be included in the publish notification. For that reason, the publish notification would obviously need to run after the annotations are set.
That may impact how we structure the pipeline. The design proposal indicated that the annotation would be a separate stage. But that would force the publish notification step to be after that. We could consider just including all these steps in the publish stage. Is there a need to have separate stages?
> In addition to storing the annotated digest list as a pipeline artifact, we should also include this list of digests in the publish notification step. See dotnet/dotnet-docker-internal#5295 (internal link) as an example of such a notification. The status of the annotation update step should also be included in the publish notification. For that reason, the publish notification would obviously need to run after the annotations are set.
>
> That may impact how we structure the pipeline. The design proposal indicated that the annotation would be a separate stage. But that would force the publish notification step to be after that. We could consider just including all these steps in the publish stage. Is there a need to have separate stages?
There is no real need for a separate stage. These can all be steps in a publishing stage.
The other request is to implement a new standalone pipeline for annotations, to process images that are already EOL. But we could share the annotation step between the two pipelines.
> Based on the algorithm I described above, we wouldn't actually set an EOL annotation on the Dockerfiles during this one-month period. But that's not the behavior we want. We want the EOL annotation to be set on the actual EOL date, not wait until we finally remove the Dockerfiles from the repo. To account for this scenario, we would need to know, programmatically, that the images currently being published are EOL. This can be done by reading the EOL date from the Release data using the Microsoft.Deployment.DotNet.Releases API.
Due to this, we'll have different EOL dates for some digests in the EolDigests list (JSON file). It is necessary for this list to have an EOL date specified for each digest, which was the original proposal.
We could do one or both of the following:
- Have a shared EOL date in the JSON file, and optional EOL dates at the digest level
- Allow overriding the EOL date using a pipeline parameter, passed as an argument to Image Builder's annotation command
Open question: How do we specify EOL annotations data for the new standalone pipeline?
The main pipeline would consume this data from artifacts, as the data is generated in the 'EOL planning' step, but the standalone pipeline needs to accept data via a variable/parameter, using one of these models:
- Provide the complete JSON string as the value of a pipeline variable or parameter - unsure of the string limits and user experience
- Provide a filename of the JSON file under version control, i.e. in the `versions` repo, in a predefined location (dir)
- Provide a URL of the JSON file, in blob storage for instance - potential security implications
> Open question: How do we specify EOL annotations data for the new standalone pipeline?
>
> Main pipeline would consume this data from artifacts, as the data is generated in the 'eol planning' step, but standalone pipeline needs to accept data, via variable/parameter, using one of these models:
>
> - Provide complete json string as a value of pipeline variable or parameter - unsure of the string limits and user experience
> - Provide a filename of the json file under version control, i.e. in the `versions` repo, in predefined location (dir)
> - Provide a URL of the json file, in blob storage, for instance - potential security implications
I see the use of the standalone pipeline as very much a special case scenario, not something that would be used regularly. For that reason, I see no issue in running the pipeline to target an internal custom branch in the repo (e.g. dotnet/dotnet-docker) where the JSON file could be stored.
We'll also want to query the status of the ingestion of these annotations by MAR. This is done today with image tags via the WaitForMcrImageIngestionCommand class. Specifically, it's this service call that provides this info:
https://github.com/dotnet/docker-tools/blob/902feed98941fd2255455007b6edc167cb0e2976/src/Microsoft.DotNet.ImageBuilder/src/McrStatusClient.cs#L43-L47
According to the MAR team, that service API can also be used to query the status of these annotations. We just need to know the digest of the annotation that was pushed to our ACR. We'd then incorporate the querying of those digests as part of the wait logic in WaitForMcrImageIngestionCommand to ensure that all annotations have been ingested.
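The wait logic could be extended along these lines. `get_status` is a stand-in for the MCR status API call linked above (the actual request/response shape lives in `McrStatusClient`), and the injected `clock`/`sleep` hooks are just for testability:

```python
import time

def wait_for_ingestion(digests, get_status, timeout_sec=3600, poll_sec=30,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until every annotation digest reports successful ingestion."""
    pending = set(digests)
    deadline = clock() + timeout_sec
    while pending:
        # Drop digests that MAR reports as ingested.
        pending = {d for d in pending if get_status(d) != "Succeeded"}
        if not pending:
            break
        if clock() >= deadline:
            raise TimeoutError(f"annotations not ingested: {sorted(pending)}")
        sleep(poll_sec)
    return True
```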
I've been thinking about another scenario that we need to account for. In some cases, the pipeline doesn't have a successful run. For example, it may time out while waiting for images to be ingested by MAR. In that scenario, images will have been published but, because the timeout occurred, it never executed the step which updated the image info file. A retry would potentially resolve this. However, if the build was originally executed by AutoBuilder, and not followed up with a manual retry from a team member, another build will get executed by AutoBuilder later in the day. The images in the second build will supersede the images from the previous one, causing the older ones to be dangling images. Because those previous images were never represented in the image info file, they never get annotated.
One could argue that the publishing of the image info should happen earlier. But that could lead to false data in cases where there is some issue with MAR where it didn't ingest any images at all. Or there may even be a failure in the act of publishing the image info file.
The point is that the image info file doesn't necessarily describe all the images we need to consider. But can we constrain the scenarios that we need to consider for this? Constraining the scenarios is prudent because if you leave it wide open, then you'd essentially have to query every single digest in the registry checking to see if it needs to be annotated. That would be an expensive operation. If we could constrain the number of queries we needed to make, that would be better. One option would be to query for any existing images that are currently tagged with the tags that are being published by the build (this would obviously need to be queried before the current set of images get published). Another option is to query for previously failed builds and grab their image info files to limit what digests should be queried.
> The point is that the image info file doesn't necessarily describe all the images we need to consider. But can we constrain the scenarios that we need to consider for this? Constraining the scenarios is prudent because if you leave it wide open, then you'd essentially have to query every single digest in the registry checking to see if it needs to be annotated. That would be an expensive operation. If we could constrain the number of queries we needed to make, that would be better. One option would be to query for any existing images that are currently tagged with the tags that are being published by the build (this would obviously need to be queried before the current set of images get published). Another option is to query for previously failed builds and grab their image info files to limit what digests should be queried.
Another option that was thought of was to use Log Analytics on the ACR to get definitive data on what images had been pushed to the registry. This wouldn't require our own logging mechanism that would be subject to failure and unreliability. This had been determined to be the way to go. But upon implementation, it's been discovered that pushes of Windows images to the ACR produce "importImage" operations in the log instead of "Push" operations. Ok, that's fine. But the issue is that the Digest field is empty for these importImage operations. That's the key bit of information we need in order to find potential dangling images to be annotated.
Proposing a different direction that should greatly simplify things but has a broader impact on the use of the registry.
The design proposals above have been focused on dealing with the fact that there are thousands of manifests in the registry's repos. Because of that number, it's not feasible to query through all of them to determine what needs to be annotated.
The new proposal is all about cleaning out old images from the registry. Once an image has been EOL-annotated, it can be deleted from the registry (or given a grace period before deletion). This greatly reduces the number of manifests we would need to query to determine what needs to be annotated. Because we can then query all of the manifests without that big of a performance impact, it allows us to simply query for all digests that are not in the list of currently supported digests. Those would then be annotated and subsequently deleted from the registry.
This would then not rely on comparing the new image info versus the old image info. It only needs to take into account the state of the new image info.
Deletion is a hard topic. I thought we were not keen on that. I think we'd need to collect some data on how prevalent digest references are. We are certain to break builds. If we did this, we'd need to publish guidance that digest references are only supported as a temporary measure.
In terms of deletions, I'd start with previews. I see that as entirely fair game. I think we should delete all preview images that are > 3 months old. That should remove a lot of digests. In the case of STS, the preview and supported period are close to equal.
This is not about deleting images from MAR. It's only about deleting them from our ACR. That deletion won't propagate to MAR.
Oh! I see. If I reverse engineer what you are saying, I think you are positioning ACR as a backup for MAR, and EOL annotated images don't need that?
I still think we should delete all preview images.
This sounds like a good direction to move in - I am struggling to come up with a situation where we'd need an old image from our ACR. It seems like "replace a broken image with an older version" would almost never be preferable to "replace a broken image with a new build from a previous version of the Dockerfile".
The only situation I could think of is: 1. We released a broken image + 2. a distro's package feeds are down at the same time. But in that situation, we could probably just use old images from MAR.