Enable Image Signing
Overview
This issue covers the implementation details needed to enable https://github.com/dotnet/dotnet-docker/issues/4589
I have created a "vertical slice" of the signing infrastructure to verify that everything works. This is in the form of a pipeline that takes a pre-built image, signs it using our signing service, and verifies the signature. The pipeline code is here [internal MSFT link].
How images are signed
- Build the Images
- Generate Signing Payloads: Once images are built, generate a signing payload for each manifest and manifest list we intend to sign. The payload contains the OCI descriptor of the image to be signed; the descriptor can be retrieved with the ORAS CLI.
- Sign Payloads: Send all of the payloads off to the signing service at once. Our signing service will update the payloads in-place with the signed versions.
- Attach Signatures: Next, the signed payload must be attached to the image as an artifact. This is done with the ORAS CLI (see the sketch after this list).
- Using the signed payload, read the COSE signature to compute the x509 certificate chain. This is added to the signature artifact as an annotation. There is a reference implementation for this here in Python, which we'll need to port to .NET.
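As a rough sketch of those two ORAS CLI interactions (the registry reference, digests, file names, and thumbprint value are placeholders; the artifact type and annotation key are the ones defined by the Notary Project signature spec):

```powershell
# Fetch the OCI descriptor of the manifest to sign; this is what goes into the signing payload.
oras manifest fetch --descriptor myacr.azurecr.io/dotnet/runtime@sha256:<digest> > payload.json

# After the signing service returns the COSE signature, attach it to the image as an artifact.
# The certificate chain thumbprints are computed from the COSE signature as described above.
oras attach `
    --artifact-type application/vnd.cncf.notary.signature `
    --annotation "io.cncf.notary.x509chain.thumbprint#S256=<thumbprint-list>" `
    myacr.azurecr.io/dotnet/runtime@sha256:<digest> `
    signature.cose:application/cose
```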
How images are verified
- Download the root and/or issuer certificates corresponding to the signing key used for the images. All official and test certificates are accessible from public download links.
- Add the certs to the Notation CLI in a new trust store.
- Define and import a Notation Trust Policy that trusts all of the certs in the aforementioned trust store.
- Then, you can verify the image with the Notation CLI, as sketched below.
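A minimal sketch of that flow, assuming a trust store named `dotnet` and placeholder file names and references:

```powershell
# Add the downloaded root certificate to a new trust store named "dotnet".
notation cert add --type ca --store dotnet .\dotnet-root.crt

# Import a trust policy that trusts all certs in that store.
notation policy import .\trustpolicy.json

# Verify the signature attached to a published image.
notation verify myacr.azurecr.io/dotnet/runtime@sha256:<digest>
```

where `trustpolicy.json` could look something like this (following the trust policy schema from the Notation docs; the policy name is made up):

```json
{
  "version": "1.0",
  "trustPolicies": [
    {
      "name": "dotnet-images",
      "registryScopes": [ "*" ],
      "signatureVerification": { "level": "strict" },
      "trustStores": [ "ca:dotnet" ],
      "trustedIdentities": [ "*" ]
    }
  ]
}
```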
Design Proposal
Prerequisites:
- [x] https://github.com/dotnet/docker-tools/pull/1311
- [ ] Add Notation CLI to ImageBuilder Image
Pipeline Changes
Signing must take place after all images are built. We'll need a list of all of the digests we built in order to know what to sign, and our usual image-info.json file is a natural fit. That means signing will need to wait until after the Post-Build stage, which assembles the image info file describing all of the images we built. Furthermore, if we wish to sign manifest lists, we'll need to wait until after the Publish Manifest stage (not the most descriptive name), which creates and pushes all of the Docker manifest lists for multi-platform tags. We could also consider moving the manifest list creation into the Post-Build stage.
The pipeline changes should allow signing with either test keys or production keys, and also provide the option to skip signing altogether.
ImageBuilder Changes
Since we need to send signing payloads off to an external service via a pipeline task, the signing implementation in ImageBuilder must be split into at least two separate parts: before and after sending signing payloads to the signing service.
New Command: GenerateSigningPayloads
- Inputs: image-info.json file, output directory for signing payloads.
- Outputs: one signing payload per image, placed in the output directory. We may also find it useful to output an info file mapping each digest to the path of its payload file, which we could pass to the next stage.
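For illustration, if we follow the Notary Project payload format, each payload would wrap the image's OCI descriptor in a `targetArtifact` field (the digest and size here are placeholders):

```json
{
  "targetArtifact": {
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "digest": "sha256:<image-manifest-digest>",
    "size": 16724
  }
}
```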
New Command: AttachSignatures
- Inputs: either a path to a directory containing signing payloads, or the output file from the GenerateSigningPayloads command.
- Outputs: list of signature digests.
This will be similar to ApplyEolDigestAnnotations from the current image lifecycle annotation work. There is an opportunity to share parts of the implementation here.
In the case of failure, the command should output a list of digests and/or payloads that did not get their signature attached. We should also investigate what happens when there are multiple attempts to attach signatures to the same digest; there should not be two signatures referring to the same digest. One solution could be to remove any old signatures and re-attach newly created ones on a re-run of the pipeline.
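That check could presumably use ORAS to list a digest's existing referrers before attaching a new signature (the image reference is a placeholder):

```powershell
# List any Notary signatures already attached to this digest.
oras discover --artifact-type application/vnd.cncf.notary.signature `
    myacr.azurecr.io/dotnet/runtime@sha256:<digest>
```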
Verifying signatures
This is relatively straightforward: it could be done during our test leg, or immediately after attaching signatures. The Notation CLI and ORAS only interact with registries, not the local Docker installation. This means verifying signatures does not require pulling any images, so this should be a relatively lightweight process. Performing this check outside of ordinary test infrastructure also means that this check could run on all repos without any test infrastructure changes.
Unknowns
Does ACR import between ACRs leave artifacts and referrers intact?
We may consider using ORAS to move images between ACRs instead of acr import, to keep signatures intact.
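If acr import turns out to drop referrers, a recursive ORAS copy could move an image together with its signatures (the registry names and tag here are placeholders):

```powershell
# Copy the image and, recursively, its referrer artifacts (e.g. signatures) to the destination ACR.
oras cp --recursive `
    srcacr.azurecr.io/dotnet/runtime:8.0 `
    dstacr.azurecr.io/dotnet/runtime:8.0
```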
Effect on our ACR and clean-up schedule
In order to maintain the integrity of verified content, signatures exist in the registry just like images do, and simply refer to the digest of the image to be verified. Since signatures are stored in the registry like other artifacts, they may also need to be cleaned up (needs more investigation) along with the images they refer to. In our cleanup pipelines, we may consider using ORAS to check for and delete associated artifacts when cleaning up old images in our ACR.
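Using oras discover (as sketched earlier) to find the attached signatures, cleanup could then delete them by digest (the reference is a placeholder, and whether this is the right deletion mechanism for ACR needs the same investigation):

```powershell
# Delete a signature artifact by its digest before deleting the image it refers to.
oras manifest delete --force myacr.azurecr.io/dotnet/runtime@sha256:<signature-digest>
```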
> we'll need to wait until after the Publish Manifest stage
If I follow, this implies signing would happen in the publish stage. You also mention:
> verifying signatures does not require pulling any images, so this should be a relatively lightweight process. Performing this check outside of ordinary test infrastructure also means that this check could run on all repos without any test infrastructure changes.
Is that to say all signing verification will happen in the publish stage and not in the testing stage? Are there alternative options? E.g., is it possible to sign before tests? There seem to be some pros and cons here.
What is your estimate on the perf impact of signing a full build?
> If I follow, this implies signing would happen in the publish stage.
> Is that to say all signing verification will happen in the publish stage and not in the testing stage?
To be more clear, the only constraint on when signing happens is that all images and manifest lists to be signed must be pushed to the registry. If we move the manifest list creation step to the Post-Build stage, then we should be free to sign and validate signatures any time during or after the Post-Build stage.
> What is your estimate on the perf impact of signing a full build?
I will work on creating a synthetic test to measure how long it takes for the signing service to process the worst-case number of payloads. That will help inform whether we need to run signing in parallel to any other stages.
The largest number of digests we publish at once (on Patch Tuesdays) is around 542. I got this number by counting the unique lines containing "digest" in dotnet-docker's image-info file:
```powershell
(Get-Content .\image-info.dotnet-dotnet-docker-main.json | Select-String -Pattern "digest" | ForEach-Object { $_.Line.Trim() } | Get-Unique).Count
```
I tested this worst-case scenario with our signing service. COSE signing 500 unique JSON payloads with the test key took 6m 18s end-to-end (time spent running the AzDO task). This was also with inline validation enabled, meaning that the service supposedly validates the signatures on the files before handing them back; we should still validate the images ourselves, of course.
My first thought is that it seems valuable to run this process in parallel with tests, since that is a fair amount of time. Validating the signatures on the images ourselves will also likely take at least a couple of minutes on top of that, I assume. What are others' thoughts? @mthalman @MichaelSimons
Something like this:
```mermaid
flowchart LR
    Build --> Post-Build --> Test --> Publish
    Post-Build --> Sign --> Publish
```
Yep, I'm on board with that workflow. I assume both signing and validation would happen in the Sign stage? That seems preferable to doing anything in Post-Build because that would just delay getting the Test stage started.
I'm on board as well.