Improve behavior for ClusterImagePolicy with multiple keys
Description
When providing a ClusterImagePolicy with multiple (N) keys in spec.authorities, policy-controller loops over all keys, calling cosign package to perform verification, which in turn pulls the signature image N times.
https://github.com/sigstore/policy-controller/blob/v0.3.0/pkg/webhook/validator.go#L610-L614
https://github.com/sigstore/policy-controller/blob/v0.3.0/pkg/webhook/validation.go#L47-L64 - looping over keys
https://github.com/sigstore/policy-controller/blob/v0.3.0/pkg/webhook/validation.go#L73-L81 - calling cosign.VerifyImageSignatures
The cosign.VerifyImageSignatures call pulls image, which in turn means that having N signatures is calling the image registry N times (or until a valid signature is found).
So, with 15 keys and 3 signatures attached to an image, this means the pull will be done 45 (15 * 3) times.
One option to improve is to download the signature to a local file and use cosign.VerifyLocalImageSignatures which is not doing any networking and should be much faster.
https://github.com/sigstore/policy-controller/blob/v0.3.0/pkg/webhook/validator.go#L610-L614 also mentioned a comment about what to do with multiple keys, I am not sure if this should be same or another issue, but I'll add it here as it may be related.
I suggest this defaults to any key working as it is today, but ideally, spec.authorities could have a field like threshold or minimumSignatures that would specify minimum number of signatures that have to be present - for example 3 out of 5. This would be similar to the threshold option in connaisseur, as documented here.
I am using and checked the source code for version v0.3.0.
Initially I thought this is pulling the signature OCI image N * M times (N = number of public keys, M = number of signatures) and that's what I wrote. Then I realized the signature is only pulled once in the loop over public keys, so it is pulling the image N times. Updated description just now.
@wojciechka We are working on improvements regarding the overall performance of the controller. I agree we could improve that piece of code to be more effecting when pulling the signature, but it also depends on the type of algorithm of the public key (e.g. sha512). Maybe we could split that function in a way where the signature is pulled before creating the verifier. On the other hand, we are pulling the manifest, not the image.
(...) but it also depends on the type of algorithm of the public key (e.g. sha512).
@hectorj2f Good point! I did not think of that - different algorithms will yield a different image ref.
Perhaps there a short-lived cache of pulled signatures could work - either only for duration of the single request or a short-lived cache (i.e. 1 or 5 minutes and the time could be made configurable) for signatures that were successfully validated? This would reduce amount of work needed when rotating or creating large number of pods with same images.
In our case we also run multiple sidecar containers (i.e. vault-agent, telegraf) so not having to query those images every time would make it less error prone.
This may mean changes in policy-controller, sigstore and/or cosign at the same time, so maybe it's not simple from implementation perspective.
On the other hand, we are pulling the manifest, not the image.
Indeed, it's not pulling a lot of data, but each https request piles up total time, especially if the registry with signatures is in another region.
The timeout for webhook is 10s and pulling the signature 15 times for 15 public keys is already taking 1-2 seconds on average where registry and the cluster are both in US-based locations.
I am hoping to find out if anything is happening with this issue?
The reason I ask is that I've seen this cause another problem - with around 10 keys and more workloads enabled, policy-controller may be triggering quotas/limits on pulling images from some repositories - such as Google Artifact Registry. Last week I saw it timing out to pull signatures for any image, failing to validate updates to Deployments, StatefulSets and other objects that include images. Removing the namespace annotation enabling policy-controller for a few hours and then enabling it again solved it, it seems like it was throttling and just pausing for a while resolved it.
I saw some caching scaffolding being added, I suspect the next step is to add it to places where signatures are pulled? I do not think I understand policy-controller enough to tackle this entirely on my own, but I could try to implement it with enough guidance from folks that know the code better.
@wojciechka We have some work around the cache, but I don't have a specific date when this feature could land. I'll let you know as soon as I know.