promo-tools
image-promotion hits 429 quota limits
What happened:
- The image promotion job ran
- Image promotion failed with "unexpected status code 429 Too Many Requests"
See https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-k8sio-image-promo/1776261613632884736
What you expected to happen:
- Image promotion to succeed
How to reproduce it (as minimally and precisely as possible):
- Run image promotion, probably multiple times in a row; in this case we (https://github.com/kubernetes-sigs/cluster-api-provider-vsphere) cut 3 patch releases.
Anything else we need to know?:
This issue already occurred in the past and was wrongly reported at
- https://github.com/kubernetes/k8s.io/issues/6431
Ben pointed out that:
The image promoter makes a really high amount of API calls because of the approach to image signatures. We have not changed the quotas in the infrastructure projects.
So there may be potential to optimise promo-tools to require fewer API calls and to not exceed the limit.
Environment:
See the prowjob :-)
- Cloud provider or hardware configuration:
- OS (e.g. `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):
- Others:
I did try to look through the code a bit:
- kpromo normally uses a rate-limiter when using the crane library
- when using `sigs.k8s.io/release-sdk/sign`, e.g. to `signAndReplicate` (here), kpromo does not set the transport to add the rate-limiter, because release-sdk does not allow us to
- release-sdk runs the `SignImageInternal` function:
  - https://github.com/kubernetes-sigs/release-sdk/blob/main/sign/impl.go#L113-L114
  - which runs `github.com/sigstore/cosign/v2/cmd/cosign/cli/sign.SignCmd(...)`
  - `SignCmd` would allow passing through a Transport (and because of that a RateLimiter) via `signOpts.Registry.RegistryClientOpts` (see the sketch below)

Instead of adding rate-limiting, the other possibility would be to take a look into release-sdk and/or cosign to reduce the number of API calls made.
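For illustration, here is a minimal sketch of what the rate-limiter pass-through could look like. The `rateLimitedTransport` wrapper, the `buildSignOpts` helper, and the limiter values are made up; only `remote.WithTransport` (go-containerregistry) and the `signOpts.Registry.RegistryClientOpts` field (cosign's `options.SignOptions`) come from the walkthrough above:

```go
package signthrottle // hypothetical package name

import (
	"net/http"

	"github.com/google/go-containerregistry/pkg/v1/remote"
	"github.com/sigstore/cosign/v2/cmd/cosign/cli/options"
	"golang.org/x/time/rate"
)

// rateLimitedTransport is an illustrative wrapper: it blocks until the
// limiter grants a token before forwarding each registry request.
type rateLimitedTransport struct {
	inner   http.RoundTripper
	limiter *rate.Limiter
}

func (t *rateLimitedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Wait honors the request context, so a cancelled request stops waiting.
	if err := t.limiter.Wait(req.Context()); err != nil {
		return nil, err
	}
	return t.inner.RoundTrip(req)
}

// buildSignOpts shows where the throttled transport would be injected.
func buildSignOpts() options.SignOptions {
	var signOpts options.SignOptions
	// 5 req/s with a burst of 10 is a made-up starting point; real values
	// would need tuning against the Artifact Registry per-minute quotas.
	signOpts.Registry.RegistryClientOpts = []remote.Option{
		remote.WithTransport(&rateLimitedTransport{
			inner:   http.DefaultTransport,
			limiter: rate.NewLimiter(rate.Limit(5), 10),
		}),
	}
	return signOpts
}
```

kpromo would still need a way to hand this transport to release-sdk, which is exactly the missing plumbing noted above.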
This is a known issue and we're planning a larger refactor of the promo-tools code base, see other issues in this repo for more information.
What is the recommended action when our image promotions are failing with this error? I'm wondering how our users will be affected.
If promotion fails with an error such as:
run `cip run`: promote images: signing images: replicating signatures: copying signature ...
it's generally safe to ignore it. If it fails with any other error, the job should be restarted. You can ping Release Managers in the #release-management Slack channel to restart the job for you.
It shouldn't affect the ability to consume images, but signatures might not work properly, or at all, if this error happens. Unfortunately, there's not much we can do at this point, but we hope we'll be able to kick off the promo-tools refactor efforts soon.
Similar failures in the patch and minor releases for CAPI today. One patch release failed at the signing stage: https://prow.k8s.io/log?job=post-k8sio-image-promo&id=1780295493562142720
time="18:09:05.150" level=fatal msg="run `cip run`: promote images: signing images: replicating signatures: copying signature us-west2-docker.pkg.dev/k8s-artifacts-prod/images/cluster-api/clusterctl:sha256-e35d576ae8922459d284077fed7b2a49447b4cb835c69312327c52d75dafa8a4.sig to southamerica-west1-docker.pkg.dev/k8s-artifacts-prod/images/cluster-api/clusterctl:sha256-e35d576ae8922459d284077fed7b2a49447b4cb835c69312327c52d75dafa8a4.sig: PUT https://southamerica-west1-docker.pkg.dev/v2/k8s-artifacts-prod/images/cluster-api/clusterctl/manifests/sha256-e35d576ae8922459d284077fed7b2a49447b4cb835c69312327c52d75dafa8a4.sig: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per user' and limit 'Requests per project per user per minute per user' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'. (and 1 more errors)" diff=4.378s
{"component":"entrypoint","error":"wrapped process failed: exit status
and the minor release job failed at filtering edges: https://prow.k8s.io/log?job=post-k8sio-image-promo&id=1780297426096099328
time="18:10:24.256" level=fatal msg="run `cip run`: promote images: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-central1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fcluster-api%2Fclusterctl%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per user' and limit 'Requests per project per user per minute per user' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'." diff=28ms
{"component":"entrypoint","error":"wrapped process failed: exit status
The first failure can be ignored; the second job should be restarted. Can you please send a link to the job so that we can restart it?
@xmudrii thanks
sorry, it's this one: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-k8sio-image-promo/1780297426096099328
@cahillsf Restarted the job and now it's green https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-k8sio-image-promo/1780300931636662272
thanks for your help @xmudrii !
possibly related: https://github.com/kubernetes-sigs/promo-tools/issues/842
Hit this with the v1.30 release: https://github.com/kubernetes/kubernetes/issues/126170
Also, the initial promo job didn't report failure, I think? But we didn't have all regions synced.
time="19:28:06.925" level=info msg="Registry: gcr.io/k8s-staging-scheduler-plugins Image: controller Got: gcr.io/k8s-staging-scheduler-plugins/controller" diff=141ms time="19:28:07.077" level=fatal msg="run
cip run
: promote images: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-west1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fsig-storage%2Fsnapshot-controller%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per region' and limit 'Requests per project per region per minute per region' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'." diff=152ms
I'm guessing there is a gap in using the rate-limit aware client.
https://github.com/kubernetes-sigs/promo-tools/issues/842 ?
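For contrast, the crane code path that kpromo already throttles can take a transport directly. A minimal sketch (the helper name is made up; `crane.Copy` and `crane.WithTransport` are real go-containerregistry APIs):

```go
package cranethrottle // hypothetical package name

import (
	"net/http"

	"github.com/google/go-containerregistry/pkg/crane"
)

// copyThrottled copies an image through crane with a caller-supplied
// rate-limited transport, mirroring the code path that already respects
// the quota; the signing path discussed above bypasses this wiring.
func copyThrottled(src, dst string, throttled http.RoundTripper) error {
	return crane.Copy(src, dst, crane.WithTransport(throttled))
}
```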
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/lifecycle frozen