
image-promotion hits 429 quota limits

Open chrischdi opened this issue 10 months ago • 16 comments

What happened:

  • The image promotion job ran
  • Image promotion failed due to unexpected status code 429 Too Many Requests

See https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-k8sio-image-promo/1776261613632884736

What you expected to happen:

  • Image promotion to succeed

How to reproduce it (as minimally and precisely as possible):

  • Run image promotion, probably several times in a row. In this case we (https://github.com/kubernetes-sigs/cluster-api-provider-vsphere) cut 3 patch releases.

Anything else we need to know?:

This issue has occurred before and was mistakenly reported at

  • https://github.com/kubernetes/k8s.io/issues/6431

Ben pointed out that:

The image promoter makes a really high amount of API calls because of the approach to image signatures. We have not changed the quotas in the infrastructure projects.

So there may be potential to optimise promo-tools so that it needs fewer API calls and stays under the limit.

Environment:

See the prowjob :-)

  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Others:

chrischdi avatar Apr 05 '24 17:04 chrischdi

I did try to look through the code a bit:

  • kpromo normally uses a rate-limiter when using the crane library
  • when using sigs.k8s.io/release-sdk/sign, e.g. in signAndReplicate (here), kpromo does not set the transport to add the rate-limiter, because release-sdk does not allow us to.
    • release-sdk runs the SignImageInternal function:
      • https://github.com/kubernetes-sigs/release-sdk/blob/main/sign/impl.go#L113-L114
    • which runs github.com/sigstore/cosign/v2/cmd/cosign/cli/sign.SignCmd(...)
      • SignCmd would allow passing a Transport (and with it a RateLimiter) through via signOpts.Registry.RegistryClientOpts

chrischdi avatar Apr 08 '24 12:04 chrischdi

Instead of adding rate-limiting, the other possibility would be to look into release-sdk and/or cosign and reduce the number of API calls they make.

chrischdi avatar Apr 08 '24 12:04 chrischdi

This is a known issue and we're planning a larger refactor of the promo-tools code base, see other issues in this repo for more information.

xmudrii avatar Apr 08 '24 12:04 xmudrii

This is a known issue and we're planning a larger refactor of the promo-tools code base, see other issues in this repo for more information.

What is the recommended action when our image promotions are failing with this error? I'm wondering how our users will be affected.

sbueringer avatar Apr 08 '24 12:04 sbueringer

What is the recommended action when our image promotions are failing with this error? I'm wondering how our users will be affected.

If promotion fails with an error such as:

run `cip run`: promote images: signing images: replicating signatures: copying signature ...

It's generally safe to ignore it. If it fails with any other error, the job should be restarted. You can ping Release Managers in the #release-management Slack channel to restart the job for you.

It shouldn't affect the ability to consume images, but signatures might not work properly or at all if this error happens. Unfortunately, there isn't much we can do at this point, but we hope we'll be able to kick off the promo-tools refactor efforts soon.

xmudrii avatar Apr 08 '24 13:04 xmudrii

similar failures in the patch release and minor releases for CAPI today. one patch release failing at the signing stage: https://prow.k8s.io/log?job=post-k8sio-image-promo&id=1780295493562142720

time="18:09:05.150" level=fatal msg="run `cip run`: promote images: signing images: replicating signatures: copying signature us-west2-docker.pkg.dev/k8s-artifacts-prod/images/cluster-api/clusterctl:sha256-e35d576ae8922459d284077fed7b2a49447b4cb835c69312327c52d75dafa8a4.sig to southamerica-west1-docker.pkg.dev/k8s-artifacts-prod/images/cluster-api/clusterctl:sha256-e35d576ae8922459d284077fed7b2a49447b4cb835c69312327c52d75dafa8a4.sig: PUT https://southamerica-west1-docker.pkg.dev/v2/k8s-artifacts-prod/images/cluster-api/clusterctl/manifests/sha256-e35d576ae8922459d284077fed7b2a49447b4cb835c69312327c52d75dafa8a4.sig: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per user' and limit 'Requests per project per user per minute per user' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'. (and 1 more errors)" diff=4.378s
{"component":"entrypoint","error":"wrapped process failed: exit status 

and the minor release job failing at filtering edges: https://prow.k8s.io/log?job=post-k8sio-image-promo&id=1780297426096099328

time="18:10:24.256" level=fatal msg="run `cip run`: promote images: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-central1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fcluster-api%2Fclusterctl%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per user' and limit 'Requests per project per user per minute per user' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'." diff=28ms
{"component":"entrypoint","error":"wrapped process failed: exit status 

cahillsf avatar Apr 16 '24 18:04 cahillsf

The first failure can be ignored, the second job should be restarted. Can you please send a link to the job so that we can restart it?

xmudrii avatar Apr 16 '24 18:04 xmudrii

@xmudrii thanks

sorry, it's this one: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-k8sio-image-promo/1780297426096099328

cahillsf avatar Apr 16 '24 18:04 cahillsf

@cahillsf Restarted the job and now it's green https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-k8sio-image-promo/1780300931636662272

xmudrii avatar Apr 16 '24 18:04 xmudrii

thanks for your help @xmudrii !

cahillsf avatar Apr 16 '24 18:04 cahillsf

possibly related: https://github.com/kubernetes-sigs/promo-tools/issues/842

BenTheElder avatar Apr 18 '24 21:04 BenTheElder

hit this with v1.30 release https://github.com/kubernetes/kubernetes/issues/126170

also the initial promo job didn't report failure, I think? but we didn't have all regions synced

BenTheElder avatar Jul 17 '24 20:07 BenTheElder

time="19:28:06.925" level=info msg="Registry: gcr.io/k8s-staging-scheduler-plugins Image: controller Got: gcr.io/k8s-staging-scheduler-plugins/controller" diff=141ms
time="19:28:07.077" level=fatal msg="run cip run: promote images: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-west1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fsig-storage%2Fsnapshot-controller%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per region' and limit 'Requests per project per region per minute per region' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'." diff=152ms

I'm guessing there is a gap in using the rate-limit aware client.

BenTheElder avatar Jul 17 '24 20:07 BenTheElder

https://github.com/kubernetes-sigs/promo-tools/issues/842 ?

BenTheElder avatar Jul 17 '24 22:07 BenTheElder

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 15 '24 22:10 k8s-triage-robot

/lifecycle frozen

BenTheElder avatar Oct 16 '24 00:10 BenTheElder