
[infra.ci.jenkins.io] Builds stuck due to GH API rate limit

Open dduportal opened this issue 1 year ago • 21 comments

Service(s)

infra.ci.jenkins.io

Summary

Every time we apply a new JCasC change to infra.ci.jenkins.io, the JobDSL refresh (reapply?) kicks off a lot of Multibranch pipeline scanning, which makes us hit the "imposed" GH rate limit, locking builds for minutes or sometimes up to 1 hour (the time to get a renewed rate limit).

This unwanted behavior is slowing us down almost every week now: a fix or compensating measure is required.


There are multiple solutions here (to be discussed and reported in the team meeting when triaging this issue):

  • (to validate) Use `skipInitialBuildOnFirstBranchIndexing()` for MB folders (ref. https://community.jenkins.io/t/multibranch-and-configuration-reload-triggering-rebuild-branch-indexing/9590/2); see the sketch after this list
  • Disable tag discovery on all jobs not requiring it
  • Use multiple GitHub Apps to spread the load between jobs
  • Others?
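For the first option, a minimal Job DSL sketch of what a generated job could look like, assuming `skipInitialBuildOnFirstBranchIndexing()` is the build strategy provided by the basic-branch-build-strategies plugin (job, repository and credential names are illustrative):

```groovy
// Hypothetical sketch: keep branch/PR discovery, but skip the automatic build normally
// fired the first time a branch is indexed (e.g. right after a JCasC/JobDSL reload).
multibranchPipelineJob('docker-jobs/example-image') {
    branchSources {
        branchSource {
            source {
                github {
                    repoOwner('jenkins-infra')
                    repository('example-image')          // illustrative repository
                    credentialsId('github-app-infra-ci') // illustrative credential id
                }
            }
            buildStrategies {
                // assumed symbol from the basic-branch-build-strategies plugin
                skipInitialBuildOnFirstBranchIndexing()
            }
        }
    }
}
```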

Reproduction steps

No response

dduportal avatar Jul 09 '24 07:07 dduportal

> (to validate) Use `skipInitialBuildOnFirstBranchIndexing()` for MB folders (ref. https://community.jenkins.io/t/multibranch-and-configuration-reload-triggering-rebuild-branch-indexing/9590/2)

Tested this scenario on a local instance:

  • Once the first MB scan finished successfully, the following JCasC reloads did NOT trigger any more scanning (unless triggered manually or by a cron).
  • I was able to try a git hook (an easy one) and it triggered a scan and a build. Let's see how it behaves with the GitHub webhooks in production for infra.ci.

Todo list:

  • [x] Implement the chart JobDSL code: https://github.com/jenkins-infra/helm-charts/pull/1244
  • [x] Deploy the new setup on infra.ci.jenkins.io
  • [x] Validate the new behavior when reloading JCasC after the initial wave

dduportal avatar Jul 30 '24 09:07 dduportal

Update: after 2 weeks of usage, the number of "reaching the rate limit" events drastically decreased. But it still happens when touching the jobs configuration.

After checking, there does not seem to be a way to prevent re-indexing for Multibranch pipelines: it is only a feature of Organization folders.

We could redefine the jobs configuration tree using Organization folders, but at the cost of:

  • Grouping jobs with the exact same MB-level settings per top-level Org folder (as we cannot have configuration exceptions).
    • Good for Terraform and Docker jobs, possibly for the websites PRs, but bad for all other jobs.
  • And implementing the GH org. folder in our jenkins-jobs helm chart (see the sketch below)
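For reference, a hedged sketch of what a Job DSL-defined Organization folder could look like, assuming the dynamically generated `github` navigator syntax from the github-branch-source plugin (folder, owner and credential names are illustrative):

```groovy
// Hypothetical sketch: one Organization folder scanning a GitHub owner; all children
// share the exact same settings, which is why configuration exceptions are not possible.
organizationFolder('docker-jobs') {
    description('Docker image build jobs managed as a single Organization folder')
    organizations {
        github {
            repoOwner('jenkins-infra')              // illustrative owner
            credentialsId('github-app-docker-jobs') // illustrative credential id
            traits {
                gitHubBranchDiscovery { strategyId(1) }
                gitHubPullRequestDiscovery { strategyId(1) }
                // repository filtering traits would also live here
            }
        }
    }
}
```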

In the short term, I'll start splitting into multiple GH apps (one per directory) to unblock us.

dduportal avatar Aug 12 '24 09:08 dduportal

First try:

  • Created a new GH app named infra.ci.jenkins.io-docker-jobs
  • Only has access to 21 repositories defined in https://github.com/jenkins-infra/kubernetes-management/blob/589bd47280d110b170c1cde1aefdb82ea585145c/config/jenkins-jobs_infra.ci.jenkins.io.yaml#L21 (except jenkinsci/acceptance-test-harness which needs a special installation or a distinct GH credential)
(screenshot: repository access list of the infra.ci.jenkins.io-docker-jobs GitHub App, 2024-08-21)
  • Permissions are the ones defined in https://docs.cloudbees.com/docs/cloudbees-ci/latest/traditional-admin-guide/github-app-auth#_creating_the_github_app merged with the existing infra.ci.jenkins.io GH app, except:
    • "Repository -> Administration (Read Only)" => I believe we don't care about repositories being created or deleted
    • Code access is (Read Only) => it was (Read-Write) on the current GH app, but I would like to improve safety and forbid any kind of code change unless a specific credential is used (enforced by the pipeline library).

dduportal avatar Aug 21 '24 15:08 dduportal


Initial test looks good:

  • Set up (manually) the new Jenkins credential of type "GitHub App" in the "Docker Jobs" directory scope (so it's only available for Docker Jobs)
  • Changed the account-app job to use the new folder-scoped credential and triggered a new scan
  • Checked that the API rate limits were evolving differently between the top-level credential and the new one (tip: go to "Update credential" and click the "Test Connection" button, which shows the current rate limit => different limits per installed GH App in the organization)
  • Ran a full build on the principal branch, which triggered a tag push and a new release.

=> let's roll to persist the change!

dduportal avatar Aug 21 '24 16:08 dduportal

Update:

  • Set up custom GH Apps for both Docker and Terraform Jobs
  • Cleaned up GH Apps now unused (KodiakHQ, Octobox, Slack)
  • Restricted the existing infra.ci GH app to fewer repositories (removed all Docker and Terraform Jobs from it)

=> That won't prevent the "Scan Repo" wave, but it should greatly reduce how often we hit the API rate limit!

dduportal avatar Aug 21 '24 18:08 dduportal

> (tip: go to "Update credential" and click the "Test Connection" button, which shows the current rate limit => different limits per installed GH App in the organization)

The OpenTelemetry plugin can export metrics which show the rate limit usage over time, and these could be sent to Datadog. In case you're interested, it's how we monitor GitHub App rate limits.

timja avatar Aug 21 '24 21:08 timja

> The OpenTelemetry plugin can export metrics which show the rate limit usage over time, and these could be sent to Datadog. In case you're interested, it's how we monitor GitHub App rate limits.

Thanks Tim! Good to know if we hit the rate limit again!

dduportal avatar Aug 22 '24 06:08 dduportal

> Restricted the existing infra.ci GH app to fewer repositories (removed all Docker and Terraform Jobs from it)

Was a bit too restrictive: the (private) repository jenkins-infra/chart-secrets was missing, so kubernetes-management had been failing for 12 hours...

dduportal avatar Aug 22 '24 06:08 dduportal

Ah: found an error with the GH App credentials. The specified owner string `jenkins-infra` is changed to `jenkins` when JobDSL processes its configuration.

Incoming fix in the chart-generated JobDSL code: https://github.com/jenkins-infra/helm-charts/pull/1303

dduportal avatar Aug 22 '24 08:08 dduportal

  • All builds are now green \o/ (after the fix https://github.com/jenkins-infra/helm-charts/pull/1303)
  • No more blocked builds: I tried to reload JCasC 4 times manually and immediately triggered a Terraform build (production):
    • The Terraform build was processed immediately
    • No API rate limit issues anymore

I want to run a complete JobDSL reload (incoming fix for deprecated directives in our chart) before closing this issue to confirm we are not blocked anymore

dduportal avatar Aug 22 '24 09:08 dduportal

looks good!

dduportal avatar Aug 22 '24 10:08 dduportal

Reopening as the Docker GH app still hits its rate limit :'(

dduportal avatar Aug 22 '24 16:08 dduportal

The situation is clearly way better than before, but we still have improvements and fixes to make:

  • Must have:
    • Fix the Updatecli folder, which is failing. The success criterion is to remove the top-level update credential
    • Implement Organization folders for Docker Jobs and Update Jobs
      • Needs an initial validation on a test controller that we can avoid child job indexing when JCasC reloads JobDSL
  • Nice to have:
    • Get rid of discovering (and building) tags for Docker Jobs:
      • Change the buildDockerAndPublishImage pipeline library to build the tag during the principal-branch semantic release and only push a git tag for audit purposes
      • Set up the main branch to remove old builds only if they are older than 6 months, so we can keep audit logs (see the sketch after this list)
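A minimal declarative-pipeline sketch of the retention idea, assuming a ~6-month window expressed in days (this is not the actual pipeline-library code):

```groovy
// Hypothetical sketch: keep build records (and their audit logs) for roughly 6 months
// instead of a short max-builds rotation.
pipeline {
    agent any
    options {
        // core logRotator build discarder; 180 days ~= 6 months
        buildDiscarder(logRotator(daysToKeepStr: '180'))
    }
    stages {
        stage('Build') {
            steps {
                echo 'build and publish the Docker image here'
            }
        }
    }
}
```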

dduportal avatar Aug 23 '24 10:08 dduportal

> Needs an initial validation on a test controller that we can avoid child job indexing when JCasC reloads JobDSL

We run a few organisation folders and I've never come across the issue you're currently hitting (we don't use any multibranch folders).

timja avatar Aug 23 '24 10:08 timja

> We run a few organisation folders and I've never come across the issue you're currently hitting (we don't use any multibranch folders).

Yep, I believe it should work, thanks for the confirmation! We should report the issue though: Multibranch jobs should have a trait to prevent this indexing during a JobDSL reload

dduportal avatar Aug 23 '24 10:08 dduportal

Update:

  • Updatecli, Terraform and Docker jobs are using their respective per-folder credentials now!

=> Now let's work on support for Organization folders

dduportal avatar Aug 23 '24 13:08 dduportal

Update: the situation is improved but we're not out of the woods yet.

  • The key consumer is the "tag discovery" of MB jobs. Since we have hundreds of tags for each Docker-* (and packer-images) repository, and all of them currently use a "push tag -> trigger build -> release" flow, changing this behavior needs work

  • Splitting the GH app into 4 (Docker, Updatecli, Terraform, others) helped to unblock the kubernetes and terraform runs, which decreases the priority here

    • Good for splitting permissions, bad for managing these apps, and it decreases the rate limit available to each
    • We need to add others: one for Kubernetes (includes a private repo), another for the websites PRs, and a last one for websites deployments
  • Work in progress on implementing a JobDSL-defined Organization (scanning) Folder in our custom Helm chart. Worth working on:

    • Needed to decrease the number of indexing scans of MB child jobs for Docker, Terraform and Updatecli (additionally: simplifies the jobs setup)
    • Required for updatecli pipelines separation
    • Required for https://github.com/jenkins-infra/helpdesk/issues/3071
  • Work in progress on fine-tuning the current MB jobs indexing: caught (while working on org folders) that we try to index them every 2 hours. Setting a lower frequency would also spread the rate limit usage (see the sketch below)!
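A minimal Job DSL sketch of that lower indexing frequency, assuming the dynamically generated `periodicFolderTrigger` syntax (job name and interval are illustrative):

```groovy
// Hypothetical sketch: index the Multibranch folder once a day instead of every 2 hours.
multibranchPipelineJob('docker-jobs/docker-404') {
    triggers {
        periodicFolderTrigger {
            // accepts Jenkins time strings such as '2h', '12h', '1d', '7d'
            interval('1d')
        }
    }
}
```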

The next steps are heavier but would be worth it:

  • Redefine the Docker build process (https://github.com/jenkins-infra/pipeline-library/blob/master/vars/buildDockerAndPublishImage.groovy)
    • An issue is needed in the repository to host the discussion
    • Goal: change the release pattern to avoid discovering tags in infra.ci jobs, so it can sustain years of building images
    • Part of this discussion must include garbage collection: how long do we want to keep:
      • Docker image tags?
      • GitHub tags and releases for these Docker images?
      • Release job logs for these images?

dduportal avatar Aug 27 '24 07:08 dduportal

Minor update: https://github.com/jenkins-infra/kubernetes-management/pull/5640 should decrease the "waves" of indexing when reloading to only 1 wave

dduportal avatar Aug 30 '24 05:08 dduportal

Update:

  • The PR https://github.com/jenkins-infra/kubernetes-management/pull/5644 triggered a restart of infra.ci which never came back online.
  • Logs were showing the following JobDSL error:
    groovy.lang.MissingMethodException: No signature of method: javaposse.jobdsl.dsl.Folder.multibranchPipelineJob() is applicable for argument types: (java.lang.String, script$_run_closure2) values: [docker-jobs/xxx, script$_run_closure2@19995442]
  • Short term fix: reverted the 2 previous jenkins-jobs chart releases (https://github.com/jenkins-infra/kubernetes-management/pull/5640 and https://github.com/jenkins-infra/kubernetes-management/pull/5641) to get infra.ci back online.
  • Current status: infra.ci is up but the kubernetes-management job is failing due to a helm state inconsistency ("update/rollback in progress"). Keeping it like this until a fix is found and tested.
  • I can reproduce the error only with the helm chart on a local k3d: working on it

dduportal avatar Sep 02 '24 07:09 dduportal


OH: https://github.com/helm/helm/issues/5446#issuecomment-866994215

> However removing any instance of a space before a newline fixed it.

dduportal avatar Sep 02 '24 07:09 dduportal

Looks like the infra.ci errors are resolved 👍

dduportal avatar Sep 02 '24 13:09 dduportal

Let's resume this topic again: the core consumer of GitHub API requests on infra.ci is Multibranch jobs which discover tags. As we reach a high number of tags for each project, each JCasC reload (or a webhook telling a Multibranch job to scan its repository) generates an insane amount of requests.

We should keep distinct kinds of GH Apps to split the usage by kind, as a safety measure.

However, if we stop discovering tags for at least the Docker image builds, we should make the problem go away (or at least contain it so we no longer need to worry about it).

Opened https://github.com/jenkins-infra/pipeline-library/issues/918 to work on this topic specifically on the pipeline library.

Note: this change is breaking and will require a job configuration change on infra.ci.jenkins.io (it can be applied to one project for initial testing though); see the sketch after the list below:

  • No more tag discovery
  • No more tag-triggered builds
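A minimal Job DSL sketch of the resulting branch source, assuming the dynamically generated trait symbols from the github-branch-source plugin (job, repository and credential names are illustrative):

```groovy
// Hypothetical sketch: discover branches and pull requests only; with no
// gitHubTagDiscovery() trait, tags are neither scanned nor built.
multibranchPipelineJob('docker-jobs/docker-404') {
    branchSources {
        branchSource {
            source {
                github {
                    repoOwner('jenkins-infra')
                    repository('docker-404')
                    credentialsId('github-app-docker-jobs') // illustrative credential id
                    traits {
                        gitHubBranchDiscovery { strategyId(1) }      // branches not also filed as PRs
                        gitHubPullRequestDiscovery { strategyId(1) } // PRs merged with their target branch
                        // gitHubTagDiscovery() intentionally omitted
                    }
                }
            }
        }
    }
}
```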

Assigning to @jayfranco999 for initial information gathering

dduportal avatar Mar 21 '25 10:03 dduportal

Update:

  • https://github.com/jenkins-infra/pipeline-library/issues/918 works as intended for PR builds without introducing any regression. The 'deploy' and 'release' stages have been restructured to ensure the build deploys both the latest and semantic-version docker images only when triggered on the principal branch (see the sketch after this list)

  • Tag discovery has been disabled on Docker-404 to test the new behavior of the pipeline script.

  • https://github.com/jenkins-infra/docker-404/pull/40 has been opened to test the pipeline script. The PR builds work as intended. Once reviewed, we can merge the PR to test the deploy and release stages of the pipeline.
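For illustration, a minimal declarative-pipeline sketch of that restructuring (this is not the actual pipeline-library code; stage names, image names and the 'main' principal branch are assumptions):

```groovy
// Hypothetical sketch: every build (including PRs) builds and tests the image, while
// deploy/release only run on the principal branch, replacing tag-triggered releases.
pipeline {
    agent any
    stages {
        stage('Build & Test') {
            steps {
                sh 'docker build -t example-image:ci .'
            }
        }
        stage('Deploy') {
            when { branch 'main' } // assumption: 'main' is the principal branch
            steps {
                sh 'docker tag example-image:ci example.registry/example-image:latest'
                sh 'docker push example.registry/example-image:latest'
            }
        }
        stage('Release') {
            when { branch 'main' }
            steps {
                // push the semantic-version image and a git tag kept only for audit purposes
                sh 'docker tag example-image:ci example.registry/example-image:1.2.3'
                sh 'docker push example.registry/example-image:1.2.3'
                sh 'git tag 1.2.3 && git push origin 1.2.3'
            }
        }
    }
}
```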

jayfranco999 avatar Apr 10 '25 08:04 jayfranco999

  • https://github.com/jenkins-infra/pipeline-library/issues/918 is now closed.
  • https://github.com/jenkins-infra/kubernetes-management/pull/6437 did stop tag scanning for all the Docker jobs

=> we merged 3 PRs on the 3 docker-jenkins* repositories and delivered them to infra.ci without hitting any rate limit. Congrats 👏

For later readers: we keep the distinct GitHub Apps between Docker Jobs, Terraform Jobs, and the others, as a safety measure.

dduportal avatar Apr 11 '25 09:04 dduportal