[ci.jenkins.io][Infra-as-code] Define Job Configuration as code with JobDSL
Summary
Switch to managing job configuration as code for ci.jenkins.io instead of managing it manually.
Why
There is no audit trail of which configuration changes were applied to the jobs on ci.jenkins.io. This is a concern for:
- Security: no audit log, and a painful update process: an admin must manually change the configuration (and possibly take screenshots, trigger a re-scan, and validate).
- Maintainability: manually managing a Jenkins instance is not sustainable for ci.jenkins.io: we are 10 to 20 (maybe more 😱 ) admins, with no planning and no centralized notification, so anyone can break the instance at any moment without the others even being aware (or worse: concurrent plugin upgrades \o/) (same problem as #3070 )
Also, managing job configuration as code would allow us to remove the "job config" plugin, which is known to slow down instances.
What
- The Job DSL Plugin is a good candidate, because it can be integrated with JCasC. The infra team has used it for infra.ci for months and even made a custom helm chart to "templatize" the definitions: https://github.com/jenkins-infra/helm-charts/tree/main/charts/jenkins-jobs (with support for per-project credentials)
- An alternative is to use Groovy scripts at runtime, but I'm not sure how that works in real life.
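To illustrate the JCasC integration mentioned above, here is a minimal sketch of a JCasC fragment that seeds jobs through Job DSL. The top-level jobs key with script entries follows the plugin's JCasC wiki page linked in this issue; the folder name, job name and repository URL are hypothetical placeholders, not the actual ci.jenkins.io layout.

```yaml
# JCasC fragment (sketch): assumes the configuration-as-code and job-dsl
# plugins are installed, plus the multibranch pipeline plugin for the second script.
jobs:
  - script: |
      // Hypothetical folder grouping the plugin jobs
      folder('Plugins') {
        description('Jobs defined as code')
      }
  - script: |
      // Hypothetical multibranch job; the name and remote are placeholders
      multibranchPipelineJob('Plugins/example-plugin') {
        branchSources {
          git {
            id('example-plugin')
            remote('https://github.com/jenkinsci/example-plugin.git')
          }
        }
      }
```

With a fragment like this kept in version control, each job change would go through a reviewed commit, which would provide the audit trail that is missing today.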
⚠️ Using job-dsl introduces the following challenges to be aware of:
- Reproducibility / maintainability: the development lifecycle is not easy and requires Jenkins expertise to really understand the domain.
- Leveraging: the plugin is extensively documented (https://github.com/jenkinsci/job-dsl-plugin/wiki/Job-DSL-Commands, https://github.com/jenkinsci/job-dsl-plugin/wiki/JCasC , https://github.com/jenkinsci/job-dsl-plugin/wiki/Real-World-Examples, https://github.com/jenkinsci/job-dsl-plugin/wiki/User-Power-Moves, https://github.com/jenkinsci/job-dsl-plugin/wiki/Testing-DSL-Scripts) and also provides a useful API viewer on Jenkins controllers as per https://plugins.jenkins.io/job-dsl/#plugin-content-documentation. Ideally, we could re-use the helm-chart template to benefit from the work already done in the Kubernetes world (helm template should be good enough :) ) without requiring a migration to Kubernetes (that would be another topic for ci.jenkins.io).
- Job scanning / scalability: given the huge number of jobs (plugins...), the time required for job-dsl to scan the job configuration during a restart could be an issue. It's hard to evaluate, to be fair.
Issues that could have benefited from this:
- https://github.com/jenkins-infra/helpdesk/issues/1355
- https://github.com/jenkins-infra/helpdesk/issues/2832
- https://github.com/jenkins-infra/helpdesk/issues/2776
- https://github.com/jenkins-infra/helpdesk/issues/1441
- https://github.com/jenkins-infra/helpdesk/issues/2466
- https://github.com/jenkins-infra/helpdesk/issues/1834
- https://github.com/jenkins-infra/helpdesk/issues/3059
- https://github.com/jenkins-infra/helpdesk/issues/2358
- https://github.com/jenkins-infra/helpdesk/issues/707
Is there a reason to not move ci.jenkins.io to k8s?
When the AKS cluster was created (a long time ago):
- ci.jenkins.io was running Jenkins with the official Ubuntu package (no Docker, no JCasC). There was not enough Jenkins administration expertise to run it in a container in production.
- There were no instances providing the 64 GB of memory that the VM hosting ci.jenkins.io currently has, plus the required I/O bandwidth.
Both of these historical reasons have been gone for months, if not years:
- Kubernetes is clearly better to manage Jenkins controllers in our context
- AKS supports instances of all shapes and sizes, enough to support ci.jenkins.io
- ci.jenkins.io now runs in a Docker container and is already partly managed with JCasC (not all of it yet)
I don't see any reason not to move ci.jenkins.io to k8s honestly, only benefits:
- Centralized management (not split between puppet and kube)
- Reproducibility (easier to spawn a testing instance)
- Ability to benefit from the jenkins-jobs system
- Safer: centralized credentials in sops (e.g. with cloud vault and/or GPG, with private repository)
- Migrating to k8s with a brand new JENKINS_HOME would clearly remove a lot of transient issues caused by years of using the same data volumes, with tons of deprecated or missing XML config files (a look at the ci.jenkins.io logs clearly shows this).
There are 2 blocking points though:
- #3070 and #2708 are required (e.g. getting rid of all the legacy groovy scripts)
- Having a Docker image supporting cgroups v2 (requires JDK17 or the brand new JDK 11.0.16, which backports this feature to JDK11)