                        Secure the public Jenkins server
This issue is to encapsulate the requirements and work needed to replace our existing Jenkins server with a new one:
The existing Jenkins server is suffering from some instability. Recent difficult updates have shown that we need a better disaster recovery plan (this is also related to https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1295). It is also an Ubuntu 16.04 machine. For all of these reasons, and likely others not yet listed, we should begin the work to replace the existing Jenkins server with a new one.
Requirements:
- Similar requirements to existing server
- Ansible playbook for creation/spinup
- Staging server to try out upgrades and major features or changes
- Clearly documented process for deploying updates
- Back up / disaster recovery process (in place and tested) / training for multiple people in multiple timezones to be able to assist
- Policy & documentation for how many jobs to keep in history, how many days to keep jobs
- How many of them are required: 1
Please explain what this machine is needed for:
- nightly/weekly/release builds and testing
- Grinders for debugging and triage
- building tools and dependencies
- building Docker images
- building installers
Considerations:
- Given that this server is for our production builds, should we limit freeform builds, personal duplicates of existing pipelines, etc.?
- Should we have a separate sandbox for 'experimental' jobs and work?
- We should have a script that checks for jobs that have not been run/used in X months to remove old/stale jobs (may not need if all jobs have policy for how many days to keep jobs)
- Consider pushing artifacts to artifactory or other storage so that they are available for longer than
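A minimal sketch of the stale-job check suggested in the considerations above. It assumes the on-disk layout seen elsewhere in this issue (each job directory holds a `config.xml` and a `builds` directory) and that the modification time of a build directory is a fair proxy for when the build ran. The function name `find_stale_jobs` and the 90-day threshold in the usage line are hypothetical, and the script only reports candidates rather than deleting anything.

```python
import time
from pathlib import Path


def find_stale_jobs(jobs_root, max_age_days, now=None):
    """Return job directories whose newest build is older than max_age_days.

    Assumption: a job is any directory containing config.xml and a
    'builds' subdirectory. Jobs that have never been built are reported
    as stale too.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for config in Path(jobs_root).rglob("config.xml"):
        job_dir = config.parent
        builds = job_dir / "builds"
        if not builds.is_dir():
            continue  # e.g. a folder, which has no builds of its own
        mtimes = [p.stat().st_mtime for p in builds.iterdir() if p.is_dir()]
        newest = max(mtimes, default=0.0)
        if newest < cutoff:
            stale.append(job_dir)
    return sorted(stale)


if __name__ == "__main__":
    # Hypothetical invocation: list jobs with no build in roughly 3 months.
    for job in find_stale_jobs("/home/jenkins/.jenkins/jobs", max_age_days=90):
        print(job)
```

If every job ends up with a build discard policy, a report like this may only be needed as an occasional audit.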
I'd personally like to see this bootstrapped via ansible and that we also build a staging server to test out any major pipeline changes, jenkins upgrades, and so forth.
re: https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/2108#issuecomment-814417860 - added to the requirements section
@karianna Since you've been pushing for this and suggesting requirements, are you able to take ownership of this item and progress it?
I can own it, but I'll be pulling in folks who actually know how to ansible :-). Typical Engineering manager eh ;-)
https://github.com/jenkins-x/jx is something to explore
I also think that for security we should disallow jobs from running on the master node unless there's a clear and explicit need for it.
In addition to Shelley's requirements, I'm also going to add VPN security - this Jenkins should no longer be accessible on the public internet.
To be absolutely clear here - that statement is about non-HTTPS ports, right? (I agree and was thinking the same, however we also need incoming ports enabled for several of our machines to connect.)
After further investigation jenkins-x is not a great option as its pipeline implementation and syntax is not compatible with regular Jenkins, forcing us to have a massive rewrite (and forcing jenkins x on other vendors using our scripts).
By using Ansible we can be fairly hosting provider agnostic. But we do need a provider that hosts a VPN easily and has sufficient disk storage (3-4TB) at a decent price point.
We'll prototype:
- Ansible Playbook for Jenkins with a Jenkins LTS Docker Image as a base starting point.
- Deploying to Hetzner Cloud in a Docker Container
- Apply Networking, Firewalling and OpenVPN for security
As discussed elsewhere, while the full set of items described in the prototype comment above has now been put on pause, we should as a priority look at upgrading the OS underneath the jenkins server (ideally to 20.04) after taking a Hetzner snapshot to avoid any problems with the OS upgrade.
Eclipse WorkGroup finance approval acquired for the snapshot costs
Snapshot created, however deploying it to an alternate server to do an upgrade test would require additional costs.
Approval for costs to deploy the snapshot was given by the WG last Thursday, although it seems we can only deploy to a comparably large server. @karianna will contact the provider to see if we can deploy on a less expensive system.
Discovered yesterday that the backups have not been executing correctly. The absence of a config.xml under /jobs/build-scripts/jobs/jobs/jobs/jdk9u was causing this SEVERE error to occur, at which point it left a partial backup in place and did not continue with the rest of it:
2022-03-18 05:57:37.925+0000 [id=2409973]       SEVERE  o.j.h.p.t.ThinBackupPeriodicWork#backupNow: Cannot perform a backup. Please be sure jenkins/hudson has write privileges in the configured backup path '/mnt/backup-server/jenkins_backup'.
java.io.FileNotFoundException: Source '/home/jenkins/.jenkins/jobs/build-scripts/jobs/jobs/jobs/jdk9u/config.xml' does not exist
        at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:1074)
        at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:1038)
        at org.jvnet.hudson.plugins.thinbackup.backup.HudsonBackup.backupJobsDirectory(HudsonBackup.java:227)
        at org.jvnet.hudson.plugins.thinbackup.backup.HudsonBackup.backupJobsDirectory(HudsonBackup.java:228)
        at org.jvnet.hudson.plugins.thinbackup.backup.HudsonBackup.backupJobsDirectory(HudsonBackup.java:228)
        at org.jvnet.hudson.plugins.thinbackup.backup.HudsonBackup.backupJobs(HudsonBackup.java:209)
        at org.jvnet.hudson.plugins.thinbackup.backup.HudsonBackup.backup(HudsonBackup.java:168)
        at org.jvnet.hudson.plugins.thinbackup.ThinBackupPeriodicWork.backupNow(ThinBackupPeriodicWork.java:89)
        at org.jvnet.hudson.plugins.thinbackup.ThinBackupPeriodicWork.execute(ThinBackupPeriodicWork.java:66)
        at org.jvnet.hudson.plugins.thinbackup.hudson.model.AsyncPeriodicWork.lambda$doRun$0(AsyncPeriodicWork.java:53)
        at java.lang.Thread.run(Thread.java:748)
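Since a single missing `config.xml` was enough to abort the whole backup, a pre-backup check along these lines might catch the problem earlier. This is only a sketch: it assumes the folder layout implied by the path in the error above, where each item lives directly under a `jobs` directory and a folder keeps its children in its own nested `jobs` subdirectory. The function name `jobs_missing_config` is hypothetical.

```python
from pathlib import Path


def jobs_missing_config(jobs_container):
    """Recursively list job/folder directories that lack a config.xml.

    jobs_container is a 'jobs' directory. Every subdirectory in it is
    treated as a job or folder; folders are descended into via their own
    'jobs' subdirectory. Walking this way (rather than globbing for any
    directory called 'jobs') copes with a folder that is itself named
    'jobs', as in .../build-scripts/jobs/jobs/jobs/jdk9u.
    """
    missing = []
    for item in sorted(Path(jobs_container).iterdir()):
        if not item.is_dir():
            continue
        if not (item / "config.xml").is_file():
            missing.append(item)
        nested = item / "jobs"
        if nested.is_dir():
            missing.extend(jobs_missing_config(nested))
    return missing


if __name__ == "__main__":
    for path in jobs_missing_config("/home/jenkins/.jenkins/jobs"):
        print("missing config.xml:", path)
```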
While I have put that file in place and been able to take a successful backup (I've adjusted the backup plugin to back up to /home/jenkins/sxabackups as the CIFS-mounted filesystem is REALLY slow when backing up nearly 100k files), I'm currently seeing instability on the jenkins server (including this OOM kill). I /think/ it's related to the backups, therefore I have disabled the automatic backup schedule for now (it was H 5 * * *).
Mar 18 20:39:03 kernel: [2790458.741260] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Mar 18 20:39:03 kernel: [2790458.741327] [ 1057]  1000  1057  6096303  4920123 41738240        0             0 java
Mar 18 20:39:03 kernel: [2790458.741461] Out of memory: Kill process 1057 (java) score 575 or sacrifice child
Mar 18 20:39:03 kernel: [2790458.744656] Killed process 1057 (java) total-vm:24385212kB, anon-rss:19680580kB, file-rss:0kB, shmem-rss:0kB
Mar 18 20:39:04 kernel: [2790460.451126] oom_reaper: reaped process 1057 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Mar 18 20:39:04 jenkins: jenkins: fatal: client (pid 1057) killed by signal 9, exiting
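For anyone reading the OOM output above: the `total_vm` and `rss` columns in the kernel's task table are counted in pages, not kB. Assuming the usual 4 KiB page size on this machine (an assumption, not something shown in the log), the figures line up with the kB values on the "Killed process" line, as this small conversion sketch shows. The helper names are made up for illustration.

```python
PAGE_KB = 4  # assumption: 4 KiB pages, the common default on x86-64 Linux


def pages_to_kb(pages, page_kb=PAGE_KB):
    """Convert a page count from the OOM task table to kB."""
    return pages * page_kb


def pages_to_gib(pages, page_kb=PAGE_KB):
    """Convert a page count from the OOM task table to GiB."""
    return pages_to_kb(pages, page_kb) / (1024 * 1024)


# Values copied from the java (pid 1057) row of the log above.
total_vm_pages = 6096303
rss_pages = 4920123

print(pages_to_kb(total_vm_pages))          # 24385212, same as total-vm:24385212kB
print(round(pages_to_gib(rss_pages), 2))    # 18.77
```

So the java process was holding close to 19 GiB of resident memory when the kernel killed it.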
OOM kill again this evening. Given that it's an OOM condition it seems unlikely that a heap size increase will make a difference, but in the absence of any better ideas or options just now I've given the JVM another 2GB to play with. It's back again.
This morning Martijn and I successfully upgraded the jenkins server to Ubuntu 20.04, and it's now running jenkins on Temurin JDK 11.0.14.1+1 installed from our apt repositories.
In-use heap as shown internally by jenkins is showing far lower utilisation - down below 1GB at the time of writing compared to several times more than that which was in use previously, which may give us the option to reduce the -Xmx value
The following jobs were causing errors in the jenkins logs due to having zero length build.xml files - they have been backed up as zeroLengthDirectories.tar.gz in the sxabackups directory
root@jenkins-hetzner-ubuntu2004 /var/log/jenkins # ls -ld  `grep could.not.load jenkins.log | sed 's/^.*WARNING//g' | sort | uniq -c | awk '{print$NF}'`
drwxr-xr-x 4 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/build-scripts/jobs/jobs/jobs/jdk8u/jobs/jdk8u-aix-ppc64-openj9/builds/422
drwxr-xr-x 5 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/build-scripts/jobs/jobs/jobs/jdk8u/jobs/jdk8u-linux-ppc64le-openj9/builds/343
drwxr-xr-x 4 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/build-scripts/jobs/jobs/jobs/jdk/jobs/jdk-alpine-linux-x64-hotspot/builds/46
drwxr-xr-x 4 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/build-scripts/jobs/jobs/jobs/jdk/jobs/jdk-linux-aarch64-hotspot/builds/351
drwxr-xr-x 4 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/build-scripts/jobs/jobs/jobs/jdk/jobs/jdk-linux-s390x-openj9/builds/212
drwxr-xr-x 4 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/build-scripts/jobs/jobs/jobs/jdk/jobs/jdk-linux-x64-hotspot/builds/42
drwxr-xr-x 4 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/build-scripts/jobs/jobs/jobs/jdk/jobs/jdk-linux-x64-openj9/builds/212
drwxr-xr-x 4 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/build-scripts/jobs/jobs/jobs/jdk/jobs/jdk-windows-x64-hotspot/builds/368
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk16_j9_extended.functional_x86-64_linux/builds/3
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk16_j9_extended.system_ppc64_aix/builds/2
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk16_j9_extended.system_ppc64le_linux/builds/2
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk16_j9_sanity.functional_ppc64le_linux/builds/4
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk16_j9_sanity.functional_x86-64_mac/builds/2
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk16_j9_sanity.functional_x86-64_windows/builds/3
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk16_j9_sanity.functional_x86-64_windows_xl/builds/27
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk16_j9_sanity.functional_x86-64_windows_xl/builds/28
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk16_j9_sanity.system_ppc64le_linux/builds/4
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk16_j9_sanity.system_x86-64_mac/builds/2
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk17_hs_sanity.functional_x86-32_windows/builds/6
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk17_hs_sanity.system_x86-32_windows/builds/6
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk17_j9_extended.system_x86-64_windows_xl/builds/6
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk8_hs_extended.system_s390x_linux/builds/17
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk8_hs_extended.system_x86-64_mac/builds/83
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk8_hs_sanity.external_x86-64_linux/builds/1
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk8_hs_sanity.external_x86-64_linux_thorntail-mp-tck/builds/1
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk8_hs_special.functional_x86-64_linux/builds/242
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk8_j9_extended.openjdk_x86-64_linux/builds/1
drwxr-xr-x 3 jenkins jenkins 4096 Mar 30 13:40 /home/jenkins/.jenkins/jobs/Test_openjdk8_j9_sanity.system_ppc64le_linux/builds/2
root@jenkins-hetzner-ubuntu2004 /var/log/jenkins #
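The list above was derived from the warnings in jenkins.log. As a complementary check that does not depend on the log, something like the following sketch could scan the jobs tree directly for empty `build.xml` files. The function name `zero_length_build_dirs` is hypothetical, and the root path in the usage line is taken from the listing above.

```python
from pathlib import Path


def zero_length_build_dirs(jobs_root):
    """Return build directories whose build.xml exists but is empty."""
    return sorted(
        p.parent
        for p in Path(jobs_root).rglob("build.xml")
        if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    for build_dir in zero_length_build_dirs("/home/jenkins/.jenkins/jobs"):
        print(build_dir)
```

On a tree with this many files a full scan is likely to be slow, so it is probably best run outside busy periods.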
Note for something to keep an eye on: there was a spike in heap usage at around 16:56 yesterday. There didn't seem to be anything too unexpected in the log around that time:
2022-03-30 16:54:00.621+0000 [id=22812] INFO    o.j.p.workflow.job.WorkflowRun#finish: Grinder #4212 completed: SUCCESS
2022-03-30 16:54:00.622+0000 [id=22812] INFO    h.p.t.l.JenkinsRunListener#onCompleted: onCompleted: Grinder #4212
2022-03-30 16:54:20.289+0000 [id=21908] INFO    hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel build-macstadium-macos11-arm64-2.
java.util.concurrent.TimeoutException: Ping started at 1648659020288 hasn't completed by 1648659260288
        at hudson.remoting.PingThread.ping(PingThread.java:134)
        at hudson.remoting.PingThread.run(PingThread.java:90)
2022-03-30 16:54:36.812+0000 [id=23047] OFF     hudson.model.AsyncPeriodicWork#lambda$doRun$0: Started OpenStack slave cleanup
2022-03-30 16:54:39.832+0000 [id=23047] OFF     hudson.model.AsyncPeriodicWork#lambda$doRun$0: Finished OpenStack slave cleanup. 3,019 ms
2022-03-30 16:55:07.311+0000 [id=22249] INFO    o.j.p.ghprb.GhprbRootAction#handleAction: Checking PR #3,476 for adoptium/aqa-tests
2022-03-30 16:56:32.744+0000 [id=65]    INFO    o.j.p.P.u.DropCachePeriodicWork#doRun: begin schedule clean...
2022-03-30 16:56:32.744+0000 [id=65]    INFO    o.j.p.P.u.DropCachePeriodicWork#doRun: end schedule clean...
2022-03-30 16:56:33.330+0000 [id=22805] INFO    h.p.t.l.JenkinsRunListener#onCompleted: onCompleted: git-mirrors/adoptium/git-skara-jdk #7391
2022-03-30 16:56:33.377+0000 [id=22805] INFO    j.p.s.l.SlackNotificationsLogger#info: [git-mirrors » adoptium » git-skara-jdk #7391] found #7390 as previous completed, non-aborted build
2022-03-30 16:56:35.602+0000 [id=23068] INFO    hudson.model.AsyncPeriodicWork#lambda$doRun$0: Started Periodic background build discarder
2022-03-30 16:56:39.800+0000 [id=23068] INFO    hudson.model.AsyncPeriodicWork#lambda$doRun$0: Finished Periodic background build discarder. 4,197 ms
2022-03-30 16:57:21.698+0000 [id=23121] INFO    c.s.o.i.Platform$JdkWithJettyBootPlatform#getSelectedProtocol: ALPN callback dropped: SPDY and HTTP/2 are disabled. Is alpn-boot on the boot class path?
2022-03-30 16:57:46.870+0000 [id=23139] INFO    hudson.model.AsyncAperiodicWork#lambda$doAperiodicRun$0: Started Update IdP Metadata from URL PeriodicWork
2022-03-30 16:57:46.871+0000 [id=23139] INFO    hudson.model.AsyncAperiodicWork#lambda$doAperiodicRun$0: Finished Update IdP Metadata from URL PeriodicWork. 1 ms
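The two numbers in the `TimeoutException` above are epoch timestamps in milliseconds. A quick conversion sketch (the helper name `epoch_ms_to_utc` is made up for illustration) shows that the ping to build-macstadium-macos11-arm64-2 started at 16:50:20 UTC and was given up on 240 seconds later, which matches the 16:54:20 log entry and places the start of that ping a few minutes before the heap spike.

```python
from datetime import datetime, timedelta, timezone


def epoch_ms_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    # Integer arithmetic avoids floating-point rounding of the millisecond part.
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


# Values copied from the TimeoutException message in the log above.
ping_started_ms = 1648659020288
ping_deadline_ms = 1648659260288

print(epoch_ms_to_utc(ping_started_ms).isoformat())   # 2022-03-30T16:50:20.288000+00:00
print(epoch_ms_to_utc(ping_deadline_ms).isoformat())  # 2022-03-30T16:54:20.288000+00:00
print((ping_deadline_ms - ping_started_ms) / 1000)    # 240.0 seconds
```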

Updated to LTS 2.332.3 and updating some of the plugins too.
Usage over the last 24 hours has been similar to the graph posted a couple of comments ago. System is now running on Ubuntu 20.04 with the latest LTS and all plugins that were being reported as security concerns have been updated.
We will continue to look at improving the reliability of the server and keeping it up to date, but the initial concerns relating to this issue have now been resolved. Backups were validated prior to the OS upgrade.
Remaining plugins upgraded last week.
I've brought up a new agent (it's running on the current AWX server as it had lots of capacity) to offload jobs which were running on the master node to try and stop it running things on the server.
There have been some side effects of this that will need some attention:
Follow-on task to clear up the number of credentials in use on the jenkins server. Being tracked in this (restricted access) document: https://docs.google.com/spreadsheets/d/1TwslGkCcfYsJjZeWTJNR3G2_SywgqBgnwe54Zm5gkA0/edit#gid=0
Today's plugin updates:
pipeline-stage-tags-metadata.jpi
email-ext.jpi
azure-vm-agents.jpi
pipeline-model-api.jpi
pipeline-model-extensions.jpi
github-oauth.jpi
pipeline-model-definition.jpi
ssh-slaves.jpi
Parameterized-Remote-Trigger.jpi
azure-vm-agents
Parameterized-Remote-Trigger
pipeline-model-extensions
github-oauth
pipeline-model-api
pipeline-model-definition
ssh-slaves
email-ext
pipeline-stage-tags-metadata
job-dsl.jpi
job-dsl
We should consider the use of https://plugins.jenkins.io/job-restrictions (although it's up for adoption) as a way to restrict access for certain jobs further.
Ref: https://www.jenkins.io/blog/2022/06/28/require-java-11/
At the time of writing this comment we are on Jenkins 2.332.3
Jenkins 2.357 (June 28, 2022) and the next LTS in September will require Java 11. We are already running under java 11 after a previous upgrade.
Jenkins 2.355 (June 14, 2022) and Jenkins 2.346.1 LTS (released on June 22, 2022) also now formally support being run on java 17, so we should consider moving ours up to java 17 either at the same time as doing that upgrade or shortly afterwards, and look at how much that impacts CPU/RAM usage.
Plugins have already been prepared in JENKINS-68446 and the recommendation is to use the Plugin Manager to upgrade all plugins before and after upgrading to Jenkins 2.357.
Adding this for reference - memory usage chart of jenkins before upgrading to latest LTS today (Currently on 2.332.3 on Temurin-11.0.15+10)

NOTE: Jenkins pipeline issue that required -Dhudson.plugins.git.GitSCM.ALLOW_LOCAL_CHECKOUT=true is covered in https://github.com/adoptium/ci-jenkins-pipelines/issues/313
Upgrading today to latest LTS - 2.346.1
This version supports running on JDK17 so we have edited the configuration to run with Temurin-17.0.3+7