New Machine requirement: Linux/x64 equinix dockerhost replacement
I need to request a new machine:
- New machine operating system (e.g. linux/windows/macos/solaris/aix): Linux
- New machine architecture (e.g. x64/aarch32/arm32/ppc64/ppc64le/sparc): x64
- Provider (leave blank if it does not matter): Skytap
- Desired usage: Replacement for the two dockerhost x64 systems currently hosted on Equinix
- Any unusual specification/setup required: docker for running dockerhost containers and build pipelines
- How many of them are required: 1 (for now)
Please explain what this machine is needed for: Replacement for Equinix systems which we have to decommission as per https://github.com/adoptium/infrastructure/issues/3292
System provisioned at Skytap with 24 cores, 64Gb RAM, and a 256Gb filesystem on /var/lib/docker. Its IP is 20.61.136.254 and it calls itself dockerhost-skytap-ubuntu2204-x64-1. I'm not clear yet whether it will accept inbound connections on high numbered ports, so if that's not fixable we'll have to make it call into the Jenkins server over JNLP for any containers we have on there.
I'm not clear yet whether it will accept inbound connections on high numbered ports
Not a problem - they're not restricted by default.
I've connected a container running Fedora 39 to Jenkins for experimental purposes and am running an AQA run at https://ci.adoptium.net/job/AQA_Test_Pipeline/206 🤞🏻
This container is not intended to be retained after this test, so it does not have the ci.role.test label on it
Host machine has been tested with docker builds at https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk/job/jdk-linux-x64-temurin/471/console on dockerhost-skytap-ubuntu2204-x64-1 so I'll aim to get this activated properly for the weekend runs or on Monday, subject to there being no risk to any outstanding items in the release cycle.
Ran in about 13 minutes vs around 8 on the Equinix systems.
Skytap: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz (24 core)
Equinix: AMD EPYC 7401P 24-Core Processor or Intel(R) Xeon(R) Gold 6314U CPU @ 2.30GHz
I've connected a container running Fedora 39 to Jenkins for experimental purposes and am running an AQA run at https://ci.adoptium.net/job/AQA_Test_Pipeline/206 🤞🏻
- sanity.openjdk had 11 failures - re-running in Grinder#8673. Note that the previous sanity.openjdk run on the Fedora 35 container had 34 failures, although #151 passed on that machine
- extended.openjdk hit the 10-hour time limit - re-running with 30hrs in Grinder#8674
EDIT: extended grinder re-run stopped after 10 hours - trying at Grinder#8675
Others were ok.
I've installed temurin-8-jdk as a package so that JDK8 is the default on the machine. This appears to be required for the gradle version we use in the installer process. JDK21 is still available (installed via tarball) and is being used for the jenkins agent.
+ ./gradlew packageJdkAlpine checkJdkAlpine --parallel -PPRODUCT=temurin -PPRODUCT_VERSION=8 -PARCH=x86_64 -PGPG_KEY=****
Picked up _JAVA_OPTIONS: -Xmx4g
Starting a Gradle Daemon (subsequent builds will be faster)
FAILURE: Build failed with an exception.
* Where:
Settings file '/home/jenkins/workspace/adoptium-packages-linux-pipeline_new@2/settings.gradle'
* What went wrong:
Could not compile settings file '/home/jenkins/workspace/adoptium-packages-linux-pipeline_new@2/settings.gradle'.
> startup failed:
General error during conversion: Unsupported class file major version 65
java.lang.IllegalArgumentException: Unsupported class file major version 65
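For reference, class file major version 65 corresponds to JDK 21, which is why making JDK8 the default avoids this failure. A minimal sketch of the mapping follows; the function name is hypothetical and the formula only applies from Java 5 (major version 49) onwards.

```python
def java_release_for_class_major(major: int) -> int:
    """Map a class-file major version to the Java release that emits it.

    Valid for major >= 49 (Java 5), where release = major - 44.
    """
    if major < 49:
        raise ValueError(f"major version {major} predates Java 5")
    return major - 44


if __name__ == "__main__":
    # 65 is the version reported in the Gradle failure above.
    for major in (52, 61, 65):
        print(major, "-> Java", java_release_for_class_major(major))
```

So the Gradle run was picking up the JDK21 tarball rather than a JDK8 runtime.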
The three executors are running build jobs that can each take quite a bit of space in the Jenkins workspace, since the build volumes are mapped from the host. Installer generation can also use quite a bit of space in the host workspace. See https://github.com/adoptium/infrastructure/issues/3362
At present there are up to 6Gb (I think a full build of the latest release might take close to 10Gb) on various directories on the host file system.
256Gb filesystem on /var/lib/docker
I'm going to redo this file system with about 100Gb for /home/jenkins/workspace and the rest as /var/lib/docker. The current dockerhost-equinix-ubuntu2004-x64-1 machine has 62Gb in the jenkins workspace (That may need to be looked at as it's quite high) so 100Gb should be enough.
Noting that the Fedora 39 container is working as well as most of the other systems as per https://github.com/adoptium/aqa-tests/issues/5012#issuecomment-1916796930
Noting that https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk17u/job/jdk17u-alpine-linux-x64-temurin/393/ and the equivalent on other versions appear to insist on running on one of the Equinix dockerhosts at the moment, as they're looking for build&&alpine-linux&&x64&&dockerBuild - we'll need to think about that labelling convention ...
23:30:52 [NODE SHIFT] MOVING INTO DOCKER NODE MATCHING LABELNAME build&&alpine-linux&&x64&&dockerBuild...
[Pipeline] node
23:31:07 Still waiting to schedule task
23:31:07 ‘[dockerhost-equinix-ubuntu2204-x64-1](https://ci.adoptium.net/computer/dockerhost%2Dequinix%2Dubuntu2204%2Dx64%2D1/)’ is offline
23:59:27 Running on [dockerhost-equinix-ubuntu2204-x64-1](https://ci.adoptium.net/computer/dockerhost%2Dequinix%2Dubuntu2204%2Dx64%2D1/) in /home/jenkins/workspace/build-scripts/jobs/jdk17u/jdk17u-alpine-linux-x64-temurin
[Pipeline] {
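To illustrate why the job waits for the Equinix host: a node is only eligible when it carries every label in the expression. The sketch below handles only `&&` conjunctions (Jenkins label expressions also support other operators), and the label sets are hypothetical examples, not the real agent definitions.

```python
def node_matches(label_expr: str, node_labels: set) -> bool:
    """Return True if the node has every label in an '&&'-joined expression."""
    required = {part.strip() for part in label_expr.split("&&") if part.strip()}
    return required <= node_labels


if __name__ == "__main__":
    expr = "build&&alpine-linux&&x64&&dockerBuild"

    # Hypothetical label sets for illustration only.
    without_alpine = {"build", "x64", "dockerBuild", "linux"}
    with_alpine = without_alpine | {"alpine-linux"}

    print(node_matches(expr, without_alpine))  # False
    print(node_matches(expr, with_alpine))     # True
```

Under that assumption, a host lacking the alpine-linux label is never selected, however idle it is.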
Inventory PR for this system: https://github.com/adoptium/infrastructure/pull/3358 I've added it to Bastillion, @steelhead31 is managing Nagios installation prior to merging that PR
Nagios & Wazuh installed successfully.
Note: I've added alpine-linux to the labels on the machine for now until we look at alternate solutions in the issue mentioned above.
Added the following containers to the machine, with their test runs:
- https://ci.adoptium.net/computer/test%2Ddocker%2Dubuntu2204%2Dx64%2D4/ (https://ci.adoptium.net/job/AQA_Test_Pipeline/208/console)
- https://ci.adoptium.net/computer/test%2Ddocker%2Dubuntu2204%2Dx64%2D5/ (https://ci.adoptium.net/job/AQA_Test_Pipeline/210/console)
- https://ci.adoptium.net/computer/test%2Ddocker%2Ddebian12%2Dx64%2D1/ (https://ci.adoptium.net/job/AQA_Test_Pipeline/211/console)
Initial machine is in place and working. While we may wish to add additional containers onto this machine that can be done at a later date so I shall close this. Noting that #3378 covers setting up a second machine for the same purpose.
This machine was offline due to our monthly x64 credits at Skytap having expired. It has been changed from its original configuration to have 16GB RAM and six vCPUs and brought online again, but it still has a number of static docker containers defined.
The machine has been up for 2 days, 7h01 (My working assumption is that the rollover date for the credits is on the month boundary, but that may not be true) and it's currently showing this:
@Haroon-Khel I'm struggling to bring the machines back online - has the port information in the jenkins agent definitions become de-synchronised from what is on the host? e.g. https://ci.adoptium.net/computer/test%2Ddocker%2Dubi8%2Dx64%2D3/log which seems to be on a different port - is this expected?
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b67b6d5f2601 aqa_ubi8 "/usr/sbin/sshd -D" 5 weeks ago Up 2 days 0.0.0.0:32771->22/tcp, :::32771->22/tcp UBI8.32790
I've changed that particular agent definition to be on 32771 and it has come up ok, but it would be good to understand some of the others. I'd quite like to get at least one other container live on there (any more may cause a problem with the restricted number of CPU cores). Since I've fixed that one, https://ci.adoptium.net/computer/test%2Ddocker%2Dubuntu2204%2Dx64%2D4/log is an example of the failure.
Yeah, I'm seeing this in https://github.com/adoptium/infrastructure/issues/3486#issuecomment-2039497160 too. Not sure what caused docker to reassign ports. Looking into it
It's caused by the fact that we no longer specify a port (allowing docker to randomly assign one): https://github.com/adoptium/infrastructure/blob/b728c86a1b2fe798c29cae85f7b23e50ff9686fa/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/deploy_container/tasks/deploy.yml#L27
Then, when the dockerhost machine is restarted, docker randomly assigns a port again instead of giving the containers their previous port. TL;DR: a port needs to be specified on container startup instead of relying on docker to give a random one.
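A quick way to spot this kind of drift is to compare the port in each Jenkins agent definition with what `docker ps` currently reports. The sketch below parses the PORTS column shown earlier; the function names are hypothetical, and the agent port 32790 is an assumption inferred from the container name UBI8.32790.

```python
import re


def published_host_ports(ports_field: str, container_port: int = 22) -> set:
    """Extract host ports mapped to container_port/tcp from a docker ps PORTS field."""
    pattern = re.compile(r":(\d+)->" + str(container_port) + r"/tcp")
    return {int(m.group(1)) for m in pattern.finditer(ports_field)}


def agent_port_is_stale(agent_port: int, ports_field: str) -> bool:
    """True if the agent's configured port is not currently published by the container."""
    return agent_port not in published_host_ports(ports_field)


if __name__ == "__main__":
    ports = "0.0.0.0:32771->22/tcp, :::32771->22/tcp"
    print(published_host_ports(ports))        # {32771}
    # 32790 is assumed to be the port the agent definition originally used.
    print(agent_port_is_stale(32790, ports))  # True
    print(agent_port_is_stale(32771, ports))  # False
```

This only detects the mismatch; the durable fix remains pinning the port when the container is created.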
That's another thing that won't be a problem if we switch over to connecting the containers over JNLP ;-)
The containers are back online (https://ci.adoptium.net/computer/test-docker-ubuntu2004-x64-4/log refuses to come back up for other reasons). The problem should not reoccur with the existing containers. I need to change https://github.com/adoptium/infrastructure/blob/b728c86a1b2fe798c29cae85f7b23e50ff9686fa/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/deploy_container/tasks/deploy.yml#L27 to specify a port number to prevent this from happening in the future
Sounds good thanks - Jenkins logs should be clearer now after today's cleanups. Need to wait for Ludovic to come back to fix the RISC-V ones but that should be another load of warnings to disappear from Jenkins 👍
The containers are back online (https://ci.adoptium.net/computer/test-docker-ubuntu2004-x64-4/log refuses to come back up for other reasons).
Do you know what the reason is? It's "curious" to note that the port number is 32768, exactly 2^15
I'm going to close this now. Any future work can happen under other issues if required.
https://ci.adoptium.net/computer/test-docker-ubuntu2004-x64-4/log is back online. I recreated its container and now the Jenkins agent has no trouble connecting.