New Machine requirement: Linux/x64 equinix dockerhost replacement

Open sxa opened this issue 1 year ago • 20 comments

I need to request a new machine:

  • New machine operating system (e.g. linux/windows/macos/solaris/aix): Linux
  • New machine architecture (e.g. x64/aarch32/arm32/ppc64/ppc64le/sparc): x64
  • Provider (leave blank if it does not matter): Skytap
  • Desired usage: Replacement for the two dockerhost x64 systems currently hosted on Equinix
  • Any unusual specification/setup required: docker for running dockerhost containers and build pipelines
  • How many of them are required: 1 (for now)

Please explain what this machine is needed for: Replacement for Equinix systems which we have to decommission as per https://github.com/adoptium/infrastructure/issues/3292

sxa avatar Jan 23 '24 12:01 sxa

System provisioned at Skytap with 24 cores, 64Gb RAM, and a 256Gb filesystem on /var/lib/docker. Its IP is 20.61.136.254 and it calls itself dockerhost-skytap-ubuntu2204-x64-1. I'm not clear yet whether it will accept inbound connections on high-numbered ports, so if that's not fixable we'll have to make it call into the jenkins server over JNLP for any containers we have on there.

sxa avatar Jan 24 '24 11:01 sxa

I'm not clear yet whether it will accept inbound connections on high numbered ports

Not a problem - they're not restricted by default.

I've connected a container running Fedora 39 to jenkins for experimental purposes and am running an AQA run at https://ci.adoptium.net/job/AQA_Test_Pipeline/206 🤞🏻 This container is not intended to be retained after this test, so it does not have the ci.role.test label on it.

sxa avatar Jan 25 '24 12:01 sxa

Host machine has been tested with docker builds at https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk/job/jdk-linux-x64-temurin/471/console on dockerhost-skytap-ubuntu2204-x64-1, so I'll aim to get this activated properly for the weekend runs or on Monday, subject to there being no risk to any outstanding items in the release cycle. The build ran in about 13 minutes vs around 8 on the Equinix systems. CPUs for comparison:

  • Skytap: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz (24 core)
  • Equinix: AMD EPYC 7401P 24-Core Processor or Intel(R) Xeon(R) Gold 6314U CPU @ 2.30GHz

sxa avatar Jan 25 '24 18:01 sxa

I've connected a container for experimental purposes running Fedora 39 to jenkins and running an AQA run at https://ci.adoptium.net/job/AQA_Test_Pipeline/206 🤞🏻

EDIT: extended grinder re-run stopped after 10 hours - trying at Grinder#8675

Others were ok.

sxa avatar Jan 26 '24 18:01 sxa

I've installed temurin-8-jdk as a package so that JDK8 is the default on the machine. This appears to be required for the gradle version we use in the installer process. JDK21 is still available (installed via tarball) and is being used for the jenkins agent.

+ ./gradlew packageJdkAlpine checkJdkAlpine --parallel -PPRODUCT=temurin -PPRODUCT_VERSION=8 -PARCH=x86_64 -PGPG_KEY=****
Picked up _JAVA_OPTIONS: -Xmx4g
Starting a Gradle Daemon (subsequent builds will be faster)

FAILURE: Build failed with an exception.

* Where:
Settings file '/home/jenkins/workspace/adoptium-packages-linux-pipeline_new@2/settings.gradle'

* What went wrong:
Could not compile settings file '/home/jenkins/workspace/adoptium-packages-linux-pipeline_new@2/settings.gradle'.
> startup failed:
  General error during conversion: Unsupported class file major version 65
  
  java.lang.IllegalArgumentException: Unsupported class file major version 65
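
For reference, class file major version 65 corresponds to Java 21, which would be consistent with Gradle having picked up the tarball JDK21 rather than JDK8 before the package install. A minimal sketch of the mapping follows; the function name is purely illustrative and not part of any tooling used here.

```python
def java_release_for_class_major(major: int) -> int:
    """Map a class file major version to the Java release that introduced it.

    Major version 45 belongs to JDK 1.1, and each later release increments
    it by one, so 52 is Java 8 and 65 is Java 21.
    """
    if major < 45:
        raise ValueError(f"not a valid class file major version: {major}")
    return major - 44


if __name__ == "__main__":
    # The version reported in the Gradle failure above.
    print(java_release_for_class_major(65))
    # The version produced by a JDK8 compiler.
    print(java_release_for_class_major(52))
```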

sxa avatar Jan 29 '24 11:01 sxa

The three executors are running build jobs that can each take quite a bit of space on the jenkins workspace since the build volumes are mapped from the host. The installer generation jobs can also use quite a bit of space on the host workspace. See https://github.com/adoptium/infrastructure/issues/3362

At present there are up to 6Gb (I think a full build of the latest release might take close to 10Gb) on various directories on the host file system.

256Gb filesystem on /var/lib/docker

I'm going to redo this file system with about 100Gb for /home/jenkins/workspace and the rest as /var/lib/docker. The current dockerhost-equinix-ubuntu2004-x64-1 machine has 62Gb in the jenkins workspace (That may need to be looked at as it's quite high) so 100Gb should be enough.

sxa avatar Jan 30 '24 11:01 sxa

Noting that the Fedora 39 container is working as well as most of the other systems as per https://github.com/adoptium/aqa-tests/issues/5012#issuecomment-1916796930

sxa avatar Jan 30 '24 13:01 sxa

Noting that https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk17u/job/jdk17u-alpine-linux-x64-temurin/393/ and the equivalent on other versions appear to insist on running on one of the equinix dockerhosts at the moment, as they are looking for build&&alpine-linux&&x64&&dockerBuild - we'll need to think about that labelling convention ...

23:30:52  [NODE SHIFT] MOVING INTO DOCKER NODE MATCHING LABELNAME build&&alpine-linux&&x64&&dockerBuild...
[Pipeline] node
23:31:07  Still waiting to schedule task
23:31:07  ‘[dockerhost-equinix-ubuntu2204-x64-1](https://ci.adoptium.net/computer/dockerhost%2Dequinix%2Dubuntu2204%2Dx64%2D1/)’ is offline
23:59:27  Running on [dockerhost-equinix-ubuntu2204-x64-1](https://ci.adoptium.net/computer/dockerhost%2Dequinix%2Dubuntu2204%2Dx64%2D1/) in /home/jenkins/workspace/build-scripts/jobs/jdk17u/jdk17u-alpine-linux-x64-temurin
[Pipeline] {
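
To illustrate why the job only lands on nodes carrying every one of those labels, here is a rough sketch of how an `&&` label expression is satisfied. It only handles `&&` (real Jenkins label expressions support more operators), and the node label sets below are hypothetical, not the actual labels on any of these machines.

```python
def node_matches(expression: str, node_labels: set) -> bool:
    """Return True if the node carries every label in an '&&'-only expression."""
    required = {term.strip() for term in expression.split("&&") if term.strip()}
    return required <= node_labels


if __name__ == "__main__":
    expression = "build&&alpine-linux&&x64&&dockerBuild"

    # Hypothetical label sets for a dockerhost, with and without alpine-linux.
    without_alpine = {"build", "x64", "dockerBuild"}
    with_alpine = without_alpine | {"alpine-linux"}

    print(node_matches(expression, without_alpine))
    print(node_matches(expression, with_alpine))
```

A node missing any single label is never considered, which is why the job waits for a matching machine instead of using another dockerhost.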

sxa avatar Feb 01 '24 00:02 sxa

Inventory PR for this system: https://github.com/adoptium/infrastructure/pull/3358. I've added it to Bastillion, and @steelhead31 is managing the Nagios installation prior to merging that PR.

sxa avatar Feb 01 '24 12:02 sxa

Nagios & Wazuh installed successfully.

steelhead31 avatar Feb 01 '24 14:02 steelhead31

Note: I've added alpine-linux to the labels on the machine for now until we look at alternate solutions in the issue mentioned above.

sxa avatar Feb 01 '24 16:02 sxa

Added these containers to the machine, each with its AQA test pipeline run:

  • https://ci.adoptium.net/computer/test%2Ddocker%2Dubuntu2204%2Dx64%2D4/ (run: https://ci.adoptium.net/job/AQA_Test_Pipeline/208/console)
  • https://ci.adoptium.net/computer/test%2Ddocker%2Dubuntu2204%2Dx64%2D5/ (run: https://ci.adoptium.net/job/AQA_Test_Pipeline/210/console)
  • https://ci.adoptium.net/computer/test%2Ddocker%2Ddebian12%2Dx64%2D1/ (run: https://ci.adoptium.net/job/AQA_Test_Pipeline/211/console)

Haroon-Khel avatar Feb 01 '24 16:02 Haroon-Khel

Initial machine is in place and working. While we may wish to add additional containers onto this machine that can be done at a later date so I shall close this. Noting that #3378 covers setting up a second machine for the same purpose.

sxa avatar Feb 14 '24 12:02 sxa

This machine was offline due to our monthly x64 credits at Skytap having expired. It has been changed from its original configuration to have 16GB RAM and six vCPUs and brought online again, but it still has a number of static docker containers defined.

The machine has been up for 2 days, 7h01 (my working assumption is that the rollover date for the credits is on the month boundary, but that may not be true) and it's currently showing this: [image]

sxa avatar Apr 04 '24 16:04 sxa

@Haroon-Khel I'm struggling to bring the machines back online - has the port information in the jenkins agent definitions become de-synchronised from what is on the host? e.g. https://ci.adoptium.net/computer/test%2Ddocker%2Dubi8%2Dx64%2D3/log which seems to be on a different port - is this expected?

CONTAINER ID   IMAGE        COMMAND               CREATED       STATUS      PORTS                                             NAMES
b67b6d5f2601   aqa_ubi8     "/usr/sbin/sshd -D"   5 weeks ago   Up 2 days   0.0.0.0:32771->22/tcp, :::32771->22/tcp           UBI8.32790

I've changed that particular agent definition to be on 32771 and it has come up ok but would be good to understand some of the others. I'd quite like to get at least one other container live on there (any more may cause a problem with the restricted number of CPU cores). Since I've fixed that one, https://ci.adoptium.net/computer/test%2Ddocker%2Dubuntu2204%2Dx64%2D4/log is an example of the failure.
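
A quick way to spot this kind of mismatch is to pull the published host port out of the PORTS column and compare it with the port in the agent definition. This is only a sketch: the helper names are made up, and treating 32790 (from the container name UBI8.32790) as the previously configured port is an assumption.

```python
import re


def host_ports_for(ports_column: str, container_port: int = 22) -> set:
    """Extract the host ports mapped to a container TCP port from `docker ps` output."""
    pattern = re.compile(r":(\d+)->" + str(container_port) + r"/tcp")
    return {int(match.group(1)) for match in pattern.finditer(ports_column)}


def agent_port_is_stale(configured_port: int, ports_column: str) -> bool:
    """True when the agent's configured port is no longer published by the container."""
    return configured_port not in host_ports_for(ports_column)


if __name__ == "__main__":
    # PORTS column copied from the `docker ps` line above.
    ports = "0.0.0.0:32771->22/tcp, :::32771->22/tcp"

    print(host_ports_for(ports))
    # Assumed old agent port, inferred from the container name.
    print(agent_port_is_stale(32790, ports))
    # Port after updating the agent definition.
    print(agent_port_is_stale(32771, ports))
```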

sxa avatar Apr 04 '24 16:04 sxa

Yeah, I'm seeing this in https://github.com/adoptium/infrastructure/issues/3486#issuecomment-2039497160 too. Not sure what caused docker to reassign ports. Looking into it.

Haroon-Khel avatar Apr 05 '24 10:04 Haroon-Khel

It's caused by us no longer specifying a port, which lets docker assign one at random: https://github.com/adoptium/infrastructure/blob/b728c86a1b2fe798c29cae85f7b23e50ff9686fa/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/deploy_container/tasks/deploy.yml#L27

Then, when the dockerhost machine is restarted, docker randomly assigns a port again instead of giving the containers their previous port. TL;DR: a port needs to be specified on container startup instead of relying on docker to pick a random one.
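
Roughly, the difference is between publishing only the container port (docker picks the host port) and publishing an explicit host:container pair. The sketch below just builds the `docker run` argument list for each case; it is not the playbook's code, and the container name and helper function are hypothetical.

```python
from typing import List, Optional


def docker_run_args(image: str, name: str, host_port: Optional[int] = None,
                    container_port: int = 22) -> List[str]:
    """Build a `docker run` command that publishes the container's SSH port.

    With host_port=None only the container port is published, so docker
    chooses the host port itself. Passing host_port pins the mapping.
    """
    if host_port is None:
        publish = str(container_port)
    else:
        publish = f"{host_port}:{container_port}"
    return ["docker", "run", "-d", "--name", name, "-p", publish, image]


if __name__ == "__main__":
    # Host port left to docker, which is the situation described above.
    print(" ".join(docker_run_args("aqa_ubi8", "ubi8-example")))
    # Host port pinned so the jenkins agent definition stays valid.
    print(" ".join(docker_run_args("aqa_ubi8", "ubi8-example", host_port=32771)))
```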

Haroon-Khel avatar Apr 05 '24 11:04 Haroon-Khel

That's another thing that won't be a problem if we switch over to connecting the containers over JNLP ;-)

sxa avatar Apr 05 '24 13:04 sxa

The containers are back online (https://ci.adoptium.net/computer/test-docker-ubuntu2004-x64-4/log refuses to come back up for other reasons). The problem should not recur with the existing containers. I need to change https://github.com/adoptium/infrastructure/blob/b728c86a1b2fe798c29cae85f7b23e50ff9686fa/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/deploy_container/tasks/deploy.yml#L27 to specify a port number to prevent this from happening in the future.

Haroon-Khel avatar Apr 05 '24 14:04 Haroon-Khel

Sounds good thanks - Jenkins logs should be clearer now after today's cleanups. Need to wait for Ludovic to come back to fix the RISC-V ones but that should be another load of warnings to disappear from Jenkins 👍

sxa avatar Apr 05 '24 15:04 sxa

The containers are back online (https://ci.adoptium.net/computer/test-docker-ubuntu2004-x64-4/log refuses to come back up for other reasons).

Do you know what the reason is? It's "curious" to note that the port number is 32768, exactly 2^15

sxa avatar Apr 08 '24 12:04 sxa

I'm going to close this now. Any future work can happen under other issues if required.

sxa avatar Apr 24 '24 10:04 sxa

https://ci.adoptium.net/computer/test-docker-ubuntu2004-x64-4/log is back online. I recreated its container and now the jenkins agent has no trouble connecting.

Haroon-Khel avatar Apr 25 '24 11:04 Haroon-Khel