Infrastructure for Orka (2024 and beyond)

Open UlisesGascon opened this issue 1 year ago • 45 comments

I plan to work on it during the weekend, so I can provide a good overview on the next build meeting on Tuesday.

Current tasks on MacOS infra

  • [x] Cleaned up all the macos machines that are available via SSH
  • [x] Open discussion to apply upgrade in the Firewall: https://github.com/nodejs/build/issues/3642
  • [x] Review Orka state: https://github.com/nodejs/build/issues/3686#issuecomment-2067096613
  • [x] Plan the final allocation for resources in Orka (see: https://github.com/nodejs/build/issues/3686#issuecomment-2067120587)
    • [x] MacOS 13
    • [x] Nearform MacOS Machines (see: https://github.com/nodejs/build/issues/3638)
  • [x] Provide access for @ryanaslett to the infra-mac related resources: https://github.com/nodejs/build/issues/3658
  • [x] Prepare and deliver an onboarding session for @ryanaslett to the Orka infrastructure
  • [x] Add new VMs for MacOS 13 Intel (@ryanaslett)

Blocked until ARM nodes are provided

  • [x] Confirm org decision regarding new ARM nodes (discussion ongoing in the mailing list)
  • [x] Add new VMs for MacOS 13 ARM
  • [ ] Add new VMs for MacOS 11 ARM

UlisesGascon avatar Apr 19 '24 17:04 UlisesGascon

Current Orka state

updated on April 19, 2024

SSH port | Node: macpro-4        | Node: macpro-5       | Node: macpro-6
8822     | release-macos11-x64-1 | empty                | test-macos11-x64-1
8823     | empty                 | empty                | test-macos11-x64-2
8824     | empty                 | test-macos1015-x64-2 | test-macos1015-x64-1
8825     | empty                 | empty                | empty

UlisesGascon avatar Apr 19 '24 18:04 UlisesGascon

Next Orka state

updated on April 22, 2024

Intel Nodes

SSH port | Node: macpro-4        | Node: macpro-5        | Node: macpro-6
8822     | release-macos11-x64-1 | test-macos13-x64-2    | test-macos11-x64-1
8823     | test-macos13-x64-1    | release-macos13-x64-1 | test-macos11-x64-2
8824     | empty                 | test-macos1015-x64-2  | test-macos1015-x64-1
8825     | empty                 | empty                 | empty

ARM Nodes

We assume that ARM Nodes can handle only 2 VMs, not 4+ as the Intel nodes did in the past, due to license limitations. This needs to be confirmed with support AFAIK?

SSH port | Node: arm-1             | Node: arm-2             | Node: arm-3
8822     | test-macos11-arm64-1    | release-macos13-arm64-1 | empty
8823     | release-macos11-arm64-1 | test-macos13-arm64-1    | test-macos13-arm64-2
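
The ARM allocation can be sanity-checked against the per-node limit. This is only a minimal sketch: the limit of 2 VMs per ARM node is the assumption stated above, `over_capacity` is a hypothetical helper, and the node and VM names are copied from the ARM table.

```python
# Hypothetical helper: flag nodes whose planned VM count exceeds a per-node limit.
# The limit of 2 VMs per ARM node is the assumption stated above, not a confirmed value.
ARM_VM_LIMIT = 2

# Planned ARM allocation, copied from the table ("empty" slots omitted).
planned_arm = {
    "arm-1": ["test-macos11-arm64-1", "release-macos11-arm64-1"],
    "arm-2": ["release-macos13-arm64-1", "test-macos13-arm64-1"],
    "arm-3": ["test-macos13-arm64-2"],
}


def over_capacity(allocation, limit):
    """Return the names of nodes that host more VMs than `limit`."""
    return sorted(node for node, vms in allocation.items() if len(vms) > limit)


# An empty list means every node stays within the assumed limit.
print(over_capacity(planned_arm, ARM_VM_LIMIT))
```

Adding a third VM to any node in `planned_arm` would make that node show up in the result, which is a cheap way to catch an over-allocated plan before requesting the VMs.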

How are Nearform machines "relocated"?

  • release-nearform-macos11.0-arm64-1 -> release-orka-macos11-arm64-1
  • test-nearform-macos11.0-arm64-1 -> test-orka-macos11-arm64-1

UlisesGascon avatar Apr 19 '24 18:04 UlisesGascon

release-macos13-x64-2 release-macos13-arm64-2

I don't think it's necessary to have two identical release machines.

targos avatar Apr 22 '24 14:04 targos

test-nearform-macos11.0-arm64-1

Are these typos?

targos avatar Apr 22 '24 15:04 targos

Great feedback @targos! I updated the tables

I don't think it's necessary to have two identical release machines.

We have space for redundancy, but let's remove them for now.

Are these typos?

I made a better reference for the "relocated" machines

UlisesGascon avatar Apr 22 '24 15:04 UlisesGascon

release-macos13-x64-2 release-macos13-arm64-2

I don't think it's necessary to have two identical release machines.

Actually, I think we should have one x64 and two arm64 machines, because there are two jobs that run on macos-arm64 during a release (osx11-release-pkg and osx11-arm64-release-tar).

targos avatar May 02 '24 13:05 targos

Some questions/thoughts/suggestions:

  1. Requirements Question: Do we still need to support 10.15 and/or 11? From https://github.com/nodejs/node/blob/main/BUILDING.md#supported-platforms I see:

Node.js does not support a platform version if a vendor has expired support for it. In other words, Node.js does not support running on End-of-Life (EoL) platforms. This is true regardless of entries in the table below.

And the table lists MacOS 11>.

And that table may be outdated, as it seems that MacOS 11 was EOL as of November 2023?

  2. ARM support in Orka:

We assume that ARM Nodes can handle only 2 VMs, not 4+ as the Intel nodes did in the past, due to license limitations. This needs to be confirmed with support AFAIK?

https://orkadocs.macstadium.com/docs/apple-arm-based-support confirms this:

IMPORTANT

You can deploy up to 2 VMs per Apple silicon-based node.

  3. From what I can gather macOS infra seems to be brittle, with nodes often running into disk issues/maintenance issues.

https://github.com/nodejs/build/issues/3592 https://github.com/nodejs/build/issues/3685 (https://github.com/nodejs/build/issues?q=is%3Aissue+macos+is%3Aclosed+disk) etc.

My suggestion to avoid Jenkins worker decay is to lean into an ephemeral node strategy so that each build has a fresh Orka instance to run on.

We can do that with the following Jenkins plugin for Orka: https://plugins.jenkins.io/macstadium-orka/#plugin-content-ephemeral-agents

We would first need to set up a packer build process to create our VM images so that Orka would have a baseline image to create: https://orkadocs.macstadium.com/docs/packer

The packer process can leverage our existing ansible playbooks: https://developer.hashicorp.com/packer/integrations/hashicorp/ansible/latest/components/provisioner/ansible.

This strategy would require that we have an Orka3.0 cluster. Rather than trying to do an upgrade of the existing cluster, I propose that we ask macstadium to allow us to provision a new cluster with the resources we need in it (enough arm/intel backing nodes for our macos11/13 testing and release), get it built/provisioned and working, and then decommission/return all the existing macstadium/orka machines.

I believe this would end up with us using roughly the same amount of resources, so should be palatable for macstadium to support this transition.

ryanaslett avatar May 17 '24 14:05 ryanaslett

This strategy would require that we have an Orka3.0 cluster. Rather than trying to do an upgrade of the existing cluster, I propose that we ask macstadium to allow us to provision a new cluster with the resources we need in it (enough arm/intel backing nodes for our macos11/13 testing and release), get it built/provisioned and working, and then decommission/return all the existing macstadium/orka machines.

+1 from me if Macstadium will support that

mhdawson avatar May 21 '24 15:05 mhdawson

Quick update from our last call with MacStadium:

Next week we will have a new Orka cluster (v3) that includes 2 nodes (Intel and ARM):

  • Mac Studio - G1MC M1M/10/32/16/64GB/2TB/10G
  • Mac mini G4E - i7/3.2Ghz/6C/64G/1T/SSD/10G

Pending:

Dependencies

  • [x] https://github.com/nodejs/build/issues/3860
  • [x] Get a new Orka Cluster running Orka 3 (on going)

:white_check_mark: Setup Jenkins <-> Orka

  • [x] Add documentation in the secrets repo: https://github.com/nodejs-private/secrets/pull/342
  • [x] Create a namespace orka-test for test ci
  • [x] Create a namespace orka-release for release ci
  • [x] Create a service account for test ci sa-jenkins-test
  • [x] Create a service account for release ci sa-jenkins-release
  • [x] Generate tokens for Jenkins to connect to the release namespace
  • [x] Generate tokens for Jenkins to connect to the test namespace
  • [x] Setup the Orka Plugin in Jenkins for test CI: https://ci.nodejs.org/manage/cloud/Orka%20Cluster/
  • [x] Setup the Orka Plugin in Jenkins for release CI: https://ci-release.nodejs.org/manage/cloud/Orka%20Cluster/
  • [x] Add a VPN connection between Jenkins test CI and Orka: https://github.com/nodejs/build/issues/3883
  • [x] Add a VPN connection between Jenkins release CI and Orka: (Waiting until tuesday, see: https://github.com/nodejs/build/issues/3883#issuecomment-2322964741)
  • [x] Ensure the cloud connection in Jenkins test environment
  • [x] Ensure the cloud connection in Jenkins release environment

Current status: Completed.

:white_check_mark: Create Image templates

  • [x] Define images for each MacOS Version/arch and use (Test or release) using packer
    • [x] Team agreement on https://github.com/nodejs/build/issues/3876
    • [x] MacOS 13 (Ventura) test Arm (macos-13-arm-test.pkr.hcl): https://github.com/nodejs/build/pull/3882
    • [x] MacOS 13 (Ventura) test Intel (macos-13-intel-test.pkr.hcl): https://github.com/nodejs/build/pull/3882
    • [x] MacOS 13 (Ventura) release Arm (macos-13-arm-release.pkr.hcl): https://github.com/nodejs/build/pull/3893
    • [x] MacOS 13 (Ventura) release Intel (macos-13-intel-release.pkr.hcl): https://github.com/nodejs/build/pull/3893
    • [x] Update the secrets repo: https://github.com/nodejs-private/secrets/pull/348
  • [x] Add Packer to the Build project: https://github.com/nodejs/build/pull/3872
  • [x] Add GitHub Action to do the validation (NO auto-deployments): https://github.com/nodejs/build/pull/3872
  • [x] Add Packer secrets in to the secrets repo: https://github.com/nodejs-private/secrets/pull/343

Current status: Completed.

:white_check_mark: Trigger Ephemeral VMs from Jenkins

  • [x] Setup Ephemeral nodes from Jenkins on demand for the test CI and verify that they can build and test node.
  • [x] Setup Ephemeral nodes from Jenkins on demand for the release CI and verify that they can build, test and sign node.

Current status: Completed

Jobs and Agents Migration

  • [ ] Add MacOS13 to the existing ci test jobs (commit-test, nightly builds, v8, CITGM...) to check that the current setup can run all the pipelines
  • [ ] Add the MacOS13 to the ci release existing jobs to check that all the current setup can run all the pipelines
  • [ ] Configure Jenkins Plugin to be compatible with MacOS11 and MacOS10.15 -mmacosx-version-min (see: https://github.com/nodejs/build/issues/3876) in the test ci
  • [ ] Configure Jenkins Plugin to be compatible with MacOS11 and MacOS10.15 -mmacosx-version-min (see: https://github.com/nodejs/build/issues/3876) in the release ci
  • [ ] Remove labels from legacy jenkins agents and check that the jobs are working as expected in the test ci
  • [ ] Remove labels from legacy jenkins agents and check that the jobs are working as expected in the release ci

Current status: @UlisesGascon working on the setup.

Clean up

  • [ ] Decommission the Bare metal machines
    • [ ] Open a support ticket
    • [ ] https://github.com/nodejs/build/pull/3871
    • [ ] https://github.com/nodejs/build/pull/3868
  • [ ] Decommission Orka Cluster (Intel Nodes)
    • [ ] Open a support ticket
    • [ ] https://github.com/nodejs/build/pull/3869
    • [ ] https://github.com/nodejs/build/pull/3870
  • [ ] Remove MacOS references from the Ansible playbooks
    • [ ] https://github.com/nodejs/build/pull/3892

Other

  • [ ] Add good documentation in the Build repo about the current Orka/Jenkins setup
  • [ ] Support Marketing activities related (@UlisesGascon will provide more details)
  • [ ] Elaborate a separate plan for MacOS 14 (Sonoma) and MacOS 15 (Sequoia)
  • [ ] Improve SSH keys management (see: https://github.com/nodejs/build/pull/3896#issuecomment-2340546861)

Deadline: The idea is to try to achieve this transition in 30 days.

Important

We don't expect any downtime while doing the migration, as the new cluster will run in isolation while the current system stays in place. Once we are ready, we will transfer operations to the new cluster and then decommission the HW.

Challenges

  • [ ] Lack of support for MacOS 11 ARM in Orka: details
  • [ ] Cannot deploy more than 2 VMs on an ARM host.
  • [ ] Error: admission webhook "vimage.kb.io" denied the request: cannot delete image "macos13-intel-test-latest.img". The image is being used by one or more VMs: vm-ttdzh. Remove the VMs and try again
  • [ ] When Jenkins creates a cloud agent and that agent fails, it is not removed from the cluster
  • [ ] Evaluate how to use namespace
  • [ ] New VMs are much slower than expected (@UlisesGascon investigating)
  • [ ] Unify the HCL templates into a single one if possible (see: https://github.com/nodejs/build/pull/3906#issuecomment-2348817255)

UlisesGascon avatar Aug 09 '24 14:08 UlisesGascon

Based on the support ticket SERVICE-188003 and the documentation (https://orkadocs.macstadium.com/docs/compatibility-versions#macos-and-apple-hardware), we might not be able to migrate MacOS ARM VMs to Orka due to lack of support, so we might need to keep the Bare Metal machines.

[Screenshot from 2024-08-27 19-49-39]

This needs to be confirmed (cc: @ryanaslett )

UlisesGascon avatar Aug 27 '24 22:08 UlisesGascon

Current status

I will be on PTO from the 19th to the 25th. I made some changes to the templates to add the missing dependencies (https://github.com/nodejs/build/pull/3906).

So, @ryanaslett, in case you want to help with this during my time off:

  • [x] Check why the iojs+release-Ulises-test-orca is not passing. The current error (10:09:52 Makefile:1030: *** No xz command, cannot continue. Stop.) (details) is related to the PATH (I think), as xz is included on all the machines already since my last PR.
  • [ ] Try to modify the base image for release and test machines to include and use Xcode 15.2, based on the discussion with @targos. Documentation
  • [ ] Keep running and patching the other pipelines that we cloned with @mhdawson.

Probably the next errors in the CI will be related to the users; currently, we only have the admin user. Maybe we need to create a separate one like iojs in order to make the CI pipelines work.

UlisesGascon avatar Sep 16 '24 12:09 UlisesGascon

Check why the iojs+release-Ulises-test-orca is not passing. The current error (10:09:52 Makefile:1030: *** No xz command, cannot continue. Stop.) (details) is related to the PATH (I think), as xz is included on all the machines already since my last PR.

Started in on this.

The PATH variable is set on the existing macos machines via the script that launches the jenkins agent:

  • This template: https://github.com/nodejs/build/blob/main/ansible/roles/jenkins-worker/templates/start.j2#L10
  • Creates a script here: https://github.com/nodejs/build/blob/main/ansible/roles/jenkins-worker/tasks/main.yml#L179-L185
  • And this template: https://github.com/nodejs/build/blob/main/ansible/roles/jenkins-worker/templates/org.nodejs.osx.jenkins.plist
  • Gets put into /Library/LaunchDaemons: https://github.com/nodejs/build/blob/main/ansible/roles/jenkins-worker/vars/main.yml#L33-L37

I've added ARCH, DESTCPU, and PATH to the environment variables in the Orka Cluster Cloud Template configurations on the ci-release machine.
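
A quick way to confirm that a given PATH value actually exposes the tools the Makefile looks for is sketched below. This is an illustration, not what runs on the agents: `missing_tools` is a hypothetical helper, and only `xz` comes from the error above; the rest of the tool list is an assumption.

```python
import shutil

# Tools the build is assumed to need; only xz is named in the error above,
# the rest of this list is illustrative.
REQUIRED_TOOLS = ["xz", "make"]


def missing_tools(tools, path=None):
    """Return the tools that cannot be found on `path` (defaults to the current PATH)."""
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


# An empty search path finds nothing, which mimics an agent started without PATH set.
print(missing_tools(REQUIRED_TOOLS, path=""))

# With no explicit path, the lookup uses the PATH of the current process.
print(missing_tools(REQUIRED_TOOLS))
```

Passing the exact PATH string from the cloud template as `path` would show whether `xz` is reachable with that configuration, independent of what is installed on the image.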

The osx13-x64-release-tar job worked and signed the tarball, but failed to push the release to node-www, so I need to adjust that next.

ryanaslett avatar Oct 03 '24 17:10 ryanaslett

We need this config in the image: https://github.com/nodejs/build/blob/main/ansible/roles/release-builder/files/ssh_config

targos avatar Oct 03 '24 18:10 targos

We need this config in the image: https://github.com/nodejs/build/blob/main/ansible/roles/release-builder/files/ssh_config

node-www also has a ufw2 firewall and will not allow connections from ip addresses not on the allowlist.

richardlau avatar Oct 03 '24 19:10 richardlau

I've added the main orka address (199.7.167.98) to the ufw2 firewall on node-www. I've confirmed that this is the address that all ephemeral nodes will appear as to node-www.

ryanaslett avatar Oct 04 '24 13:10 ryanaslett

I've requested the new nodes from MacStadium to fill out the rest of our capacity, and got a response today that they are aiming to have the nodes installed by Wed, Oct 30th.

ryanaslett avatar Oct 28 '24 18:10 ryanaslett

Great to see the details and progress on this front.

One thought is that once everything has landed, it would be great to do a deep dive session for other build team members who are interested in learning a bit more about how it works.

mhdawson avatar Oct 29 '24 13:10 mhdawson

It's been a month now. Did we get the new nodes?

targos avatar Nov 28 '24 10:11 targos

I'm sorry to insist, but I don't know what else to do to move this forward :(

targos avatar Dec 09 '24 12:12 targos

@ryanaslett, @UlisesGascon any update on this?

mhdawson avatar Dec 10 '24 21:12 mhdawson

Hi, yes, we did get the nodes, but haven't fully transitioned testing over to using them, as there was still an open question about whether or not we had the right Xcode and OS versions.

Apologies that I didn't see your question earlier.

I've been wrapping up some other OpenJS project stuff for the end of the year, but I can refocus on this once more to make sure it's in a stable situation.

ryanaslett avatar Dec 10 '24 21:12 ryanaslett

Existing macOS machines (due to outdated clang/gcc versions) are blocking 4 pull-requests:

  • https://github.com/nodejs/node/pull/56218
  • https://github.com/nodejs/node/pull/56080
  • https://github.com/nodejs/node/pull/56063
  • https://github.com/nodejs/node/pull/54990

Should we start talking about lowering the support tier of macOS? The oldest PRs are from September 17 (almost 3 months ago).

anonrig avatar Dec 11 '24 01:12 anonrig

Some interesting news, coming from https://github.com/nodejs/node-v8/issues/295 and a Slack chat with @joyeecheung:

  • V8 is currently built with Xcode 16.1 (macOS SDK 15.1), with a compatibility target of macOS 11.0 (source: https://source.chromium.org/chromium/chromium/src/+/main:build/config/mac/mac_sdk.gni)
  • We have the same compatibility target
  • In the release CI, we have two different Xcode versions (source: https://ci-release.nodejs.org/job/iojs+release/10687)
    • osx13-x64-release-tar uses version 16.0 (clang-1600.0.26.3). Builds of Node.js with latest V8 are successful with it.
    • osx13-arm64-release-tar uses version 14.3 (clang-1403.0.22.14.1). Builds of Node.js with latest V8 fail with it.
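
The comparison between the two builders can be expressed as a small check on the reported clang identifier. This is a sketch under assumptions: the threshold of 1600 is chosen only because the list above reports success with clang-1600.0.26.3 and failure with clang-1403.0.22.14.1, and `parse_apple_clang` and `new_enough` are hypothetical helper names.

```python
import re

# Assumed threshold for illustration only: the thread reports success with
# clang-1600.0.26.3 and failure with clang-1403.0.22.14.1, so 1600 is used here.
MIN_APPLE_CLANG = (1600,)


def parse_apple_clang(build_id):
    """Turn an identifier such as 'clang-1600.0.26.3' into a tuple of ints."""
    match = re.fullmatch(r"clang-(\d+(?:\.\d+)*)", build_id)
    if match is None:
        raise ValueError(f"unrecognized clang identifier: {build_id!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def new_enough(build_id, minimum=MIN_APPLE_CLANG):
    """Compare the parsed version against `minimum` component by component."""
    return parse_apple_clang(build_id) >= minimum


# Job names and identifiers are the ones reported for the release CI above.
for job, build_id in [
    ("osx13-x64-release-tar", "clang-1600.0.26.3"),
    ("osx13-arm64-release-tar", "clang-1403.0.22.14.1"),
]:
    print(job, new_enough(build_id))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexically, where a two-digit component would sort before a one-digit one.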

That said, I suggest:

targos avatar Dec 21 '24 11:12 targos

Note that officially (according to https://developer.apple.com/download/applications/), Xcode 16.1 requires at least macOS 14.5 to run, and according to Wikipedia, Xcode 16.0 did too. So I don't know how the osx13-x64-release-tar job is able to run, but it may be risky not to upgrade macOS to a supported version.

targos avatar Dec 21 '24 12:12 targos

I'm away from my machine that has macOS 13 + Apple Clang 14 now, so I can't provide more details until after the holidays, but FWIW: when I tried to install the latest system update for 13, the only available update was upgrading to Sequoia, and nothing else showed up when I tried to look for the last compatible update of XCode or command line tools with App Store or Software Update/softwareupdate --list. If somehow it is possible to run macOS 13 with XCode 16 we would likely need to document how to install it, or contributors on macOS 13 may have a hard time getting it to build (or if it just doesn't work then we need to tell contributors to upgrade to Sequoia).

joyeecheung avatar Dec 21 '24 13:12 joyeecheung

This is how we manually install Xcode on the build machines: https://github.com/nodejs/build/blob/main/ansible/MANUAL_STEPS.md#full-xcode

targos avatar Dec 21 '24 13:12 targos

Also my 2 cents: V8 uses (almost) tip-of-tree clang, so that's currently clang 20, and they have been doing a lot of C++ modernization that lower versions of clang aren't very good at parsing. I did quite a bit of patching to make V8 build on macOS 13 and Clang 14 in https://github.com/joyeecheung/node/tree/fix-macos-13 and many of the fixes don't look very acceptable upstream because they basically just revert the modernization. If we are upgrading the build system, the least-friction route would probably be to just require Sequoia and XCode 16 to build, though we can keep targeting 11. The lower the macOS version we need to support, the harder it is to install higher versions of Apple Clang on it, and the C++ feature gap will keep widening as V8 uses ToT Clang.

joyeecheung avatar Dec 21 '24 13:12 joyeecheung

If somehow it is possible to run macOS 13 with XCode 16 we would likely need to document how to install it, or contributors on macOS 13 may have a hard time getting it to build (or if it just doesn't work then we need to tell contributors to upgrade to Sequoia)

For the new Orka machines, we are using Packer, and the instructions include some manual steps on how to install it that are replicable for local machines as well:
https://github.com/nodejs/build/tree/main/orka/templates#manual-steps-for-the-release-images.

We probably want to update the commands and ensure that we are using the correct version 👍

UlisesGascon avatar Dec 21 '24 14:12 UlisesGascon

Is there any update/progress on this issue?

anonrig avatar Jan 12 '25 02:01 anonrig

Let me ping @ryanaslett! AFAIK we were testing the new ephemeral instances and waiting for a HW upgrade in the new cluster so we can decommission the old VMs and move all the workloads for both CI environments, but I'm not sure whether this was completed.

UlisesGascon avatar Jan 20 '25 20:01 UlisesGascon