Hetzner Benchmarking Machine Replacements

Open ryanaslett opened this issue 1 year ago • 55 comments

The procurement process has completed, and I have created two EX44s (https://www.hetzner.com/dedicated-rootserver/ex44/) at Hetzner.

If all goes well we should be able to have these online and running benchmark tests.

I believe the next steps are:

  1. Get access to the secrets repo
  2. Configure the machines with ansible
  3. Get admin access to ci.nodejs.org to add them to jenkins
  4. Test them

ryanaslett avatar Mar 21 '24 00:03 ryanaslett

@mcollina is it ok for performance measurements to have a CPU with a mix of non-identical cores?

targos avatar Mar 21 '24 06:03 targos

It should be possible to schedule certain processes only on a subset of the cores with taskset.

This should be part of the testing phase.
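
A minimal sketch of that idea, for illustration only. It assumes the kernel exposes the hybrid core lists at `/sys/devices/cpu_core/cpus` (P-cores) and `/sys/devices/cpu_atom/cpus` (E-cores), which may not hold on every kernel; `cpu_list`, `run_pinned`, and `SYSFS_ROOT` are hypothetical names, not part of any existing script.

```shell
# Print the CPU list for one core class ("core" or "atom").
# Falls back to all online CPUs when the hybrid sysfs entry is missing.
# SYSFS_ROOT is a hypothetical override so the function can be tried
# against a fake directory tree.
cpu_list() {
  local class="$1"
  local sysfs_root="${SYSFS_ROOT:-/sys/devices}"
  local f="${sysfs_root}/cpu_${class}/cpus"
  if [ -r "$f" ]; then
    cat "$f"
  else
    cat "${sysfs_root}/system/cpu/online"
  fi
}

# Run a command restricted to the CPUs of the given core class.
run_pinned() {
  local class="$1"
  shift
  taskset -c "$(cpu_list "$class")" "$@"
}

# Example (not executed here):
#   run_pinned core ./node benchmark/run.js crypto
```

Which core class to pin to, and whether the remaining cores are enough for the benchmarks, would still need to be decided during testing.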

mcollina avatar Mar 21 '24 08:03 mcollina

These machines have been provisioned and added to Jenkins with the same labels/configs as the former nearform machines.

Next steps: Can somebody with permissions kick off a benchmarking job and some V8 builds to verify that everything is working as intended?

ryanaslett avatar Mar 26 '24 02:03 ryanaslett

I've marked the two Nearform machines offline in Jenkins and started a V8 build which is running on test-hetzner-ubuntu2204-x64-1: https://ci.nodejs.org/job/node-test-commit-v8-linux/5876/nodes=benchmark-ubuntu2204-intel-64,v8test=v8test/

richardlau avatar Mar 26 '24 02:03 richardlau

I've marked the two Nearform machines offline in Jenkins and started a V8 build which is running on test-hetzner-ubuntu2204-x64-1: https://ci.nodejs.org/job/node-test-commit-v8-linux/5876/nodes=benchmark-ubuntu2204-intel-64,v8test=v8test/

This has failed:

02:47:26 + DEPOT_TOOLS_DIR=/home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/_depot_tools
02:47:26 + PATH=/home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/_depot_tools:/home/iojs/build/workspace/node-test-commit-v8-linux/depot_tools:/home/iojs/venv/bin:/home/iojs/nghttp2/src:/home/iojs/wrk:/usr/lib/ccache:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin tools/dev/v8gen.py x64.release --no-goma
02:47:26 
02:47:26 Hint: You can raise verbosity (-vv) to see the output of failed commands.
02:47:26 
02:47:26 Traceback (most recent call last):
02:47:26   File "/home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/tools/dev/v8gen.py", line 309, in <module>
02:47:26     sys.exit(gen.main())
02:47:26   File "/home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/tools/dev/v8gen.py", line 303, in main
02:47:26     return self._options.func()
02:47:26   File "/home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/tools/dev/v8gen.py", line 162, in cmd_gen
02:47:26     self._call_cmd([
02:47:26   File "/home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/tools/dev/v8gen.py", line 211, in _call_cmd
02:47:26     output = subprocess.check_output(
02:47:26   File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
02:47:26     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
02:47:26   File "/usr/lib/python3.10/subprocess.py", line 526, in run
02:47:26     raise CalledProcessError(retcode, process.args,
02:47:26 subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/mb/mb.py', 'gen', '-f', 'infra/mb/mb_config.pyl', '-m', 'developer_default', '-b', 'x64.release', 'out.gn/x64.release']' returned non-zero exit status 1.
02:47:26 make: *** [Makefile:303: v8] Error 1

Logging into the machine and running the failing command with -vv (as suggested):

iojs@test-hetzner-ubuntu2204-x64-1:~/build/workspace/node-test-commit-v8-linux/deps/v8$ PATH=/home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/_depot_tools:/home/iojs/build/workspace/node-test-commit-v8-linux/depot_tools:/home/iojs/venv/bin:/home/iojs/nghttp2/src:/home/iojs/wrk:/usr/lib/ccache:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin tools/dev/v8gen.py x64.release --no-goma -vv
################################################################################
/usr/bin/python3 -u tools/mb/mb.py gen -f infra/mb/mb_config.pyl -m developer_default -b x64.release out.gn/x64.release

  Writing """\
  dcheck_always_on = false
  is_debug = false
  target_cpu = "x64"
  """ to /home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/out.gn/x64.release/args.gn.

  /home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/buildtools/linux64/gn gen out.gn/x64.release --check
    -> returned 1
  ERROR at //build/config/linux/pkg_config.gni:104:17: Script returned non-zero exit code.
      pkgresult = exec_script(pkg_config_script, args, "json")
                  ^----------
  Current dir: /home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/out.gn/x64.release/
  Command: python3 /home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/build/config/linux/pkg-config.py -s /home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/build/linux/debian_bullseye_amd64-sysroot -a x64 glib-2.0 gmodule-2.0 gobject-2.0 gthread-2.0
  Returned 1.
  stderr:

  Traceback (most recent call last):
    File "/home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/build/config/linux/pkg-config.py", line 247, in <module>
      sys.exit(main())
    File "/home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/build/config/linux/pkg-config.py", line 142, in main
      prefix = GetPkgConfigPrefixToStrip(options, args)
    File "/home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/build/config/linux/pkg-config.py", line 80, in GetPkgConfigPrefixToStrip
      prefix = subprocess.check_output([options.pkg_config,
    File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
      return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
    File "/usr/lib/python3.10/subprocess.py", line 503, in run
      with Popen(*popenargs, **kwargs) as process:
    File "/usr/lib/python3.10/subprocess.py", line 971, in __init__
      self._execute_child(args, executable, preexec_fn, close_fds,
    File "/usr/lib/python3.10/subprocess.py", line 1863, in _execute_child
      raise child_exception_type(errno_num, err_msg, err_filename)
  FileNotFoundError: [Errno 2] No such file or directory: 'pkg-config'

  See //build/config/linux/BUILD.gn:58:3: whence it was called.
    pkg_config("glib") {
    ^-------------------
  See //build/config/compiler/BUILD.gn:300:18: which caused the file to be included.
      configs += [ "//build/config/linux:compiler" ]
                   ^------------------------------
  GN gen failed: 1
Traceback (most recent call last):
  File "/home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/tools/dev/v8gen.py", line 309, in <module>
    sys.exit(gen.main())
  File "/home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/tools/dev/v8gen.py", line 303, in main
    return self._options.func()
  File "/home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/tools/dev/v8gen.py", line 162, in cmd_gen
    self._call_cmd([
  File "/home/iojs/build/workspace/node-test-commit-v8-linux/deps/v8/tools/dev/v8gen.py", line 211, in _call_cmd
    output = subprocess.check_output(
  File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/mb/mb.py', 'gen', '-f', 'infra/mb/mb_config.pyl', '-m', 'developer_default', '-b', 'x64.release', 'out.gn/x64.release']' returned non-zero exit status 1.
iojs@test-hetzner-ubuntu2204-x64-1:~/build/workspace/node-test-commit-v8-linux/deps/v8$
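
A failure like this could be caught before a build starts with a small preflight check. The sketch below is only an illustration; `missing_tools` is a hypothetical helper, and the tool names passed to it are examples rather than a complete list of what the V8 build needs.

```shell
# Print the names of any given commands that are not found on PATH,
# separated by spaces. Prints an empty line when nothing is missing.
missing_tools() {
  local tool missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  # Drop the leading space left by the accumulation above.
  echo "${missing# }"
}

# Example (not executed here):
#   missing_tools pkg-config python3
```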

richardlau avatar Mar 26 '24 03:03 richardlau

Looking at the other Nearform benchmarking machines, I see a considerable number of manually installed packages that are outside of ansible's setup.

There are at least 220 packages that have been manually installed.

There's also the likelihood that I was supposed to configure ansible differently to set up these machines, beyond what I understood.

I attempted to set 'is_benchmark' = true to add the benchmark role, as I see that pkg-config is installed that way, but some of the other packages in that role have been dropped from Ubuntu and the role hasn't been updated for six years, so it probably has some stale packages in it.

List of missing packages
apport-symptoms
aptitude
apt-transport-https
bind9-host
bison
bsdmainutils
build-essential
busybox-initramfs
bzip2
ca-certificates
clang
cloud-guest-utils
cloud-initramfs-copymods
cloud-initramfs-dyn-netconf
console-setup-linux
coreutils
cryptsetup-bin
debianutils
dh-python
distro-info-data
dmeventd
dmsetup
dnsmasq-base
dns-root-data
dnsutils
dpkg
eject
file
flex
ftp
gawk
geoip-database
gettext-base
git-man
gnutls-bin
gpgv
groff-base
grub-common
grub-legacy-ec2
grub-pc
ifenslave
ifupdown
initramfs-tools
initramfs-tools-bin
initramfs-tools-core
init-system-helpers
install-info
isc-dhcp-common
keyboard-configuration
klibc-utils
krb5-locales
language-selector-common
libacl1
libapparmor1
libatm1
libattr1
libaudit1
libaudit-common
libblkid1
libbsd0
libbz2-1.0
libc6
libcap2
libcap2-bin
libcap-dev
libc-ares-dev
libc-bin
libcunit1-dev
libcurl4
libdb5.3
libdevmapper1.02.1
libdevmapper-event1.02.1
libdumbnet1
liberror-perl
libestr0
libev-dev
libevent-dev
libexpat1
libfdisk1
libfribidi0
libgcrypt20
libglib2.0-data
libgmp10
libgnutls30
libgnutls-openssl27
libgpg-error0
libgpm2
libjansson-dev
libkeyutils1
libklibc
libkmod2
liblocale-gettext-perl
liblxc1
liblz4-1
liblzma5
liblzo2-2
libmagic1
libmnl0
libmount1
libmspack0
libncurses5
libncursesw5
libnetfilter-conntrack3
libnewt0.52
libp11-kit0
libpam0g
libpam-modules
libpam-modules-bin
libpam-runtime
libpcre3
libpolkit-agent-1-0
libpopt0
libpython3-stdlib
libreadline6
libsasl2-modules
libseccomp2
libselinux1
libsemanage-common
libsigsegv2
libslang2
libsmartcols1
libsqlite3-0
libss2
libssl-dev
libstdc++6
libsystemd0
libtasn1-6
libtext-charwidth-perl
libtext-iconv-perl
libtext-wrapi18n-perl
libtinfo5
libudev1
libusb-0.1-4
libustr-1.0-1
libutempter0
libuuid1
libwrap0
libx11-data
libxml2-dev
linux-base
linux-generic
linux-headers-generic
lsb-base
ltrace
lxcfs
makedev
mime-support
mlocate
ncurses-base
ncurses-term
ntfs-3g
openjdk-8-jre-headless
openssh-sftp-server
openssl
pastebinit
perl
perl-base
pkg-config
policykit-1
popularity-contest
powermgmt-base
python2
python3.7-distutils
python3-apport
python3-apt
python3-chardet
python3-commandnotfound
python3-dbus
python3-debian
python3-distupgrade
python3-gdbm
python3-gi
python3-minimal
python3-newt
python3-pkg-resources
python3-problem-report
python3-pycurl
python3-requests
python3-setuptools
python3-six
python3-software-properties
python3-systemd
python3-update-manager
python3-urllib3
python-apt-common
r-base
readline-common
rename
resolvconf
run-one
sed
sgml-base
shared-mime-info
snap-confine
snapd
squashfs-tools
tar
tasksel
tcpd
telnet
traceroute
ubuntu-cloudimage-keyring
ubuntu-core-launcher
ubuntu-minimal
ubuntu-standard
ucf
uidmap
unzip
util-linux
vim-common
vim-runtime
vlan
xauth
xdg-user-dirs
xkb-data
xml-core
zerofree
zlib1g
zlib1g-dev

Please take a look at the above list so we can decide whether we need to update ansible to include some of this setup.
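
For reference, a list like this can be produced by comparing package lists from the two machines. The sketch below assumes each input file holds one package name per line (for example, saved output of `apt-mark showmanual` on each host); `packages_missing` is a hypothetical helper name.

```shell
# Print packages that appear in the first list but not in the second.
# Both inputs are sorted and de-duplicated first because comm requires
# sorted input.
packages_missing() {
  local old_sorted new_sorted
  old_sorted="$(mktemp)"
  new_sorted="$(mktemp)"
  sort -u "$1" > "$old_sorted"
  sort -u "$2" > "$new_sorted"
  # -2 hides lines unique to the second file, -3 hides common lines.
  comm -23 "$old_sorted" "$new_sorted"
  rm -f "$old_sorted" "$new_sorted"
}

# Example (not executed here):
#   packages_missing nearform-packages.txt hetzner-packages.txt
```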

ryanaslett avatar Mar 26 '24 06:03 ryanaslett

I suspect it is the is_benchmarking variable that installed those packages, and that it has bit-rotted. For example, pkg-config (the missing thing in https://github.com/nodejs/build/issues/3657#issuecomment-2019300353) is listed at https://github.com/nodejs/build/blob/51ad7781625e54981319e2dabe28f97ac38699ea/ansible/roles/benchmarking/vars/main.yml#L10.

We should probably add pkg-config to https://github.com/nodejs/build/blob/51ad7781625e54981319e2dabe28f97ac38699ea/ansible/roles/build-test-v8/tasks/partials/ubuntu2204.yml#L9 as it appears needed to build V8.

I'm less familiar with what is needed to run the benchmarks.

richardlau avatar Mar 26 '24 13:03 richardlau

I attempted to set 'is_benchmark' = true to add the benchmark role, as I see that pkg-config is installed that way, but some of the other packages in that role have been dropped from Ubuntu and the role hasn't been updated for six years, so it probably has some stale packages in it.

Adding pkg-config this way has allowed the V8 CI to build and run tests.

  • Node.js main branch failed due to https://github.com/nodejs/node/issues/51308: https://ci.nodejs.org/job/node-test-commit-v8-linux/5878/nodes=benchmark-ubuntu2204-intel-64,v8test=v8test/consoleFull
  • Node.js 18 branch failed due to missing perf: https://ci.nodejs.org/job/node-test-commit-v8-linux/5879/nodes=benchmark-ubuntu2204-intel-64,v8test=v8test/console

Neither encountered the networking issues we had with the Nearform hosted benchmark machines (🎉).

For the missing perf, maybe we need to add the hetzner machines to: https://github.com/nodejs/build/blob/51ad7781625e54981319e2dabe28f97ac38699ea/ansible/playbooks/jenkins/worker/create.yml#L74-L77

richardlau avatar Mar 26 '24 14:03 richardlau

For the missing perf, maybe we need to add the Hetzner machines to:

That intel tag was what was being used to target the Intel machines donated by Nearform. We should definitely change that to target the Hetzner ones now.

Adding pkg-config this way

How shall we approach getting these machines into a stable usable state going forward? I can continue to adjust which packages are installed as part of the ansible setup, but I don't want to inadvertently step on or undo any work that anybody else is doing. (Though I also lack any background in what the jobs do/accomplish)

ryanaslett avatar Mar 26 '24 15:03 ryanaslett

Ran the benchmark job which fails - https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1498/

Looks like there may be a directory missing. That may have been created manually as these machines were set up a long time ago.

mhdawson avatar Mar 26 '24 17:03 mhdawson

@ryanaslett have you added the linux-perf role? That might be all that is needed to get the v8 jobs running as well as they were before on the machines. @richardlau is that your expectation?

mhdawson avatar Mar 26 '24 17:03 mhdawson

Manually creating the directory /w, owned by iojs and with group iojs, has let the benchmark run get further.
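
For the record, the manual step amounts to something like the following sketch. `make_workspace` is a hypothetical helper; it relies on `install -d` from coreutils, has to run as root for the real /w, and assumes the owner and group already exist. The directory mode is left at the `install` default since nothing specific is known to be required.

```shell
# Create a directory (and any missing parents) owned by the given
# user and group.
make_workspace() {
  local dir="$1" owner="$2" group="$3"
  install -d -o "$owner" -g "$group" "$dir"
}

# Example (as root, not executed here):
#   make_workspace /w iojs iojs
```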

mhdawson avatar Mar 26 '24 17:03 mhdawson

@ryanaslett if you are updating the ansible scripts, is there a section which is specific to the benchmark machines that we can add the creation of the /w directory owned by iojs with group iojs?

mhdawson avatar Mar 26 '24 17:03 mhdawson

Job to see if perf job runs ok after adding the /w directory - https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1499/

mhdawson avatar Mar 26 '24 17:03 mhdawson

@ryanaslett if you are updating the ansible scripts, is there a section which is specific to the benchmark machines that we can add the creation of the /w directory owned by iojs with group iojs?

That's sort of what I was asking in https://github.com/nodejs/build/issues/3657#issuecomment-2020757948 - mostly who should be doing this.

ryanaslett avatar Mar 26 '24 18:03 ryanaslett

And also, whether we're trying to capture every change in ansible, or just doing some manual steps. (which we should still document)

ryanaslett avatar Mar 26 '24 18:03 ryanaslett

I went ahead and modified the jenkins worker config to target the Hetzner machines, and modified the benchmarking role to remove any packages that are not currently installed (mostly Python 2 packages that are no longer in Ubuntu).

There is still the question of "all the rest of the packages".

Seems like we can either

  1. keep retrying the builds and iteratively fix each issue as it arises until we have working builds
  2. re-install all the missing packages to have parity, even if it means we have excess/unnecessary bloat

ryanaslett avatar Mar 26 '24 18:03 ryanaslett

@ryanaslett I suspect that most of the missing packages will not be needed. Hoping @richardlau can confirm that for the V8 benchmarking part, and the jobs I'm kicking off should help show whether that is true for running the benchmarking job.

The last one I kicked off failed because I used the same parameters as the last run, but that PR has landed since then and therefore there were conflicts.

Kicked off this one to see if it passes on a fresh PR - https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1500/

mhdawson avatar Mar 26 '24 19:03 mhdawson

And also, whether we're trying to capture every change in ansible, or just doing some manual steps. (which we should still document)

We should capture everything in ansible.

The manual change I made to create the /w directory is just to see whether that actually resolves the problem. We should capture it in ansible for the benchmarking machines.

Next time hopefully we can just run the ansible script and everything will work afterwards.

mhdawson avatar Mar 26 '24 19:03 mhdawson

@ryanaslett not necessarily something that needs to be resolved immediately, but are there any Hetzner machines where the CPUs are all the same? It seems we'd have to not use either the 4 P-cores or the 8 E-cores for the benchmark tests. I guess I'd leave it up to @mcollina and @anonrig to comment on whether that number of cores will be OK or not.

mhdawson avatar Mar 26 '24 19:03 mhdawson

@ryanaslett not necessarily something that needs to be resolved immediately, but are there any Hetzner machines where the CPUs are all the same? It seems we'd have to not use either the 4 P-cores or the 8 E-cores for the benchmark tests. I guess I'd leave it up to @mcollina and @anonrig to comment on whether that number of cores will be OK or not.

My understanding was that it would be worked around using taskset, per: https://github.com/nodejs/build/issues/3657#issuecomment-2011665825

ryanaslett avatar Mar 27 '24 02:03 ryanaslett

are there any Hetzner machines where the CPUs are all the same?

I believe @mcollina had a solution for that in https://github.com/nodejs/build/issues/3657#issuecomment-2011665825 using taskset, but that's really the goal of this current exercise: to ensure that the servers we have will work as replacements for the servers we had before.

ryanaslett avatar Mar 27 '24 07:03 ryanaslett

I have wrangled both the benchmarking and linux-perf roles into a state where the ansible playbook at least completes for both machines.

Can somebody kick off another build to try it out? I'm still lacking jenkins admin.

ryanaslett avatar Mar 27 '24 08:03 ryanaslett

@ryanaslett @mhdawson we'd need to change the job so that benchmarks are run using taskset. I have no idea how to wire it into the benchmark jenkins jobs.

mcollina avatar Mar 27 '24 08:03 mcollina

From a V8 CI POV the CI is failing on the new machine, but I think those are known issues that were unfortunately semi-masked by https://github.com/nodejs/build/issues/3050 (which isn't occurring on the Hetzner machines 🎉):

  • https://github.com/nodejs/node/issues/50079
  • https://github.com/nodejs/node/issues/51308

cc @nodejs/v8-update

In other words, for the V8 CI, the new machines are in no worse state than the Nearform machines that are being replaced.

richardlau avatar Mar 27 '24 17:03 richardlau

I posted a script showing how to download the right version of the Linux source code to build perf, which will probably make the perf test failures go away. Hoping someone who knows how to convert it to ansible can pick up the rest: https://github.com/nodejs/node/issues/50079#issuecomment-2023362901

joyeecheung avatar Mar 27 '24 17:03 joyeecheung

@mcollina the script that runs the benchmarking jobs is in - https://github.com/nodejs/benchmarking/blob/master/experimental/benchmarks/community-benchmark/run.sh. The person who wrote it is long gone so I think somebody from @nodejs/performance is going to need to figure out how to inject taskset if we want to solve the problem that way. My point was that even if we do that it means only a fraction of the machine can be used for the performance run. I was not sure if that made sense or not.
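
One possible way to inject taskset into a runner script, sketched here under assumptions: CPU affinity set by taskset is inherited by child processes, so prefixing the top-level command may be enough in some setups. `BENCH_CPUS` and `pin_prefix` are hypothetical names and do not exist in run.sh today.

```shell
# Print a command prefix that pins to the CPUs listed in BENCH_CPUS
# (e.g. "0-3"). Prints nothing when BENCH_CPUS is unset or empty, so
# the runner behaves as before on machines that need no pinning.
pin_prefix() {
  if [ -n "${BENCH_CPUS:-}" ]; then
    echo "taskset -c ${BENCH_CPUS}"
  fi
}

# Example inside a runner script (not executed here); the prefix is
# left unquoted on purpose so it splits into separate words:
#   $(pin_prefix) ./node benchmark/compare.js "$@"
```

This pins the orchestrating process and the benchmark processes to the same set of CPUs, which may or may not be acceptable for the measurements.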

mhdawson avatar Mar 27 '24 17:03 mhdawson

from

Kicked off this one to see if it passes on a fresh PR - https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1500/

Seems like one last thing is missing from the machines:

17:43:37 "new","crypto/webcrypto-digest.js","n=100000 method='SHA-512' data=100 sync='subtle'",275586.54020345735,0.362862424
17:43:37 ++ cat output260324-202819.csv
17:43:37 ++ Rscript benchmark/compare.R
17:43:37 benchmarking/experimental/benchmarks/community-benchmark/run.sh: line 106: Rscript: command not found
17:43:38 Build step 'Execute shell' marked build as failure

mhdawson avatar Mar 27 '24 17:03 mhdawson

@ryanaslett it seems like the ansible scripts should be installing Rscript

https://github.com/nodejs/build/blob/11ee0afd99dd01881a2537e6cf4dd358b1989916/ansible/roles/benchmarking/tasks/main.yml#L36

- name: Install Rscript repo | {{ os }}
  when: os|startswith("ubuntu")
  shell: echo "deb https://ftp.heanet.ie/mirrors/cran.r-project.org/bin/linux/ubuntu {{ ansible_distribution_release }}-cran40/" > /etc/apt/sources.list.d/r.list

- name: Add R key
  apt_key:
    keyserver: keyserver.ubuntu.com
    id: E084DAB9

- name: Update keys
  shell: "apt-key update"

- name: Update packages
  include_role:
    name: package-upgrade

- name: Install Rscript packages
  package:
    name: "{{ package }}"
    state: present
  loop_control:
    loop_var: package
  with_items:

Did you run ansible with the benchmarking role?

mhdawson avatar Mar 27 '24 17:03 mhdawson

I think we need to change https://github.com/nodejs/node/blob/af48641993f5212f521a52e787e4ba7470551b37/benchmark/compare.js#L73-L75 and https://github.com/nodejs/node/blob/af48641993f5212f521a52e787e4ba7470551b37/benchmark/run.js#L43-L46 to allow spawning the benchmark executables with taskset to pin them to a CPU.

mcollina avatar Mar 27 '24 18:03 mcollina