x/build/env/darwin-arm64: run buildlets in a per-build VM

Open toothrot opened this issue 2 years ago • 5 comments

Currently, the darwin-arm64 buildlets do not run in a clean VM for each test run. Now that QEMU has progressed a bit, we should try running them in a VM again.

toothrot avatar Oct 13 '21 16:10 toothrot

The state of support for virtualization of darwin/arm64 is poor. There is no solution in the near future for hosting a Mac VM on ARM64 in our current environment (VMware on MacStadium) or in our qemu-hvm environments. This is due to a number of issues, including the fact that darwin/arm64 uses an iOS-like boot process instead of a typical PC boot process.

However, the AWS buildlets may work well for us. The workflow of creating a macOS instance on AWS is roughly:

  • Purchase 24-hr leases of "Dedicated Hosts", marked as auto-assignable
  • Create mac1.metal VMs that get auto-assigned to the host pool

This comes with some caveats. After an instance is stopped, the Dedicated Host is apparently re-imaged by AWS and sits in a "Pending" state for about an hour. This means an instance per buildlet is prohibitively expensive at our volume: an hour of downtime between builds would require a very large pool of Dedicated Hosts to handle our hundreds of darwin builds per day.

We could avoid some of the re-image downtime by re-using instances between builds. This makes it harder to keep environments pristine, but we may be able to automate some of that away with APFS filesystem snapshots and by retiring the instances on some interval.

Finally, for releaselets, we could ensure that we always use a fresh VM that hasn't been tainted by previous builds running on it.

This is less featureful than our always-fresh image approach that we have on MacStadium / VMware, but it would allow for a somewhat-clean environment for our darwin/ARM64 builders without too much overhead.

This approach could also be re-used for all of our mac builders. This would have the massive benefit of reducing the effort needed to build and maintain new macOS images for each release, which has now doubled with two processor architectures. Our existing approach for AWS AMIs using Packer should work nicely in this scenario.

toothrot avatar Dec 03 '21 22:12 toothrot

Documentation on automatic Dedicated Host provisioning and Releasing: https://docs.aws.amazon.com/license-manager/latest/userguide/host-resource-groups.html

toothrot avatar Dec 03 '21 22:12 toothrot

I'm starting to prototype this by testing out AWS Mac instances, first by creating some standard reverse builders.

prattmic avatar Sep 13 '22 19:09 prattmic

https://go.dev/cl/430696 added an AWS darwin-amd64 reverse builder (darwin-amd64-12-aws), which I've set up manually with a launchd service.

It actually almost "just works". All of the x/ repos seem to be passing. The main Go repo fails on two specific tests, os/signal.TestDetectNohup and TestNohup: https://build.golang.org/log/bee053b56d2f0d612725e4427f5640cdad5cad34

This failure is oddly specific: the system /usr/bin/nohup binary is unhappy. We've had this issue before in #5135 inside of tmux. This is also a common problem online, though frustratingly I have yet to find a concrete description of the problem, just workarounds.

One mention is that sshd must have PAM enabled. The AWS sshd config does indeed disable PAM, so that may be related. (Though my launchd service isn't run through ssh, so it isn't clear how that is related).

I'm going to investigate running QEMU on these instances, so I'll pause investigation into os/signal for now, since it may just not be an issue in QEMU guests.

prattmic avatar Sep 14 '22 18:09 prattmic

Change https://go.dev/cl/432115 mentions this issue: cmd/buildet: allow halt of macOS QEMU VMs

gopherbot avatar Sep 20 '22 18:09 gopherbot

Change https://go.dev/cl/432396 mentions this issue: dashboard: increase hast-darwin-amd64-12-aws count

gopherbot avatar Sep 21 '22 15:09 gopherbot

Change https://go.dev/cl/432395 mentions this issue: env/darwin: AWS darwin instances

gopherbot avatar Sep 21 '22 15:09 gopherbot

We now have three hosts running six reverse builder VMs fully set up and (almost) ready.

There is one failing test (https://build.golang.org/log/c40b5c45d0dc28318fd9ad0149efddfe39ff27d7) because of an extra deprecation warning printed by bash.

Once the builds are working, the next steps (short and long term) will be:

  • Create instances for older amd64 macOS releases.
  • Create instances for arm64 macOS releases (if desired).
  • Investigate putting instances behind NAT. Currently they have public IPs, but all inbound connections are blocked.
  • Investigate smarter scheduling like makemac. Right now we just create guests in a loop with a fixed guest OS version.

prattmic avatar Sep 21 '22 20:09 prattmic

Change https://go.dev/cl/432857 mentions this issue: dashboard: add all AWS darwin-amd64 builders

gopherbot avatar Sep 22 '22 17:09 gopherbot

Change https://go.dev/cl/432856 mentions this issue: env/darwin/aws: don't quote extra args

gopherbot avatar Sep 22 '22 17:09 gopherbot

Change https://go.dev/cl/432860 mentions this issue: env/darwin/aws: update docs

gopherbot avatar Sep 23 '22 21:09 gopherbot

Change https://go.dev/cl/432859 mentions this issue: dashboard: make darwin-amd64-aws race builder actually run race

gopherbot avatar Sep 23 '22 21:09 gopherbot

Change https://go.dev/cl/442255 mentions this issue: env/darwin/aws: switch to vmnet-shared networking

gopherbot avatar Oct 11 '22 16:10 gopherbot

Change https://go.dev/cl/448435 mentions this issue: dashboard: add darwin 13 (Ventura) amd64 builders on AWS

gopherbot avatar Nov 07 '22 18:11 gopherbot

Change https://go.dev/cl/449877 mentions this issue: cmd/runqemubuildlet: add darwin support

gopherbot avatar Nov 11 '22 19:11 gopherbot

Change https://go.dev/cl/449876 mentions this issue: cmd/runqemubuildlet: select windows support with a flag

gopherbot avatar Nov 11 '22 19:11 gopherbot

Change https://go.dev/cl/449875 mentions this issue: env/darwin/aws: assign static IPs to each guest

gopherbot avatar Nov 11 '22 19:11 gopherbot

Change https://go.dev/cl/453956 mentions this issue: dashboard,internal/releasetargets: run AMD64 Macs AWS, build 1.20 with 13

gopherbot avatar Nov 29 '22 18:11 gopherbot

Change https://go.dev/cl/456055 mentions this issue: cmd/runqemubuildlet: use sudo kill to signal on darwin

gopherbot avatar Dec 07 '22 20:12 gopherbot

Change https://go.dev/cl/456042 mentions this issue: cmd/runqemubuildlet: run as root on darwin

gopherbot avatar Dec 08 '22 19:12 gopherbot

Currently our AWS darwin-amd64 builder guests take ~4 minutes to boot. This is much slower than guests on MacStadium (I'm told those were closer to 10s). A 4-minute boot time is a significant drag on capacity. Subrepo tests are often much shorter than that, meaning we spend more time booting than we do running tests.

I dug into this a bit yesterday:

  • Disk performance does not seem to be a bottleneck. Moving the disk image from the root EBS volume we currently use to the internal Mac SSD had no impact.
  • QEMU's hvf (i.e., macOS) backend seems to have scalability problems w.r.t. number of guest CPUs. Boot times with different CPU counts (we currently use 6):
    • 1 CPU -> ~1m30s
    • 2 CPU -> ~1m15s
    • 4 CPU -> ~1m40s
    • 6 CPU -> ~4m
    • 8 CPU -> ~6m

A profile shows ~15% of all cycles spent in hvf_vcpu_exec -> qemu_mutex_lock_iothread / qemu_mutex_unlock_iothread, which sounds like lock contention to me. Indeed, this lock is pretty much held unconditionally for the duration of all VM exits: https://gitlab.com/qemu-project/qemu/-/blob/master/target/i386/hvf/hvf.c#L453. OTOH, the KVM backend avoids taking this lock for many (but not all) VM exit reasons.

I sent a mail to the QEMU mailing list about this, but I'm not sure it went through, as it doesn't appear on the mailing list archive.

Anyways, a quick improvement will be to switch to 4 CPUs, which hopefully is still enough to avoid test timeouts.

prattmic avatar Jan 11 '23 14:01 prattmic

I added tracing to QEMU, and these are the VM exit reason counts during boot:

25338097  hvf_vcpu_exit: exit reason 48 (EPT violation)
1465860   hvf_vcpu_exit: exit reason 7  (Interrupt window)
955636    hvf_vcpu_exit: exit reason 1  (External interrupt)
532542    hvf_vcpu_exit: exit reason 12 (HLT instruction)
80699     hvf_vcpu_exit: exit reason 30 (IO instruction)
3485      hvf_vcpu_exit: exit reason 10 (CPUID)
1597      hvf_vcpu_exit: exit reason 31 (RDMSR)
117       hvf_vcpu_exit: exit reason 28 (CR access)
69        hvf_vcpu_exit: exit reason 32 (WRMSR)
7         hvf_vcpu_exit: exit reason 55 (XSETBV)
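For a sense of proportion, the counts above can be turned into shares of all VM exits. This is only a sketch over the numbers quoted in this comment; the type and function names are made up for illustration.

```go
package main

import "fmt"

// exitCount pairs a VM exit reason with how often it was seen during boot.
type exitCount struct {
	reason string
	count  int64
}

// bootExits holds the counts copied from the trace output above.
var bootExits = []exitCount{
	{"EPT violation", 25338097},
	{"Interrupt window", 1465860},
	{"External interrupt", 955636},
	{"HLT instruction", 532542},
	{"IO instruction", 80699},
	{"CPUID", 3485},
	{"RDMSR", 1597},
	{"CR access", 117},
	{"WRMSR", 69},
	{"XSETBV", 7},
}

// total sums the counts of all exit reasons.
func total(exits []exitCount) int64 {
	var sum int64
	for _, e := range exits {
		sum += e.count
	}
	return sum
}

// share returns the fraction of all exits attributed to the named reason,
// or 0 if the reason is absent or there are no exits at all.
func share(exits []exitCount, reason string) float64 {
	t := total(exits)
	if t == 0 {
		return 0
	}
	for _, e := range exits {
		if e.reason == reason {
			return float64(e.count) / float64(t)
		}
	}
	return 0
}

func main() {
	for _, e := range bootExits {
		fmt.Printf("%-20s %6.2f%%\n", e.reason, 100*share(bootExits, e.reason))
	}
}
```

By this tally EPT violations account for roughly 89% of exits, and HLT for about 2%.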

The only one here that surprises me is HLT exits. It looks like XNU may use this to enter an idle state when a more complex power management subsystem is not (yet?) available: https://github.com/apple/darwin-xnu/blob/2ff845c2e033bd0ff64b5b6aa6063a1f8f65aa32/osfmk/i386/pmCPU.c#L176

prattmic avatar Jan 11 '23 22:01 prattmic

Change https://go.dev/cl/461775 mentions this issue: env/darwin/aws: reduce guest CPU count to 4

gopherbot avatar Jan 12 '23 15:01 gopherbot

For reference, buildlet wait times (get_buildlet, seconds):

Before AWS switch (2022-10-15 through 2022-11-29):

Builder             p10          p50            p90              p99
darwin-amd64-10_14  0.035087506  57.426122709   5409.986448872   27029.738440022
darwin-amd64-10_15  0.035315248  51.394408839   6330.141066737   33205.196756675
darwin-amd64-11_0   0.03502191   17.052950941   1265.639361792   18768.733989156
darwin-amd64-12_0   0.036329005  113.757215006  27884.264082743  60902.939780661
darwin-amd64-nocgo  0.036361308  152.538390709  33514.425559542  60922.292660581

After AWS switch (2022-11-30 through 2023-01-12):

Builder             p10          p50             p90               p99
darwin-amd64-10_14  0.030347397  129.170084819   16868.565147715   51138.059929517
darwin-amd64-10_15  0.032052233  165.560549426   21411.865571222   53629.770950752
darwin-amd64-11_0   0.03205239   221.267568342   26238.450715285   64911.760616257
darwin-amd64-12_0   0.03138795   175.994555331   25429.052805333   67438.227785737
darwin-amd64-13     0.038021979  4120.744229468  370565.129423636  429429.994737281
darwin-amd64-nocgo  0.031963242  294.516531008   36972.023496533   66837.479161523

Time running tests (make_and_test, seconds):

Before:

Builder             p10             p50             p90             p99
darwin-amd64-10_14  1109.337714283  1338.203048197  1438.635656949  1496.908947744
darwin-amd64-10_15  1142.797376231  1345.846128948  1442.504664356  1515.9424281
darwin-amd64-11_0   1321.768913047  1532.862195206  1646.776306831  1738.837314762
darwin-amd64-12_0   1224.030957425  1502.473005047  1706.57681256   1789.397041412
darwin-amd64-nocgo  832.869971342   1125.187291301  1314.866512538  1381.113436051

After:

Builder             p10             p50             p90             p99
darwin-amd64-10_14  1026.817217975  1941.302826298  2523.380160588  2841.733411426
darwin-amd64-10_15  1050.692744331  1811.730996635  2426.353868054  2724.949244972
darwin-amd64-11_0   2188.55871373   2517.004398094  2960.169886024  3044.606037487
darwin-amd64-12_0   1949.949176643  2486.049366062  2748.16159951   2832.066993053
darwin-amd64-13     2200.732157566  2933.495288619  3134.77419223   3351.202110408
darwin-amd64-nocgo  1463.312098452  1966.804602269  2125.788129161  2187.764026593

Edit: these are for the Go repo only, not subrepos.
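One way to read the make_and_test tables is as a per-builder slowdown factor at the median. The sketch below does that with the p50 values copied from the tables above and rounded to milliseconds; darwin-amd64-13 is left out because it has no "before" row. The type and function names are illustrative only.

```go
package main

import (
	"fmt"
	"sort"
)

// p50 holds a builder's median make_and_test time in seconds,
// before and after the AWS switch.
type p50 struct{ before, after float64 }

// slowdown returns how many times longer the median run takes after the switch.
func slowdown(m p50) float64 { return m.after / m.before }

func main() {
	medians := map[string]p50{
		"darwin-amd64-10_14": {1338.203, 1941.303},
		"darwin-amd64-10_15": {1345.846, 1811.731},
		"darwin-amd64-11_0":  {1532.862, 2517.004},
		"darwin-amd64-12_0":  {1502.473, 2486.049},
		"darwin-amd64-nocgo": {1125.187, 1966.805},
	}

	// Sort builder names so the output order is stable.
	names := make([]string, 0, len(medians))
	for name := range medians {
		names = append(names, name)
	}
	sort.Strings(names)

	for _, name := range names {
		fmt.Printf("%-20s %.2fx\n", name, slowdown(medians[name]))
	}
}
```

The factors land between roughly 1.3x and 1.75x, depending on the builder.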

prattmic avatar Jan 12 '23 17:01 prattmic

Should we close this issue and mark it as completed? Any additional problems we find can be addressed in more specific issues if necessary.

cagedmantis avatar Feb 15 '23 21:02 cagedmantis

Closing this since it doesn't seem like there's anything in particular left.

heschi avatar Feb 28 '23 19:02 heschi

Change https://go.dev/cl/484746 mentions this issue: Revert "env/darwin/aws: reduce guest CPU count to 4"

gopherbot avatar Apr 14 '23 20:04 gopherbot