
macOS upgrade causes some build breakages

Open meteorcloudy opened this issue 9 months ago • 49 comments

We recently updated macOS to 15.3.1 and Xcode to 16.2 for all macOS VMs on Bazel CI.

https://buildkite.com/bazel/rules-apple-darwin/builds/10138#01958280-cdc9-4e72-85bb-7399fcdcc056

(01:03:28) ERROR: /Users/buildkite/builds/bk-macos-arm64-nn86/bazel/rules-apple-darwin/test/starlark_tests/targets_under_test/ios/BUILD:129:16: AssetCatalogCompile test/starlark_tests/targets_under_test/ios/app-intermediates/bundle_library_ios.bundle/xcassets failed: (Exit 1): xctoolrunner failed: error executing AssetCatalogCompile command (from target //test/starlark_tests/targets_under_test/ios:app)
  (cd /private/var/tmp/_bazel_buildkite/3ebd711cd99f106e0bfcf0a4dddc286c/execroot/_main && \
  exec env - \
    APPLE_SDK_PLATFORM=iPhoneSimulator \
    APPLE_SDK_VERSION_OVERRIDE=18.2 \
    PATH=/Users/buildkite/Library/Caches/bazelisk/downloads/sha256/ac72ad67f7a8c6b18bf605d8602425182b417de4369715bf89146daf62f7ae48/bin:/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/homebrew/bin \
    XCODE_VERSION_OVERRIDE=16.2.0.16C5032a \
  bazel-out/darwin_arm64-opt-exec-ST-d57f47055a04/bin/tools/xctoolrunner/xctoolrunner actool --compile '[ABSOLUTE]bazel-out/ios_sim_arm64-fastbuild-ios-sim_arm64-min12.0-applebin_ios-ST-26d4f6b9029b/bin/test/starlark_tests/targets_under_test/ios/app-intermediates/bundle_library_ios.bundle/xcassets' --platform iphonesimulator --minimum-deployment-target 12.0 --compress-pngs --target-device iphone '[ABSOLUTE]test/starlark_tests/resources/assets.xcassets')
# Configuration: 8931e66e98dfc14da7a42f9369975f695461c8159360912aca94c6ca1a76bc52
# Execution platform: @@platforms//host:host
/Users/buildkite/builds/bk-macos-arm64-nn86/bazel/rules-apple-darwin/test/starlark_tests/resources/assets.xcassets: error: No simulator runtime version from [<DVTBuildVersion 21F79>, <DVTBuildVersion 22D8075>] available to use with iphonesimulator SDK version <DVTBuildVersion 22C146>

meteorcloudy avatar Mar 11 '25 14:03 meteorcloudy

@fweikert @brentleyjones

meteorcloudy avatar Mar 11 '25 14:03 meteorcloudy

The tests request Xcode 15.4, which is no longer installed on our macOS workers. That's why CI bumps the version ("Fixed Xcode version: 15.4 -> 16.2..."), which might be causing the problem. @brentleyjones is there something we need to do here, or does this need a fix in rules_apple?

fweikert avatar Mar 11 '25 16:03 fweikert

Even if using Xcode 16.2, we have the simulator runtime issue. @aaronsky has some experience with that and can give more context.

brentleyjones avatar Mar 11 '25 17:03 brentleyjones

For the period we're using Xcode 16.2 on the CI nodes, xcodebuild -downloadPlatform iOS will install the iOS 18.3.1 simulator runtime by default, rather than the originally distributed iOS 18.2. The rules expect by default that the SDK version (18.2) match the simulator runtime version (now 18.3, represented sometimes as 18.3.1). The best thing to do in the short term would be to add a line like this somewhere in the VM/image setup:

# ...
# somewhere after running `xcodebuild -runFirstLaunch` and `xcodebuild -downloadPlatform iOS`

xcode_short_version="$(xcodebuild -version | head -n1 | cut -d' ' -f2 | cut -d. -f1,2)"
if [ "$xcode_short_version" = "16.2" ]; then
    xcodebuild -downloadPlatform iOS -buildVersion 18.2
fi

# ...
# xcodebuild -checkFirstLaunchStatus

This was the best workaround I could find in my own environment to make Bazel, simulator_creator.py, and xcodebuild happy. I wouldn't recommend running this on every CI job, as -downloadPlatform is very slow and needs some time to finish mounting the simulator runtime disk image after it's been installed.
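
Since -downloadPlatform is slow, the setup script could first check whether the runtime is already present. A minimal sketch, assuming `xcrun simctl runtime list` prints lines in the "iOS 18.2 (22C150) - <UUID> (Ready)" form shown elsewhere in this thread; `has_runtime` and `ensure_ios_runtime` are hypothetical helper names, not part of any existing script:

```shell
# Succeeds if the `xcrun simctl runtime list` text on stdin contains a
# runtime line for the given platform ($1) and version ($2).
has_runtime() {
    awk -v p="$1" -v v="$2" '
        $1 == p && $2 == v { found = 1 }
        END { exit !found }'
}

# Downloads the iOS runtime for the given version only when it is missing.
# Intended for image setup, not for every CI job.
ensure_ios_runtime() {
    if xcrun simctl runtime list | has_runtime iOS "$1"; then
        echo "iOS $1 runtime already installed; skipping download"
    else
        xcodebuild -downloadPlatform iOS -buildVersion "$1"
    fi
}
```

During image setup this would be called as `ensure_ios_runtime 18.2`, after `xcodebuild -runFirstLaunch`.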

aaronsky avatar Mar 11 '25 20:03 aaronsky

@aaronsky Where is this code? Can we fix the code to not make the assumption?

meteorcloudy avatar Mar 12 '25 10:03 meteorcloudy

The code where this all breaks down is in these spots:

  • https://github.com/bazelbuild/rules_apple/blob/master/apple/testing/default_runner/ios_xctestrun_runner.bzl#L53
  • https://github.com/bazelbuild/rules_apple/blob/master/apple/testing/default_runner/simulator_creator.py#L85
  • (and to a lesser extent) https://github.com/bazelbuild/rules_apple/blob/master/apple/testing/default_runner/ios_test_runner.bzl#L45

When I examined this a couple weeks ago I couldn't figure out another sensible default that could be used to replace the assumption without requiring a new field or some new functionality in xcode_version. Making os_version required on the test runner would break building across different Xcode versions.

aaronsky avatar Mar 12 '25 10:03 aaronsky

I'm mostly worried that the next time we upgrade macOS this will happen again, is there a long term solution for this you have in mind?

meteorcloudy avatar Mar 12 '25 10:03 meteorcloudy

That's a reasonable concern, and no, I don't have a long-term plan in the event Apple drops a new runtime on us out of the blue again. I agree it needs a more robust solution beyond this workaround.

aaronsky avatar Mar 12 '25 10:03 aaronsky

We added

xcode_short_version="$(xcodebuild -version | head -n1 | cut -d' ' -f2 | cut -d. -f1,2)"
if [ "$xcode_short_version" = "16.2" ]; then
    xcodebuild -downloadPlatform iOS -buildVersion 18.2
fi

to our setup script and updated the VMs, can you please verify if it works now?

meteorcloudy avatar Mar 17 '25 14:03 meteorcloudy

Looks like rules_apple is still red: https://buildkite.com/bazel/rules-apple-darwin/builds/10157

@aaronsky Can you try to fix this from the rules_apple side? I don't know what else we could do on the infra side, and building and deploying a new VM image is not trivial.

meteorcloudy avatar Mar 19 '25 16:03 meteorcloudy

There still seems to be a mismatch related to the installed simulator runtimes, but this time with the visionOS runtime, which, as far as I'm aware, hasn't received an update recently. I don't recognize runtime 22N895, but 21O5565d is the SDK that shipped alongside Xcode 15.4, and 22N799 is the SDK in Xcode 16.2.
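
The build numbers in these errors can be pulled out mechanically, which makes it easier to compare them against what is installed. A minimal sketch, assuming the error text keeps the "No simulator runtime version from [...] available to use with ... SDK version <...>" shape seen in the logs above, where the SDK build is always the last `DVTBuildVersion` mentioned; the function names are hypothetical:

```shell
# Prints every DVTBuildVersion build number found on stdin, one per line,
# in the order they appear in the error message.
mismatch_builds() {
    grep -o 'DVTBuildVersion [0-9A-Za-z]*' | awk '{ print $2 }'
}

# The SDK build is the last one mentioned in the message.
sdk_build() {
    mismatch_builds | tail -n 1
}

# Everything before the last entry is a runtime build Xcode considered.
candidate_runtime_builds() {
    mismatch_builds | sed '$d'
}
```

Piping the failing action's stderr through `sdk_build` and `candidate_runtime_builds` gives the two sides of the mismatch without reading the full log line.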

aaronsky avatar Mar 20 '25 09:03 aaronsky

@aaronsky I believe this has to be fixed from rules_apple side, so I'm closing this now.

meteorcloudy avatar Mar 27 '25 12:03 meteorcloudy

@meteorcloudy while I work on this from the rules_apple side, there is at least one other thing I need done on the macOS image (if you wouldn't mind reopening this issue). It appears as though the xros1.2 simulator runtime is still installed on the image, and it's confusing Xcode. Can we please see about removing the xros1.2 runtime and keeping xros2.2?

Alternatively, we could use xcrun simctl runtime match set to forcefully map SDKs to simulator runtimes. But neither of these is something we can do from rules_apple on bazelci.
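
A rough sketch of what the mapping step might look like in an image setup script. The argument form is an assumption (an SDK name such as xros2.2 followed by the runtime build to use with it) and should be checked against `xcrun simctl help runtime` on the target Xcode before relying on it. The helper only prints the command so it can be reviewed first; `print_runtime_match_cmd` is a hypothetical name:

```shell
# Prints (does not run) the override command for review.
# ASSUMPTION: `match set` takes an SDK name (e.g. xros2.2) and then the
# build of the simulator runtime that SDK should use (e.g. 22N895).
print_runtime_match_cmd() {
    sdk_name="$1"
    runtime_build="$2"
    printf 'xcrun simctl runtime match set %s %s\n' "$sdk_name" "$runtime_build"
}
```

For the visionOS case in this thread, `print_runtime_match_cmd xros2.2 22N895` would print the command mapping the 2.2 SDK to the installed 2.3 runtime.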

aaronsky avatar Mar 29 '25 11:03 aaronsky

The shell_commands in https://github.com/bazelbuild/rules_apple/pull/2677 (sorry for misusing them) show that the installed/configured Xcode 16.2 is definitely confused about how it selects the underlying sim runtime. The command that ran:

APPLE_SDK_PLATFORM=XRSimulator APPLE_SDK_VERSION_OVERRIDE=2.2 xcrun actool --compile 'doc' --platform xrsimulator --minimum-deployment-target 1.0 --compress-pngs --target-device vision 'examples/resources/VisionAppIcon.xcassets'

The output (matching what rules_apple tests are showing):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>com.apple.actool.compilation-results</key>
	<dict>
		<key>output-files</key>
		<array/>
	</dict>
	<key>com.apple.actool.errors</key>
	<array>
		<dict>
			<key>description</key>
			<string>No simulator runtime version from [&lt;DVTBuildVersion 21O5565d&gt;, &lt;DVTBuildVersion 22N895&gt;] available to use with xrsimulator SDK version &lt;DVTBuildVersion 22N799&gt;</string>
		</dict>
	</array>
</dict>
</plist>

This should just use xrOS 2.3 (22N895). This command works for me locally with just 22N895 present in xcrun simctl runtime list.

I think this should be reopened and a simple actool command like that should be confirmed to succeed before it's considered closed.
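
Such a check could be scripted so it runs as part of image QA. A minimal sketch, assuming actool always reports failures under the `com.apple.actool.errors` key as in the output above; `actool_output_ok` and `check_visionos_actool` are hypothetical names, and the second function has to be run from a rules_apple checkout because it reuses the relative paths from the command above:

```shell
# Reads actool's plist output on stdin and succeeds only if it contains
# no com.apple.actool.errors entry.
actool_output_ok() {
    ! grep -q '<key>com.apple.actool.errors</key>'
}

# Runs the visionOS actool command from this thread and checks its output.
check_visionos_actool() {
    APPLE_SDK_PLATFORM=XRSimulator APPLE_SDK_VERSION_OVERRIDE=2.2 \
        xcrun actool --compile 'doc' --platform xrsimulator \
        --minimum-deployment-target 1.0 --compress-pngs \
        --target-device vision 'examples/resources/VisionAppIcon.xcassets' \
        | actool_output_ok
}
```

`check_visionos_actool` returning non-zero would mean the image still has the runtime mismatch.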

mattrobmattrob avatar Mar 29 '25 22:03 mattrobmattrob

Anything we can do to help get this fixed on the macOS runners? rules_apple is exploring fairly significant workarounds but correcting Xcode/the machines would be drastically better for the maintainers.

mattrobmattrob avatar Apr 02 '25 14:04 mattrobmattrob

@fweikert Can you take a look?

meteorcloudy avatar Apr 02 '25 14:04 meteorcloudy

I upgraded my VM to 15.4, and the command succeeds even though xrOS 1.2 is present:

== Disk Images ==
-- iOS --
iOS 18.3.1 (22D8075) - 819CE51A-FB18-412A-B149-654606CA3742 (Ready)
iOS 18.2 (22C150) - E22DA566-DE6E-440A-B6AA-C04F42E9A284 (Ready)
iOS 17.5 (21F79) - 2AED0805-8F1F-419F-9E02-5B38734BDD31 (Ready)
-- tvOS --
tvOS 18.2 (22K154) - A7AED462-E128-4A8C-9216-ABC2F4430ADF (Ready)
tvOS 17.5 (21L569) - CA2800E5-0E8B-41F4-BE18-EFCBB2B0509F (Ready)
-- watchOS --
watchOS 11.2 (22S99) - D0272B69-66EF-43E7-9421-160EFF5F6D59 (Ready)
watchOS 10.5 (21T575) - 8A918DA9-96C6-40F5-A2CC-3E46144AE937 (Ready)
-- xrOS --
xrOS 1.2 (21O5565d) - 0D8411AF-B9C3-4BB7-AA33-822A029A5A36 (Ready)
xrOS 2.3 (22N895) - 0B4E0E33-96F0-458B-871F-A6A3BEB6A559 (Ready)

Total Disk Images: 9 (53.1G)
$ APPLE_SDK_PLATFORM=XRSimulator APPLE_SDK_VERSION_OVERRIDE=2.2 xcrun actool --compile 'doc' --platform xrsimulator --minimum-deployment-target 1.0 --compress-pngs --target-device vision 'examples/resources/VisionAppIcon.xcassets'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>com.apple.actool.compilation-results</key>
	<dict>
		<key>output-files</key>
		<array>
			<string>/Users/ci/fwe_test/rules_apple/doc/Assets.car</string>
		</array>
	</dict>
</dict>
</plist>

Nevertheless, I removed the old runtime via xcrun simctl runtime delete 21O5565d. I'll test the new image in QA soon.
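
For future image updates, the builds to delete could be derived from the runtime list instead of being hard-coded. A minimal sketch, assuming the list format shown above, with the build number in parentheses as the third field; `other_runtime_builds` is a hypothetical helper:

```shell
# Reads `xcrun simctl runtime list` output on stdin and prints the build of
# every runtime for platform $1 except the build given in $2.
other_runtime_builds() {
    awk -v p="$1" -v keep="$2" '
        $1 == p && $3 ~ /^\(.*\)$/ {
            build = substr($3, 2, length($3) - 2)   # strip the parentheses
            if (build != keep) print build
        }'
}
```

Each printed build could then be passed to `xcrun simctl runtime delete`, for example `xcrun simctl runtime list | other_runtime_builds xrOS 22N895`.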

fweikert avatar Apr 03 '25 15:04 fweikert

Amazing, thank you, @fweikert! I don't think you necessarily need to remove 21O5565d but rather clear out any Xcode internal references to 22N799. But whatever is easiest for you all will be great, thanks.

mattrobmattrob avatar Apr 03 '25 16:04 mattrobmattrob

Even with only 2.3 it's still failing. Where does the reference to 2.2 (22N799) come from?

fweikert avatar Apr 04 '25 14:04 fweikert

My guess is that it's some internal state within Xcode. Perhaps retained from a previous version of Xcode before upgrading to the one associated with 22N895. I haven't tried to narrow it down much though. Hard to do without poking around on the VMs.

But that's what I was trying to convey: it's not the presence of the two runtimes in the list that is the problem. It's that Xcode thinks it can access something that doesn't actually exist, and therefore the requested SDK version maps to something that doesn't exist.

mattrobmattrob avatar Apr 04 '25 14:04 mattrobmattrob

I'm not that familiar with Xcode, so I'll do some more digging.

It's already interesting that system_profiler -json SPDeveloperToolsDataType shows visionOS 2.2, whereas xcrun simctl runtime list only shows 2.3

fweikert avatar Apr 04 '25 18:04 fweikert

I have no idea what's going on. Xcode > Components shows "visionOS 2.2 (22N799) SDK + visionOS 2.3 (22N895) Simulator", and I have no option to install anything else.

fweikert avatar Apr 07 '25 13:04 fweikert

I'm still looking at this off and on. At this point updating to 16.3 might sidestep some pieces ¯\_(ツ)_/¯

are these vms heavily resource constrained? these actions should be sub-second:


(01:21:45) [12,677 / 17,352] 6 actions running
--
  | AssetCatalogCompile .../ios/static_framework_with_transitive_resources-intermediates/xcassets; 255s local, remote-cache
  | //test/starlark_tests/targets_under_test/ios:static_framework_with_transitive_resources; 250s local, remote-cache
  | AssetCatalogCompile test/starlark_tests/targets_under_test/visionos/app-intermediates/xcassets; 247s local, remote-cache
  | ProcessEntitlementsFiles .../targets_under_test/watchos/single_target_app_entitlements.entitlements; 192s remote-cache, darwin-sandbox
  | ProcessEntitlementsFiles .../watchos/ios_watchos_with_watchos_extension_entitlements.entitlements; 166s remote-cache, darwin-sandbox

I'm seeing this pretty regularly trying to debug this issue

keith avatar Apr 17 '25 23:04 keith

This also happens for actions that aren't part of this repo:

[230 / 230] no actions running
  | Fetching repository @@bazel_tools+xcode_configure_extension+local_config_xcode; Building xcode-locator 293s
  | Fetching repository @@apple_support++apple_cc_configure_extension+local_config_apple_cc; starting 137s
  |  

which likely affects other projects' CI too

keith avatar Apr 17 '25 23:04 keith

Here's a set of jobs I cancelled while these hung https://buildkite.com/bazel/rules-apple-darwin/builds/10260#01964606-787b-4128-81bd-4dfef3218688

it's possible there's a GUI prompt related to xcode that's just hanging forever

keith avatar Apr 17 '25 23:04 keith

here's a job that spent >3 minutes cloning the repo https://buildkite.com/bazel/rules-apple-darwin/builds/10261#01964613-6d3e-423b-87b6-fbc49dedbb64

seems like something fishy is going on since this repo is very small

keith avatar Apr 17 '25 23:04 keith

I applied some workarounds here https://github.com/bazelbuild/rules_apple/pull/2679/. Notably for this thread, I had to delete simulators and set up the visionOS simulator manually. I think this is fine and I can drop it when Xcode is upgraded again.

I'm still interested in the performance aspects mentioned above ^

keith avatar Apr 19 '25 17:04 keith

for a data point on perf, here's a scheduled job that ran all the tests (no caching) before this update in 25 minutes https://buildkite.com/bazel/rules-apple-darwin/builds/10108#01955d8b-1375-4b44-bc54-731e6d05524f

my green build before merging these workarounds took >1 hour https://buildkite.com/bazel/rules-apple-darwin/builds/10292#_

my workarounds include limiting --jobs and related flags so it's not an entirely fair comparison, but I found that without those limits the build just timed out instead

keith avatar Apr 19 '25 19:04 keith

@meteorcloudy @fweikert Any update on this? rules_apple CI is unbearable now, and tests regularly time out.

brentleyjones avatar Apr 23 '25 17:04 brentleyjones

The Mac machines are very resource constrained, so running multiple large integration tests in parallel will likely cause tests to be flaky or time out. We had to limit the number of local test jobs for bazel to 2: https://github.com/bazelbuild/bazel/blob/ba6f6f7ca8c9e377afdf22d05a8860d2b1adbc20/.bazelci/presubmit.yml#L207-L208

meteorcloudy avatar Apr 24 '25 10:04 meteorcloudy