GHA: Promote macOS-arm64 cross-compilation job to full native job
Using the new CI runners (awesome performance!).
2 remaining failures:
- std.internal.math.gammafunction unittests, with enabled optimizations only
- lit-test driver/config_diag.d
These work for Cirrus CI, on macOS 12 (not 14, and surely a different Xcode version too).
Hmm, some strange miscompile somehow?
lit-test driver/config_diag.d works for me, macOS 14.2.1, LLVM 17, Apple clang 15.0.0.
And the Phobos failure also works locally for me:
❯ bin/ldc2 -O -main -unittest -run ../ldc/runtime/phobos/std/internal/math/gammafunction.d
1 modules passed unittests
Before merging this PR, I think I should download the artifacts (that's possible right?) and compare the output of the gamma unittest with my local build, and see if I can figure out what the miscompile is. Otherwise, I fear we release with a somehow miscompiling compiler...
I downloaded the osx-universal artifact:
bin/ldc2 -O -main -unittest -run import/std/internal/math/gammafunction.d
Passes fine. Should I be running something else?
bin/ldc2 -conf=/Users/johan/ldc/ldc/tests/driver/inputs/noswitches.conf
reproduces (it works with other LDC, but crashes with the artifact ldc)
About the bin/ldc2 -conf=/Users/johan/ldc/ldc/tests/driver/inputs/noswitches.conf failure. I may have found some hints:
- it fails while throwing an exception. We intend to throw (and catch) the exception, that is exactly what the test is testing (throw new Exception("Could not look up switches in " ~ cast(string) dCfPath);).
- after some searching I think it is the only case where we throw an exception in the compiler. What I mean is: I think it is the only CI test where inside the compiler an exception is thrown.
- when loading the ldc2 binary into lldb and running the test with -conf=, this is the output:
(lldb) run -conf=/Users/johan/ldc/ldc/tests/driver/inputs/noswitches.conf
Process 5078 launched: '/Users/johan/ldc/test_gha/ldc2-ce3f8516-osx-universal/bin/ldc2' (arm64)
Process 5078 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x0000000195ebddb4
libunwind.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::decodeFDE(libunwind::LocalAddressSpace&,
unsigned long, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info*,
libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info*, bool) + 48
libunwind.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::decodeFDE:
- when trying to check the backtrace (bt), lldb outputs a ton of these errors:
(lldb) bt
error: unable to find CIE at 0x33bbc for cie_id = 0xfffd1888 for entry at 0x5440.
error: unable to find CIE at 0x3887c for cie_id = 0xfffcf588 for entry at 0x7e00.
error: unable to find CIE at 0x10f38 for cie_id = 0xffff843c for entry at 0x9370.
error: unable to find CIE at 0x174b4 for cie_id = 0xffff2930 for entry at 0x9de0.
error: unable to find CIE at 0x3c990 for cie_id = 0xfffcd9e4 for entry at 0xa370.
- CIE is broken? https://stackoverflow.com/questions/23914453/lldb-unable-to-find-cie "CIE means Common Information Entry and is related to the Dwarf debug format."
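The numbers in those lldb errors can at least be cross-checked against each other. A minimal sketch, assuming lldb applies the `.eh_frame` convention here (the 4-byte CIE pointer field sits right after the 4-byte length of the entry, and its value is subtracted from the field's own offset to locate the CIE) with 32-bit wraparound. The helper names are made up for illustration; the values are copied from the log above.

```python
# (CIE offset lldb tried, cie_id, offset of the entry), copied from the lldb errors.
LLDB_ERRORS = [
    (0x33bbc, 0xfffd1888, 0x5440),
    (0x3887c, 0xfffcf588, 0x7e00),
    (0x10f38, 0xffff843c, 0x9370),
    (0x174b4, 0xffff2930, 0x9de0),
    (0x3c990, 0xfffcd9e4, 0xa370),
]


def cie_offset_from_entry(entry_offset: int, cie_id: int) -> int:
    """Locate the CIE the .eh_frame way (assumption: this is what lldb does).

    The CIE pointer field starts 4 bytes into the entry (after the 32-bit
    length); the CIE is expected at <field offset> - <cie_id>, modulo 2**32.
    """
    return (entry_offset + 4 - cie_id) & 0xFFFFFFFF


def as_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as two's-complement signed."""
    return value - (1 << 32) if value & 0x80000000 else value


for reported, cie_id, entry in LLDB_ERRORS:
    computed = cie_offset_from_entry(entry, cie_id)
    print(
        f"entry {entry:#x}: cie_id {cie_id:#x} ({as_signed32(cie_id):+#x} signed)"
        f" -> CIE at {computed:#x}, lldb reported {reported:#x}"
    )
```

All five computed offsets agree with what lldb printed, so the messages are consistent with that reading. Read as signed 32-bit values, every cie_id is negative, i.e. the CIE would lie after the entry that refers to it rather than before it. That is only an interpretation of the numbers, not a diagnosis of what produced them.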
The gammafunction module consistently fails on the new M1 GHA runners for the vanilla-LLVM jobs too, using vanilla LLVM 16 & 17. The config_diag.d lit-test works there though (different LLVM, no assertions, different host compiler, no LTO, no PGO...).
@JohanEngelen: So wrt. gammafunction, I'd expect you to see it too, with the regular Phobos unittest runner. - Wrt. the thrown exception for the extra .conf, I'm wondering how the current CI artifacts behave (cross-compiled, but still with PGO + LTO + mimalloc IIRC). [And note that we don't compile the CI artifacts with -g, otherwise they'd be huge.]
[Draft because of random Pure virtual function called! errors (at compiler runtime) in first experiments in #4604, very roughly 0-5 per CI run.] The previous issues are resolved by now.
The situation hasn't changed with latest LLVM v18.1.5 and the latest GHA macos-14 image. I've retried the CI job 2 times; the first 2 runs were green, the third now had one failure again:
/Users/runner/work/ldc/build/bin/ldmd2 -conf= -m64 -Irunnable -mcpu=native -g -link-defaultlib-debug -od../../../build/dmd-testsuite-debug/runnable -of../../../build/dmd-testsuite-debug/runnable/mars1_0 runnable/mars1.d
libc++abi: Pure virtual function called!
Error: Error executing /Users/runner/work/ldc/build/bin/ldc2: Abort trap: 6
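Since the crash is sporadic, a small repeat-runner can help estimate how often it hits outside of CI. A minimal sketch: `count_crashes` is a hypothetical helper, and the command passed to it would be whatever compiler invocation is under suspicion; the demo at the bottom uses a harmless stand-in command instead.

```python
import subprocess
import sys


def count_crashes(cmd, runs, marker="Pure virtual function called"):
    """Run `cmd` `runs` times; return (failed runs, failed runs showing `marker`).

    A run counts as failed when its exit status is nonzero; the marker is
    searched for in the captured stderr of failed runs only.
    """
    failures = 0
    marker_hits = 0
    for _ in range(runs):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            failures += 1
            if marker in proc.stderr:
                marker_hits += 1
    return failures, marker_hits


if __name__ == "__main__":
    # Stand-in command that always succeeds; replace with the real invocation.
    demo_cmd = [sys.executable, "-c", "pass"]
    failed, with_marker = count_crashes(demo_cmd, runs=5)
    print(f"{failed} of 5 runs failed, {with_marker} with the marker in stderr")
```

Counting the marker separately keeps unrelated failures (e.g. a genuine test failure) from inflating the crash rate.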
I went the extra mile and built the LLVM package on macos-14 with the oldest available Xcode v14.3.1, then switched to that Xcode and LLVM package here. No improvements - still sporadic 'pure virtual function called' crashes.
Oh man, this keeps getting weirder and more ball-busting. So I've now tried switching back to the old LDC-LLVM v17.0.6 package, which was cross-compiled on macos-12 (edit: or more likely even v11) at the time; I don't recall which Xcode version, but at most v14. While using Xcode v14.3.1 for LDC here.
Results for 2 CI pipelines/workflow runs, with 4 native macos arm64 jobs each, 2 without PGO and only D-LTO, and 2 with PGO plus full LTO (incl. C++ parts - the former 'unsupported stack probe' error vanished!):
- The 4 overall jobs with PGO + full LTO never encountered the 'pure virtual function called' errors so far; only failed sporadically for core.thread.fiber with enabled optimizations (=> https://github.com/ldc-developers/ldc/pull/4648).
- The other 4 jobs without PGO and only D-limited LTO encountered at least 1 'pure virtual function called' error every time.
So it looks as if PGO and/or full LTO might fix this abomination for that new combination of prebuilt LLVM and Xcode - the options that with LLVM 18 and latest Xcode v15.3 led to more failures! I'm gonna re-run the workflow some more times to see if this really holds.
Same results after 5 workflow runs, i.e., 10 jobs each - at least 1 pure-virtual-func error for all 10 jobs without PGO and D-limited LTO, not a single one for the 10 jobs with PGO + full LTO.
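To get a feeling for how unlikely a 10-vs-0 split is by pure chance, here is a rough sketch of a one-sided Fisher exact test on the 2x2 table (jobs with at least one pure-virtual error vs. clean jobs, per configuration). It assumes the 20 jobs are independent of each other, which jobs sharing a workflow run may not strictly be, so the number is a ballpark only. `fisher_one_sided` is a name made up for this example.

```python
from math import comb


def fisher_one_sided(fail_a, ok_a, fail_b, ok_b):
    """P(at least `fail_a` failing jobs land in group A) under the null
    hypothesis that failures are unrelated to the group (hypergeometric tail).
    """
    n_a = fail_a + ok_a              # jobs in group A
    total = n_a + fail_b + ok_b      # all jobs
    total_fail = fail_a + fail_b     # all failing jobs
    tail = sum(
        comb(total_fail, k) * comb(total - total_fail, n_a - k)
        for k in range(fail_a, min(n_a, total_fail) + 1)
    )
    return tail / comb(total, n_a)


# Group A: no PGO, D-limited LTO (10 failing, 0 clean).
# Group B: PGO + full LTO (0 failing, 10 clean).
p = fisher_one_sided(10, 0, 0, 10)
print(f"one-sided p-value: {p:.2e}")
```

With these counts the tail consists of a single table, so the value is 1 / C(20, 10) = 1 / 184756, on the order of 5e-6. Under the independence assumption, the difference between the two configurations is very unlikely to be a coincidence.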