runtime: AVX512 register state clobbered by signal on macOS
What version of Go are you using (go version)?
% go version
go version go1.17.2 darwin/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
% go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/vsi/Library/Caches/go-build"
GOENV="/Users/vsi/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/vsi/go/pkg/mod"
GONOPROXY="github.com/vsivsi"
GONOSUMDB="github.com/vsivsi"
GOOS="darwin"
GOPATH="/Users/vsi/go"
GOPRIVATE="github.com/vsivsi"
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.17.2/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.17.2/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.17.2"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/kp/kjdr0ytx5z9djnq4ysl15x0h0000gn/T/go-build4285878374=/tmp/go-build -gno-record-gcc-switches -fno-common"
What did you do?
While writing AVX-512 assembly functions using the Go assembler (via Avo), I began to notice unexplained intermittent test failures. A lot of painful investigation revealed that Go's async preemption does not seem to properly save and restore AVX-512 state when assembly functions are interrupted. From visual inspection of the runtime/preempt_amd64.s source, it also seems likely that AVX/AVX2 state (the upper 128 bits of the YMM registers) is not properly saved and restored either, but I have not encountered or tested that case in my own code.
A complete minimal reproduction illustrating the clobbering of the AVX512 K1 opmask register is here: https://gist.github.com/vsivsi/fff8618ace4b02eb410dd8792779bf32
Note that in my testing, running with GODEBUG=asyncpreemptoff=1 rescues every failure case I have identified.
What did you expect to see?
I expected all relevant processor state to be properly restored in an assembly function following an async preemption.
What did you see instead?
Critical state is being clobbered, leading to intermittent undefined behavior.
CC @ianlancetaylor, @prattmic, @FiloSottile.
Hmm, this may be the VZEROUPPER we introduced in CL 219131 to fix #37174. Or it may be worse than that, I'm not sure.
@cherrymui
@vsivsi Can you try patching out the #ifdef GOOS_darwin clause in runtime/preempt_amd64.s and see if that helps?
The runtime doesn't async preempt assembly functions. asyncPreempt (the function in runtime/preempt_amd64.s) should not be executed while an assembly function is running. The runtime sends a signal, but if the signal lands in an assembly function, the signal handler returns immediately without actually preempting.
I suspect that the darwin kernel doesn't preserve AVX512 state when handling signals. Then the problem would not be preemption, but any asynchronous signal could cause it to fail, including profiling signals or user signals.
What version of macOS are you running on?
I have a vague memory that Apple suggests that before using AVX512 one must check its availability with some sysctl (but I'm not really sure). Do we do that?
@randall77 I can try that, but the initial issue I encountered (as reflected in the repro case) was clobbering of the K0-K7 opmask registers, which are distinct from the vector registers affected by VZEROUPPER, etc.
@cherrymui Yes, Darwin requires special checking to identify AVX512 support. I submitted a fix for this last year. See here for details on that saga: https://github.com/golang/go/issues/43089
It is quite possible that this issue I'm seeing is new with golang 1.17. Is it possible that the new assembly adapter funcs required for inter-operation with the new function parameter passing spec could be interfering with the assembly code detection during preemption?
To maintain compatibility with existing assembly functions, the compiler generates adapter functions that convert between the new register-based calling convention and the previous stack-based calling convention. These adapters are typically invisible to users, except that taking the address of a Go function in assembly code or taking the address of an assembly function in Go code using reflect.ValueOf(fn).Pointer() or unsafe.Pointer will now return the address of the adapter. Code that depends on the value of these code pointers may no longer behave as expected. Adapters also may cause a very small performance overhead in two cases: calling an assembly function indirectly from Go via a func value, and calling Go functions from assembly.
How we detect asm routines during preempt changed in Go 1.18. (https://github.com/golang/go/commit/c2483a5c034152fcdfbb2e6dbcf48b0103d8db6a#diff-47fd68949147d260b91998c0d6eabffcdb74c58991fc386b40e14a6ed710b17d)
Does this reproduce at tip? (I'm AFK.)
It does, with:
% go version
go version devel go1.18-6113dacf32 Sat Oct 30 18:30:34 2021 +0000 darwin/amd64
% sysctl machdep.cpu.brand_string
machdep.cpu.brand_string: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
% sw_vers
ProductName: macOS
ProductVersion: 11.6
BuildVersion: 20G165
% go test .
--- FAIL: TestMask (0.43s)
maskcheck_test.go:16: Failed for iteration: 0 with return value: 88007197
maskcheck_test.go:16: Failed for iteration: 1 with return value: 61939268
maskcheck_test.go:16: Failed for iteration: 2 with return value: 50093958
maskcheck_test.go:16: Failed for iteration: 3 with return value: 57687622
maskcheck_test.go:16: Failed for iteration: 4 with return value: 55302035
maskcheck_test.go:16: Failed for iteration: 5 with return value: 50302791
maskcheck_test.go:16: Failed for iteration: 6 with return value: 55661062
maskcheck_test.go:16: Failed for iteration: 7 with return value: 54177345
maskcheck_test.go:16: Failed for iteration: 8 with return value: 58187395
maskcheck_test.go:16: Failed for iteration: 9 with return value: 51991146
maskcheck_test.go:16: Failed for iteration: 10 with return value: 56438664
maskcheck_test.go:16: Failed for iteration: 11 with return value: 54676159
maskcheck_test.go:16: Failed for iteration: 12 with return value: 50955817
maskcheck_test.go:16: Failed for iteration: 13 with return value: 52912278
maskcheck_test.go:16: Failed for iteration: 14 with return value: 57776646
maskcheck_test.go:16: Failed for iteration: 15 with return value: 54628233
maskcheck_test.go:16: Failed for iteration: 16 with return value: 55309645
maskcheck_test.go:16: Failed for iteration: 17 with return value: 52719044
maskcheck_test.go:16: Failed for iteration: 18 with return value: 57165517
maskcheck_test.go:16: Failed for iteration: 19 with return value: 57336883
FAIL
FAIL x 0.570s
FAIL
It is quite possible that this issue I'm seeing is new with golang 1.17. Is it possible that the new assembly adapter funcs required for inter-operation with the new function parameter passing spec could be interfering with the assembly code detection during preemption?
It seems unlikely. You can try building with Go 1.17 with GOEXPERIMENT=noregabi environment variable set (at go run/build/test time).
Also, you can try running the program with preemption disabled but CPU profiling enabled and see if profiling signal changes anything. You can also try running the program with preemption disabled while having a separate process sending SIGURG signals to it. Thanks.
@cherrymui Good suggestions. Running the repro test case as:
% GODEBUG=asyncpreemptoff=1 go test -cpuprofile cpu.out -count 1 -timeout 15m -run '^TestMask$' gist.github.com/vsivsi/fff8618ace4b02eb410dd8792779bf32
--- FAIL: TestMask (0.26s)
maskcheck_test.go:21: Failed for iteration: 0 with return value: 47429074
maskcheck_test.go:21: Failed for iteration: 1 with return value: 45358536
maskcheck_test.go:21: Failed for iteration: 2 with return value: 52675366
maskcheck_test.go:21: Failed for iteration: 3 with return value: 44456167
maskcheck_test.go:21: Failed for iteration: 4 with return value: 72420817
maskcheck_test.go:21: Failed for iteration: 5 with return value: 72654067
maskcheck_test.go:21: Failed for iteration: 6 with return value: 72091434
maskcheck_test.go:21: Failed for iteration: 7 with return value: 73369163
maskcheck_test.go:21: Failed for iteration: 8 with return value: 72828760
maskcheck_test.go:21: Failed for iteration: 9 with return value: 73065114
maskcheck_test.go:21: Failed for iteration: 10 with return value: 73038149
maskcheck_test.go:21: Failed for iteration: 11 with return value: 72684990
maskcheck_test.go:21: Failed for iteration: 12 with return value: 72653599
maskcheck_test.go:21: Failed for iteration: 13 with return value: 72440260
maskcheck_test.go:21: Failed for iteration: 14 with return value: 72201890
maskcheck_test.go:21: Failed for iteration: 15 with return value: 45021003
maskcheck_test.go:21: Failed for iteration: 16 with return value: 72012922
maskcheck_test.go:21: Failed for iteration: 17 with return value: 44692569
maskcheck_test.go:21: Failed for iteration: 18 with return value: 72992403
maskcheck_test.go:21: Failed for iteration: 19 with return value: 73178568
FAIL
FAIL gist.github.com/vsivsi/fff8618ace4b02eb410dd8792779bf32 0.601s
FAIL
The test still fails, and with different-looking stats (each iteration fails earlier in the loop on average, probably due to different signaling rates between profiling and preemption).
Two other data points:
- I have not yet observed corruption of actual ZMM vector state in my testing. From what I've seen this appears to be limited to the K0-K7 opmask registers.
- The above observation is curious in light of the fact that the Darwin kernel code I've been looking at is at least not obviously broken w.r.t. handling the opmask registers and AVX512 thread state in signal handling. Opmask data is present in all of the relevant state structures, there appear to be tests that check that it is properly saved and restored, etc. So whatever is happening is more subtle than mere omission of this detail in the kernel. See e.g.: https://github.com/apple/darwin-xnu/search?q=opmask
@vsivsi thanks for the repro! So it seems it is not related to preemption but to signals in general. The observation about K0-K7 also suggests VZEROUPPER is not the cause.
Would it be possible to write a similar test in C and assembly (GNU syntax) and see if it is failing?
Maybe we need a different CPU feature probe to tell the kernel we're using K0-K7? (I haven't looked at the code carefully. Just a guess.)
@cherrymui The mechanism for using AVX-512 in Darwin doesn't involve telling the kernel which features to enable. The kernel simply advertises the available AVX-512 features in the process commpage data area. Once you've verified that a given required set of AVX-512 features is available, you simply use them. Darwin catches the UD interrupt for any AVX-512 instruction on first use, and promotes that thread to use the full AVX512 thread state, enabling every available feature.
Specifically, the three XCR0[5:7] bits (from XGETBV) indicate whether XSAVE will preserve:
- Registers K0-K7 ("opmask state")
- The upper 256 bits of ZMM0-15 ("ZMM_Hi256 state")
- The full width of ZMM16-31 ("Hi16_ZMM state")
Upon thread promotion following the UD interrupt from a supported AVX512 instruction, Darwin sets XCR0[5:7] = 111b for the thread, enabling the full "AVX512 state" in XSAVE.
The kernel definitions of these XCR0 bits are here: https://github.com/apple/darwin-xnu/blob/8f02f2a044b9bb1ad951987ef5bab20ec9486310/osfmk/i386/proc_reg.h#L168-L188
The full AVX512 enabled mask is composed here: https://github.com/apple/darwin-xnu/blob/8f02f2a044b9bb1ad951987ef5bab20ec9486310/osfmk/i386/fpu.h#L71
And this is the function that runs when a UD for an AVX512 instruction is trapped, triggering the XSAVE state promotion: https://github.com/apple/darwin-xnu/blob/8f02f2a044b9bb1ad951987ef5bab20ec9486310/osfmk/i386/fpu.c#L1472
@cherrymui Here is an attempted reproduction of this issue in GNU C and Asm. In my testing it does not fail.
https://gist.github.com/vsivsi/8511aca1bac528f49fbb45a636afa4b5
% gcc testmask.c testmask.s && ./a.out &
[1] 27524
% for ((x=0;x<1000000;x++)); do kill -s URG 27524; done
Test Passed!
[1] + done ./a.out
Thanks for trying that. In your C code, could you install a signal handler for SIGURG (which could simply return immediately)? Thanks.
@cherrymui Gist updated with empty SIGURG handler. Now it fails as with go runtime:
% gcc testmask.c testmask.s && ./a.out &
[1] 28786
% for ((x=0;x<20;x++)); do kill -s URG 28786; sleep 0.01; done
Failed for iteration: 133 with return value: 90069936
Failed for iteration: 134 with return value: 79705026
Failed for iteration: 135 with return value: 79757341
Failed for iteration: 136 with return value: 79875856
Failed for iteration: 137 with return value: 82261267
Failed for iteration: 138 with return value: 83127231
Failed for iteration: 139 with return value: 83641375
Failed for iteration: 140 with return value: 83352568
Failed for iteration: 141 with return value: 82555528
Failed for iteration: 142 with return value: 79494222
Failed for iteration: 143 with return value: 78981164
Failed for iteration: 144 with return value: 81261765
Failed for iteration: 145 with return value: 80885813
Failed for iteration: 146 with return value: 79706999
Failed for iteration: 147 with return value: 83441689
Failed for iteration: 148 with return value: 80171870
Failed for iteration: 149 with return value: 79769830
Failed for iteration: 150 with return value: 81361308
Failed for iteration: 151 with return value: 81237987
Failed for iteration: 152 with return value: 79568372
Test failed 20 times!
[1] + done ./a.out
This reminds me of https://github.com/golang/go/issues/37174#issuecomment-584914106 Sounds like Apple needs some more regression tests for their sigaction code path.
I'm sure you all are happy that this doesn't appear to be a problem in golang, but it still seems like a problem for golang, no?
Thanks @vsivsi ! That indicates the kernel's signal handling code doesn't seem to do the right thing.
So what's the next step here? Throwing a process into an indeterminate state via external signal feels like a CVE-level issue.
If there is a simple workaround that we could do in the runtime, we'll do it, like the VZEROUPPER for #37174. I don't know how to work around this one, though. One probably has to either not use K0-K7 registers or block all signals.
I guess the next step would be reporting to Apple and having them fix the kernel.
While they're at it, they should fix this one too: https://github.com/golang/go/issues/42649
Also, not using the K0-K7 opmasks == not using AVX-512 for all practical purposes.
We need to report a bug to Apple. I can do that - I've done it before, and supposedly a report from Google has a bit more visibility than one from a random internet user. Any idea what range of OS versions this causes a problem on? I don't see an OS version in the OP.
@randall77 I've seen it on Catalina (10.15.x) and Big Sur (11.x), but I bet it goes all the way back to the first AVX-512 support, which I believe was High Sierra (10.13.x) when the first iMac Pro released with a Xeon supporting AVX-512.
As it is not safe to use, maybe we could consider hardwiring AVX512 to false on Darwin until the kernel is fixed?
We could hardwire internal/cpu (and x/sys/cpu) flags to false for avx512. But if the asm uses cpuid itself to figure out what is available, I don't think we'd be able to prevent avx512 use in that case.
The only influence golang has over this is the value set for x/sys/cpu.X86.HasAVX512 which anyone is free to ignore (at their potential peril). But temporarily disabling that won't solve the problem of existing go binaries, or running on unpatched Darwin after Apple releases a fix. The robust solution would be to actually test for the live vuln when setting x/sys/cpu.X86.HasAVX512 at runtime.
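For code that selects its own AVX-512 paths, a caller-side guard along the lines discussed above might look like this. It is a minimal sketch under stated assumptions: `avx512Usable` is a hypothetical helper, and the CPU flag is passed in as a plain bool (in real code it would typically come from x/sys/cpu.X86.HasAVX512) so that the example has no external dependencies. It does not detect whether the kernel bug is actually present; it simply refuses AVX-512 on Darwin.

```go
package main

import (
	"fmt"
	"runtime"
)

// avx512Usable is a hypothetical guard: it reports whether an AVX-512 code
// path should be taken, treating Darwin as unsafe regardless of what the CPU
// advertises. cpuHasAVX512 stands in for a flag such as
// x/sys/cpu.X86.HasAVX512.
func avx512Usable(goos string, cpuHasAVX512 bool) bool {
	if goos == "darwin" {
		return false
	}
	return cpuHasAVX512
}

func main() {
	// Assumption for the example: pretend the CPU advertises AVX-512.
	const cpuHasAVX512 = true
	fmt.Println("use AVX-512 path:", avx512Usable(runtime.GOOS, cpuHasAVX512))
}
```

As noted above, a guard like this has no effect on assembly that queries CPUID directly, or on binaries that were already built without it.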
Here's a simplified reproduction, just run it and see if it prints anything:
testmask.c
#include <stdio.h>
#include <inttypes.h>
#include <signal.h>
#include <unistd.h>
#include <pthread.h>

volatile int8_t done = 0;

// in testmask.s
uint32_t masktest(uint32_t x);

void sig_handler(int signum) {
    // Tell the assembly loop to return.
    done = 1;
}

void *tester(void* unused) {
    uint32_t x = 0xabcd1234;
    uint32_t y = masktest(x);
    if (x != y) {
        printf("bad: %x -> %x\n", x, y);
    }
    return NULL;
}

int main(int argc, char *argv[]) {
    // Set up simple signal handler.
    signal(SIGURG, sig_handler);

    // Start worker.
    pthread_t worker;
    pthread_create(&worker, NULL, tester, NULL);

    // Wait until the worker is in its assembly loop.
    usleep(1000);

    // Send a signal to the worker.
    pthread_kill(worker, SIGURG);

    // Wait until worker is done.
    pthread_join(worker, NULL);
}
testmask.s
    .globl _done
    .globl _masktest
_masktest:
    // Put a value in the k1 avx512 register.
    kmovd %edi, %k1

    // Wait until we're told to return.
loop:
    cmpb $0, _done(%rip)
    je loop

    // Return the value in the k1 register.
    kmovd %k1, %eax
    ret
Build with gcc testmask.c testmask.s and run with ./a.out.
Mac OSX bug report here: https://feedbackassistant.apple.com/feedback/9736652 (If you can see it, not sure how they deal with permissions.)
There doesn't seem to be an easy way to let anyone else see my bug report. So here's a screenshot of it:

Change https://golang.org/cl/361255 mentions this issue: cpu: pretend AVX-512 is disabled on Darwin
Posted to Intel software dev forum:
https://community.intel.com/t5/Software-Tuning-Performance/MacOS-Darwin-kernel-bug-clobbers-AVX-512-opmask-register-state/m-p/1327259#M7970