Investigation into the mingw-w64 AVX AND AVX2 misalignment bug.
Hello.
Please upgrade the following setups from msys2 home page:
msys2-i686-20161025.exe
msys2-x86_64-20161025.exe
This will prevent beginners from thinking that the project was left in 2016.
Thank you!
Yes, it's a rolling distribution with new package builds coming out all of the time.
@RoyiAvital you can get more recent setup here https://sourceforge.net/projects/msys2/files/Base/
For instance, when will be the AVX and AVX2 code on Windows 64 Bit issue be resolved?
What issue are you talking about? Please provide a link.
You can compile software that uses AVX and AVX2 quite happily using our compilers. Our prebuilt binaries cannot in general use AVX and AVX2 because our end-user's machines may not support those CPU features and that would result in crashes due to illegal instructions. This is exactly the same thing that every software distribution must contend with and we are no worse.
Having said that some of the better written software out there (such as OpenBLAS) that we provide pre-built packages for implement runtime CPU detection and dispatch and this software will take good advantage of AVX and AVX2 when the machine supports it.
In many general purpose C and C++ library cases AVX and AVX2 do not provide much very speed-up anyway. These features work best when dealing with things like heavy matrix computation and also with hand-crafted assembly language (or at least compiler intrinsics).
@mingwandroid , The problem is Windows 64 Bit requires 32 Byte Alignment for AVX & AVX2. I'm really not an expert on this but since GCC (As utilized by MinGW64 and MSYS2) aligns with 16 Byte it seems the code isn't compatible with Windows 64.
I'm talking about code of the user. not libraries supplied with MinGW64 or MSYS2.
References:
- minigw-w64 is incapable of 32 byte stack alignment, easy work around or switch compilers?.
- How to align stack at 32 byte boundary in GCC?.
- Sleef - Can't produce AVX & AVX2 Code in Windows 64 Bit due to ABI Issues in GCC / MinGW64.
- MinGW64 Discussion Board - [Mingw-w64-public] AVX support is broken in 64-bit mode! Will there ever be a fix?.
- Wrapper for
__m256Producing Segmentation Fault with Constructor - Windows 64 + MinGW + AVX Issues.
What I'm asking is 2 things:
- Is this bug relevant even in MinGW64 based on GCC 7.x?
- If it does relevant, anyone working on fixing it?
For instance, when will be the AVX and AVX2 code on Windows 64 Bit issue be resolved?
@RoyiAvital,
This is clearly a GCC issue and not an MSYS2 issue, so you should ask GCC or mingw-w64 about that. We patch a few things in GCC but do not tend to fix things that are this low-level.
Is it really so difficult to determine the correct place to report issues to?
What I'm asking is 2 things: Is this bug relevant even in MinGW64 based on GCC 7.x?
And you are asking questions that you should try to determine the answers to yourself, reproduction cases would be useful for someone with time to look into this.
Here is the simplest code to reproduce it:
https://stackoverflow.com/questions/30926241
I'm sorry to post here. My question regarding the MSYS2 is simple and relevant - Does MSYS2 allows generating AVX & AVX2 code which is compatible with Windows 64?
Here is the simplest code to reproduce it:
Thank you. But this example does not work out of the box according to the comments. You should modify it and paste it here instead of expecting others to do this work for you.
My question regarding the MSYS2 is simple and relevant - Does MSYS2 allows generating AVX & AVX2 code which is compatible with Windows 64?
As I already explained, I do not think anyone from MSYS2 will have the time to look into this and we do not tend to fix such low level bugs in mingw-w64/GCC which is where this bug is located.
.. and that's OK because it's a mingw-w64/GCC bug.
This seems to be a code which creates the issue:
#include <immintrin.h>
void foo(__m256 x) {}
int main()
{
__m256 r = _mm256_set1_ps(0.0f);
foo(r);
}
Compiling with -mavx.
I think it is simple.
FWIW, all of this is very much unrelated to the original issue, while this thread now is hijacked for a completely different matter.
As mentioned in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412, clang and MSVC don't seem to have the same issue. An adjusted version of the example code, avxalign.c:
#include <immintrin.h>
void foo(__m256 x);
void func(void)
{
__m256 r = _mm256_set1_ps(0.0f);
foo(r);
}
If compiled with GCC:
$ x86_64-w64-mingw32-gcc -S -O2 -mavx avxalign.c -o -
func:
subq $72, %rsp
.seh_stackalloc 72
.seh_endprologue
vxorps %xmm0, %xmm0, %xmm0
leaq 32(%rsp), %rcx
vmovaps %ymm0, 32(%rsp)
vzeroupper
call foo
nop
addq $72, %rsp
ret
This doesn't align the pointer where the argument is stored, and writes into it with vmovaps.
With clang:
$ clang -target x86_64-w64-mingw32 -S -O2 -mavx avxalign.c -o -
func: # @func
.seh_proc func
pushq %rbp
.seh_pushreg 5
subq $80, %rsp
.seh_stackalloc 80
leaq 80(%rsp), %rbp
.seh_setframe 5, 80
.seh_endprologue
andq $-32, %rsp
vxorps %xmm0, %xmm0, %xmm0
vmovaps %ymm0, 32(%rsp)
leaq 32(%rsp), %rcx
vzeroupper
callq foo
nop
movq %rbp, %rsp
popq %rbp
retq
This overallocates stack space in order to be able to align it, and then writes into it with an aligned write.
If not using SEH, by adding -fdwarf-exceptions, it produces different code that also does the alignment:
$ clang -target x86_64-w64-mingw32 -S -O2 -mavx avxalign.c -o - -fdwarf-exceptions
func: # @func
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
andq $-32, %rsp
subq $96, %rsp
vxorps %xmm0, %xmm0, %xmm0
vmovaps %ymm0, 32(%rsp)
leaq 32(%rsp), %rcx
vzeroupper
callq foo
movq %rbp, %rsp
popq %rbp
retq
This is what MSVC produces:
$ cl -c -O2 avxalign.c
$ x86_64-w64-mingw32-objdump -d avxalign.obj
0000000000000000 <func>:
0: 40 55 rex push %rbp
2: 48 83 ec 60 sub $0x60,%rsp
6: 48 8d 6c 24 40 lea 0x40(%rsp),%rbp
b: 48 83 e5 e0 and $0xffffffffffffffe0,%rbp
f: c5 fc 10 05 00 00 00 vmovups 0x0(%rip),%ymm0 # 17 <func+0x17>
16: 00
17: c5 fc 11 45 00 vmovups %ymm0,0x0(%rbp)
1c: 48 8d 4d 00 lea 0x0(%rbp),%rcx
20: c5 f8 77 vzeroupper
23: e8 00 00 00 00 callq 28 <func+0x28>
28: 48 83 c4 60 add $0x60,%rsp
2c: 5d pop %rbp
2d: c3 retq
This both aligns the pointer, and uses unaligned stores to write it onto the stack
However, this only seems to be an issue when passing such variables by value. Local variables seem to be properly aligned even with GCC:
$ cat avxalign2.c
#include <immintrin.h>
void foo(__m256 *x);
void func(void)
{
__m256 r = _mm256_set1_ps(0.0f);
foo(&r);
}
$ x86_64-w64-mingw32-gcc -S -O2 -mavx avxalign2.c -o -
func:
pushq %rbp
.seh_pushreg %rbp
movq %rsp, %rbp
.seh_setframe %rbp, 0
subq $32, %rsp
.seh_stackalloc 32
.seh_endprologue
vxorps %xmm0, %xmm0, %xmm0
subq $64, %rsp
leaq 63(%rsp), %rcx
andq $-32, %rcx
vmovaps %ymm0, (%rcx)
vzeroupper
call foo
nop
movq %rbp, %rsp
popq %rbp
ret
Here the local variable is properly aligned.
@mstorsjo , Great analysis! It means that as long as variables are passed by reference the code generated will work.
The question is, how can it be fixed? Where the right place to post your analysis so people will fix it?
Thank You.
The GCC bug report that I linked is the relevant one.
Do you have account there to link to your analysis (Which states only variables transferred by value are an issue)?
It seems one must have a special granted account to post there which I don't have. It seems a gut from Folding@Home is even asking for this.
Thank You.
I've retitled the bug since it went completely tangential immediately (2nd and 4th comment).
Do you have account there to link to your analysis (Which states only variables transferred by value are an issue)?
I'm not sure if I have a GCC bugzilla account - anyone more affected by the issue than me can take it forward, I just gave the issue a brief look from the clang perspective.
@mstorsjo ,
Could you tell how did you compile the above (Clang and GCC)?
Version and distribution of MinGW64 / MSYS2?
What version of GCC / CLang?
Is there a MinGW64 target for Clang (Something like -target x86_64-w64-mingw64?
Thank You.
@RoyiAvital it you look at the beginning of each snippet you can see used command.
@mati865 , I saw those. They show the command used for compilation, not the environment (Compiler Version, MSYS / CYGWIN / MinGW / MinGW64 version, etc...).
On a side note (for my own knowledge) I wanted to ask what other option Clang has for it <sys> options.
Above @mstorsjo used -target x86_64-w64-mingw32 I wonder if -target x86_64-w64-mingw64 makes any difference (If exists)?
What other option are there beside mingw32 For 64 Bit Windows (The <sys> in -target x86_64-w64-<sys>).
The environment used in the examples does not matter as it doesn't include anything or link anything - the command line contains everything needed. I tested with a recent clang svn version (the latest from about a day ago), but I'm pretty sure at least the last couple releases should behave the same.
There's no -mingw64 target, but -msvc and a maybe -cygwin.
OK. Though I still think the GCC version could be important, no?
Just to add information, I think this comment is important - https://stackoverflow.com/questions/30928265/mingw64-is-incapable-of-32-byte-stack-alignment-required-for-avx-on-windows-x64?noredirect=1#comment86499640_30928265.
Sorry for digging up this old thread, but hey I bumped into it!
@RoyiAvital: Though I still think the GCC version could be important, no?
It's no longer important now. My own tests plus 54412 seems to suggest that it happens on all SEH-enabled versions of GCC, which is everything after something like 4.8.9.
If it does relevant, anyone working on fixing it?
The GCC people are (understandably) looking to do a proper fix to make the compiler actually understand how alignments work, but they seem a bit stuck.
I have hacked together a very ugly patch that basically makes every place that generates aligned load/stores use unaligned instructions instead. I remember seeing someone say on Nehalem and newer the peanlty is very small or something.
0001-Force-use-unaligned-insns-as-49001-workaround.patch.txt
I will try to send it to makepkg to build and test it if possible, but in case I burn out before then you all know what to try. Heck, I still owe Alex & the cygwin people a cmdline parser...
A more graceful way to do the workaround may be changing as to add a flag for doing this sort of replacement. But come on, a hack is a hack.
Hi all!
I have a feeling that GCC 11.2.0 has this issue fixed. At least I cannot reproduce the crash in a trivial test. I have managed to compile packages. If you would like to test them, please fetch here: https://yadi.sk/d/rL4Lo6HFkAojPA
I will upload the sources of the script a bit later today (I had to change the patchset and have a problem with updating checksums)
UPD:
Here is a commit that makes it possible to compile GCC 11.2.0 for MinGW64: https://github.com/dimula73/MINGW-packages/commit/7065447358e9d98419482e1a00f0e896f72b784b
I'm not sure I know how MSYS handles multiple compilers at the same time, so I'm a bit of scared to propose a PR for that :)
I don't think it fixes the issue -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=28103 from bug 54412 still segfaults.
PS: Hmm, any idea why the MSYS2 GNU tar is not recognizing the .zst file extension even when zstd is installed? I really thought they would've, you know, updated for that. Well anyways pacman -U works.
I don't think it fixes the issue -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=28103 from bug 54412 still segfaults.
Hm... then it is just a coincidence that my local test passes :(
However, this only seems to be an issue when passing such variables by value. Local variables seem to be properly aligned even with GCC:
that's exactly the issue I'm seeing now: passing by value triggers segfault when compiled for haswell/skylake