MSYS2-packages icon indicating copy to clipboard operation
MSYS2-packages copied to clipboard

Investigation into the mingw-w64 AVX AND AVX2 misalignment bug.

Open silvioprog opened this issue 7 years ago • 28 comments

Hello.

Please upgrade the following setups from msys2 home page:

msys2-i686-20161025.exe
msys2-x86_64-20161025.exe

This will prevent beginners from thinking that the project was left in 2016.

Thank you!

silvioprog avatar Mar 27 '18 13:03 silvioprog

Yes, it's a rolling distribution with new package builds coming out all of the time.

mingwandroid avatar Apr 07 '18 23:04 mingwandroid

@RoyiAvital you can get more recent setup here https://sourceforge.net/projects/msys2/files/Base/

Alexpux avatar Apr 08 '18 05:04 Alexpux

For instance, when will be the AVX and AVX2 code on Windows 64 Bit issue be resolved?

What issue are you talking about? Please provide a link.

You can compile software that uses AVX and AVX2 quite happily using our compilers. Our prebuilt binaries cannot in general use AVX and AVX2 because our end-user's machines may not support those CPU features and that would result in crashes due to illegal instructions. This is exactly the same thing that every software distribution must contend with and we are no worse.

Having said that some of the better written software out there (such as OpenBLAS) that we provide pre-built packages for implement runtime CPU detection and dispatch and this software will take good advantage of AVX and AVX2 when the machine supports it.

In many general purpose C and C++ library cases AVX and AVX2 do not provide much very speed-up anyway. These features work best when dealing with things like heavy matrix computation and also with hand-crafted assembly language (or at least compiler intrinsics).

mingwandroid avatar Apr 08 '18 10:04 mingwandroid

@mingwandroid , The problem is Windows 64 Bit requires 32 Byte Alignment for AVX & AVX2. I'm really not an expert on this but since GCC (As utilized by MinGW64 and MSYS2) aligns with 16 Byte it seems the code isn't compatible with Windows 64.

I'm talking about code of the user. not libraries supplied with MinGW64 or MSYS2.

References:

  1. minigw-w64 is incapable of 32 byte stack alignment, easy work around or switch compilers?.
  2. How to align stack at 32 byte boundary in GCC?.
  3. Sleef - Can't produce AVX & AVX2 Code in Windows 64 Bit due to ABI Issues in GCC / MinGW64.
  4. MinGW64 Discussion Board - [Mingw-w64-public] AVX support is broken in 64-bit mode! Will there ever be a fix?.
  5. Wrapper for __m256 Producing Segmentation Fault with Constructor - Windows 64 + MinGW + AVX Issues.

What I'm asking is 2 things:

  1. Is this bug relevant even in MinGW64 based on GCC 7.x?
  2. If it does relevant, anyone working on fixing it?

RoyiAvital avatar Apr 08 '18 12:04 RoyiAvital

For instance, when will be the AVX and AVX2 code on Windows 64 Bit issue be resolved?

@RoyiAvital,

This is clearly a GCC issue and not an MSYS2 issue, so you should ask GCC or mingw-w64 about that. We patch a few things in GCC but do not tend to fix things that are this low-level.

Is it really so difficult to determine the correct place to report issues to?

mingwandroid avatar Apr 08 '18 13:04 mingwandroid

What I'm asking is 2 things: Is this bug relevant even in MinGW64 based on GCC 7.x?

And you are asking questions that you should try to determine the answers to yourself, reproduction cases would be useful for someone with time to look into this.

mingwandroid avatar Apr 08 '18 13:04 mingwandroid

Here is the simplest code to reproduce it:

https://stackoverflow.com/questions/30926241

I'm sorry to post here. My question regarding the MSYS2 is simple and relevant - Does MSYS2 allows generating AVX & AVX2 code which is compatible with Windows 64?

RoyiAvital avatar Apr 08 '18 13:04 RoyiAvital

Here is the simplest code to reproduce it:

Thank you. But this example does not work out of the box according to the comments. You should modify it and paste it here instead of expecting others to do this work for you.

My question regarding the MSYS2 is simple and relevant - Does MSYS2 allows generating AVX & AVX2 code which is compatible with Windows 64?

As I already explained, I do not think anyone from MSYS2 will have the time to look into this and we do not tend to fix such low level bugs in mingw-w64/GCC which is where this bug is located.

.. and that's OK because it's a mingw-w64/GCC bug.

mingwandroid avatar Apr 08 '18 13:04 mingwandroid

This seems to be a code which creates the issue:

#include <immintrin.h>

void foo(__m256 x) {}

int main()
{
    __m256 r = _mm256_set1_ps(0.0f);
    foo(r);
}

Compiling with -mavx. I think it is simple.

RoyiAvital avatar Apr 08 '18 13:04 RoyiAvital

FWIW, all of this is very much unrelated to the original issue, while this thread now is hijacked for a completely different matter.

As mentioned in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412, clang and MSVC don't seem to have the same issue. An adjusted version of the example code, avxalign.c:

#include <immintrin.h>

void foo(__m256 x);

void func(void)
{
    __m256 r = _mm256_set1_ps(0.0f);
    foo(r);
}

If compiled with GCC:

$ x86_64-w64-mingw32-gcc -S -O2 -mavx avxalign.c -o -
func:   
        subq    $72, %rsp
        .seh_stackalloc 72
        .seh_endprologue
        vxorps  %xmm0, %xmm0, %xmm0
        leaq    32(%rsp), %rcx
        vmovaps %ymm0, 32(%rsp)
        vzeroupper
        call    foo
        nop
        addq    $72, %rsp
        ret

This doesn't align the pointer where the argument is stored, and writes into it with vmovaps.

With clang:

$ clang -target x86_64-w64-mingw32 -S -O2 -mavx avxalign.c -o -
func:                                   # @func
.seh_proc func
        pushq   %rbp
        .seh_pushreg 5
        subq    $80, %rsp
        .seh_stackalloc 80
        leaq    80(%rsp), %rbp
        .seh_setframe 5, 80
        .seh_endprologue
        andq    $-32, %rsp
        vxorps  %xmm0, %xmm0, %xmm0
        vmovaps %ymm0, 32(%rsp)
        leaq    32(%rsp), %rcx
        vzeroupper
        callq   foo
        nop
        movq    %rbp, %rsp
        popq    %rbp
        retq

This overallocates stack space in order to be able to align it, and then writes into it with an aligned write.

If not using SEH, by adding -fdwarf-exceptions, it produces different code that also does the alignment:

$ clang -target x86_64-w64-mingw32 -S -O2 -mavx avxalign.c -o - -fdwarf-exceptions
func:                                   # @func
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        andq    $-32, %rsp
        subq    $96, %rsp
        vxorps  %xmm0, %xmm0, %xmm0
        vmovaps %ymm0, 32(%rsp)
        leaq    32(%rsp), %rcx
        vzeroupper
        callq   foo
        movq    %rbp, %rsp
        popq    %rbp
        retq

This is what MSVC produces:

$ cl -c -O2 avxalign.c
$ x86_64-w64-mingw32-objdump -d avxalign.obj
0000000000000000 <func>:
   0:   40 55                   rex push %rbp
   2:   48 83 ec 60             sub    $0x60,%rsp
   6:   48 8d 6c 24 40          lea    0x40(%rsp),%rbp
   b:   48 83 e5 e0             and    $0xffffffffffffffe0,%rbp
   f:   c5 fc 10 05 00 00 00    vmovups 0x0(%rip),%ymm0        # 17 <func+0x17>
  16:   00 
  17:   c5 fc 11 45 00          vmovups %ymm0,0x0(%rbp)
  1c:   48 8d 4d 00             lea    0x0(%rbp),%rcx
  20:   c5 f8 77                vzeroupper 
  23:   e8 00 00 00 00          callq  28 <func+0x28>
  28:   48 83 c4 60             add    $0x60,%rsp
  2c:   5d                      pop    %rbp
  2d:   c3                      retq   

This both aligns the pointer, and uses unaligned stores to write it onto the stack

However, this only seems to be an issue when passing such variables by value. Local variables seem to be properly aligned even with GCC:

$ cat avxalign2.c 
#include <immintrin.h>

void foo(__m256 *x);

void func(void)
{
    __m256 r = _mm256_set1_ps(0.0f);
    foo(&r);
}
$ x86_64-w64-mingw32-gcc -S -O2 -mavx avxalign2.c -o -
func:
        pushq   %rbp
        .seh_pushreg    %rbp
        movq    %rsp, %rbp
        .seh_setframe   %rbp, 0
        subq    $32, %rsp
        .seh_stackalloc 32
        .seh_endprologue
        vxorps  %xmm0, %xmm0, %xmm0
        subq    $64, %rsp
        leaq    63(%rsp), %rcx
        andq    $-32, %rcx
        vmovaps %ymm0, (%rcx)
        vzeroupper
        call    foo
        nop
        movq    %rbp, %rsp
        popq    %rbp
        ret

Here the local variable is properly aligned.

mstorsjo avatar Apr 08 '18 19:04 mstorsjo

@mstorsjo , Great analysis! It means that as long as variables are passed by reference the code generated will work.

The question is, how can it be fixed? Where the right place to post your analysis so people will fix it?

Thank You.

RoyiAvital avatar Apr 08 '18 20:04 RoyiAvital

The GCC bug report that I linked is the relevant one.

mstorsjo avatar Apr 08 '18 20:04 mstorsjo

Do you have account there to link to your analysis (Which states only variables transferred by value are an issue)?

It seems one must have a special granted account to post there which I don't have. It seems a gut from Folding@Home is even asking for this.

Thank You.

RoyiAvital avatar Apr 08 '18 21:04 RoyiAvital

I've retitled the bug since it went completely tangential immediately (2nd and 4th comment).

mingwandroid avatar Apr 08 '18 21:04 mingwandroid

Do you have account there to link to your analysis (Which states only variables transferred by value are an issue)?

I'm not sure if I have a GCC bugzilla account - anyone more affected by the issue than me can take it forward, I just gave the issue a brief look from the clang perspective.

mstorsjo avatar Apr 08 '18 22:04 mstorsjo

@mstorsjo ,

Could you tell how did you compile the above (Clang and GCC)? Version and distribution of MinGW64 / MSYS2? What version of GCC / CLang? Is there a MinGW64 target for Clang (Something like -target x86_64-w64-mingw64?

Thank You.

RoyiAvital avatar Apr 09 '18 06:04 RoyiAvital

@RoyiAvital it you look at the beginning of each snippet you can see used command.

mati865 avatar Apr 09 '18 07:04 mati865

@mati865 , I saw those. They show the command used for compilation, not the environment (Compiler Version, MSYS / CYGWIN / MinGW / MinGW64 version, etc...).

On a side note (for my own knowledge) I wanted to ask what other option Clang has for it <sys> options. Above @mstorsjo used -target x86_64-w64-mingw32 I wonder if -target x86_64-w64-mingw64 makes any difference (If exists)? What other option are there beside mingw32 For 64 Bit Windows (The <sys> in -target x86_64-w64-<sys>).

RoyiAvital avatar Apr 09 '18 07:04 RoyiAvital

The environment used in the examples does not matter as it doesn't include anything or link anything - the command line contains everything needed. I tested with a recent clang svn version (the latest from about a day ago), but I'm pretty sure at least the last couple releases should behave the same.

There's no -mingw64 target, but -msvc and a maybe -cygwin.

mstorsjo avatar Apr 09 '18 09:04 mstorsjo

OK. Though I still think the GCC version could be important, no?

Just to add information, I think this comment is important - https://stackoverflow.com/questions/30928265/mingw64-is-incapable-of-32-byte-stack-alignment-required-for-avx-on-windows-x64?noredirect=1#comment86499640_30928265.

RoyiAvital avatar Apr 10 '18 04:04 RoyiAvital

Sorry for digging up this old thread, but hey I bumped into it!

@RoyiAvital: Though I still think the GCC version could be important, no?

It's no longer important now. My own tests plus 54412 seems to suggest that it happens on all SEH-enabled versions of GCC, which is everything after something like 4.8.9.

If it does relevant, anyone working on fixing it?

The GCC people are (understandably) looking to do a proper fix to make the compiler actually understand how alignments work, but they seem a bit stuck.

Artoria2e5 avatar Aug 23 '21 11:08 Artoria2e5

I have hacked together a very ugly patch that basically makes every place that generates aligned load/stores use unaligned instructions instead. I remember seeing someone say on Nehalem and newer the peanlty is very small or something.

0001-Force-use-unaligned-insns-as-49001-workaround.patch.txt

I will try to send it to makepkg to build and test it if possible, but in case I burn out before then you all know what to try. Heck, I still owe Alex & the cygwin people a cmdline parser...


A more graceful way to do the workaround may be changing as to add a flag for doing this sort of replacement. But come on, a hack is a hack.

Artoria2e5 avatar Aug 23 '21 14:08 Artoria2e5

Hi all!

I have a feeling that GCC 11.2.0 has this issue fixed. At least I cannot reproduce the crash in a trivial test. I have managed to compile packages. If you would like to test them, please fetch here: https://yadi.sk/d/rL4Lo6HFkAojPA

I will upload the sources of the script a bit later today (I had to change the patchset and have a problem with updating checksums)

UPD:

Here is a commit that makes it possible to compile GCC 11.2.0 for MinGW64: https://github.com/dimula73/MINGW-packages/commit/7065447358e9d98419482e1a00f0e896f72b784b

I'm not sure I know how MSYS handles multiple compilers at the same time, so I'm a bit of scared to propose a PR for that :)

dimula73 avatar Aug 26 '21 06:08 dimula73

I don't think it fixes the issue -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=28103 from bug 54412 still segfaults.

PS: Hmm, any idea why the MSYS2 GNU tar is not recognizing the .zst file extension even when zstd is installed? I really thought they would've, you know, updated for that. Well anyways pacman -U works.

Artoria2e5 avatar Aug 29 '21 10:08 Artoria2e5

I don't think it fixes the issue -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=28103 from bug 54412 still segfaults.

Hm... then it is just a coincidence that my local test passes :(

dimula73 avatar Aug 30 '21 12:08 dimula73

However, this only seems to be an issue when passing such variables by value. Local variables seem to be properly aligned even with GCC:

that's exactly the issue I'm seeing now: passing by value triggers segfault when compiled for haswell/skylake

pps83 avatar Apr 29 '25 09:04 pps83