Vc
Vc copied to clipboard
GCC spills function call arguments into unaligned stack, causing a crash
Vc version / revision | Operating System | Compiler & Version | Compiler Flags | Assembler & Version | CPU |
---|---|---|---|---|---|
Vc 1.4.1 and 1.3.3 | Windows 10 | Mingw64: GCC 7.3.0 and 8.1.0 | -std=c++11 | AVX-capable |
Hi, all!
I have a weird problem in Krita on Windows. It might be actually not Vc's problem, but GCC, but since I cannot reproduce it in a limited scope, I would like to ask your opinion on that.
Here in Krita we have Vc-optimized code that does blending of colors in 32-bit float color space. Here is the link to the source. There we use Vc::InterleavedMemoryWrapper for reading/writing interleaved pixels into local variables.
The problem happens in line 142, when we try to write pixels back into memory. It looks like GCC tries to de-inline some VC functions, and spills some variables into the stack (afaict, to fit calling conventions). GCC uses vmovaps for that, but the stack is not aligned, therefore the application crashes. Here is the assembly around the line:
# write pixels
0x00007ffa5f4dd580 <+2656>: vmovdqu 0x60(%rsp),%ymm5
0x00007ffa5f4dd586 <+2662>: mov %rsi,0x20(%rbx)
0x00007ffa5f4dd58a <+2666>: mov %rbx,%rdx
0x00007ffa5f4dd58d <+2669>: mov %rsi,%rcx
0x00007ffa5f4dd590 <+2672>: mov 0x168(%rsp),%rax
0x00007ffa5f4dd598 <+2680>: vmovdqa %ymm5,(%rbx)
0x00007ffa5f4dd59c <+2684>: mov 0x158(%rsp),%r9
0x00007ffa5f4dd5a4 <+2692>: mov 0x160(%rsp),%r8
# %rsp has alignment by 16 bytes only!
=> 0x00007ffa5f4dd5ac <+2700>: vmovaps %ymm3,0x1f0(%rsp)
0x00007ffa5f4dd5b5 <+2709>: vmovaps %ymm2,0x1d0(%rsp)
0x00007ffa5f4dd5be <+2718>: mov %rax,0x28(%rsp)
0x00007ffa5f4dd5c3 <+2723>: mov 0x170(%rsp),%rax
0x00007ffa5f4dd5cb <+2731>: vmovaps %ymm1,0x1b0(%rsp)
0x00007ffa5f4dd5d4 <+2740>: vmovaps %ymm6,0x190(%rsp)
0x00007ffa5f4dd5dd <+2749>: mov %rax,0x20(%rsp)
#interleave
0x00007ffa5f4dd5e2 <+2754>: callq 0x7ffa5f5111c0 <Vc_1::Detail::InterleaveImpl<Vc_1::Vector<float, Vc_1::VectorAbi::Avx>, 8, 32ull>::interleave<Vc_1::SimdArray<int, 8ull, Vc_1::Vector<int, Vc_1::VectorAbi::Avx>, 8ull> >(float*, Vc_1::SimdArray<int, 8ull, Vc_1::Vector<int, Vc_1::VectorAbi::Avx>, 8ull> const&, Vc_1::Vector<float, Vc_1::VectorAbi::Avx>, Vc_1::Vector<float, Vc_1::VectorAbi::Avx>, Vc_1::Vector<float, Vc_1::VectorAbi::Avx>, Vc_1::Vector<float, Vc_1::VectorAbi::Avx>)>
0x00007ffa5f4dd5e7 <+2759>: vmovups 0xa0(%rsp),%ymm5
0x00007ffa5f4dd5f0 <+2768>: vmovups 0x80(%rsp),%ymm4
# goto pixel_cycle_start;
0x00007ffa5f4dd5f9 <+2777>: jmpq 0x7ffa5f4dd170 <KoStreamedMath<(Vc_1::Implementation)7>::genericComposite<false, false, OverCompositor128<float, unsigned int, false, true>, 16>(KoCompositeOp::ParameterInfo const&)+1616>
I have found out that if I go to avx\detail.h
and change all static inline
into static Vc_INTRINSIC
, then the problem goes away (there is no crash, at least in this very place). So it looks like the problem is really in doing the function call.
It also looks like there is a related bugreport on GCC's bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412
So the questions I would like to ask you:
- Is it a known problem? Perhaps there is some known workaround for that?
- Can I pass some options to GCC or Vc to avoid/workaround such behavior? I tried to pass -O2 and -O3 and it didn't help.
- Do you have an idea how I can reproduce this problem in a localized testcase? I tried to make a test (here is the link), but neither on GodBolt, nor in real test on Windows+GCC the problem appears. In both cases the interleave function becomes inlined, and the crash doesn't happen.
This bug in MSYS seems to be related as well: https://github.com/msys2/MSYS2-packages/issues/1209
This looks like it hits the linked GCC PR, yes. However from what I learned about __cdecl
, MSVC makes any overaligned by-value function call argument ill-formed. This seems to be a limitation of the calling convention. So the code GCC compiles to crashing code, would be ill-formed on MSVC. __vectorcall
(or /Gv
) can lift this limitation on MSVC. Dunno if GCC supports __vectorcall
, that would be the easiest solution.
Otherwise we'd need to change all internal APIs that pass overaligned objects by value to pass via const-ref instead. But please #ifdef
ed on whether it's on Windows.