penguinV AVX-SSE transition penalties

Transitions between AVX and SSE code incur penalties. To avoid them, we might want to add _mm256_zeroupper() at the end of every AVX SIMD function.

https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties (3.3. Method 3: Zeroing Registers)
https://www.agner.org/optimize/optimizing_cpp.pdf (p.109 - 12.1)
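
As a rough illustration of the suggestion, here is a minimal sketch of an AVX routine that calls _mm256_zeroupper() once its 256-bit work is done. The function name addFloats and its signature are made up for this example and are not part of penguinV. The AVX path is compiled only when the compiler defines __AVX__; otherwise the scalar loop handles everything.

```cpp
#include <cstddef>
#ifdef __AVX__
#include <immintrin.h>
#endif

// Hypothetical helper, not part of penguinV: out[i] = a[i] + b[i].
void addFloats( const float * a, const float * b, float * out, std::size_t count )
{
    std::size_t i = 0;
#ifdef __AVX__
    // Process 8 floats per iteration with 256-bit registers.
    for ( ; i + 8 <= count; i += 8 ) {
        const __m256 va = _mm256_loadu_ps( a + i );
        const __m256 vb = _mm256_loadu_ps( b + i );
        _mm256_storeu_ps( out + i, _mm256_add_ps( va, vb ) );
    }
    // Zero the upper 128 bits of all YMM registers so that legacy SSE code
    // executed later does not trigger the state transition.
    _mm256_zeroupper();
#endif
    // Scalar tail (or the whole array when AVX is not enabled).
    for ( ; i < count; ++i )
        out[i] = a[i] + b[i];
}
```

The call is placed after the last 256-bit instruction, so whatever code runs after the function returns starts from a clean upper-register state.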

Jan 09 '19 17:01 0x72D0

We need to consider this. All our functions have both AVX and SSE versions, so ideally we shouldn't face switching, but the idea is worth discussing.

Jan 09 '19 23:01 ihhub

That's not the case for every function, the Hough transform for example. Also, the compiler sometimes optimizes code with SIMD instructions on its own. So if we run the AVX version of Accumulate, for example, and then run a Hough transform in which the compiler has added SSE instructions, each SSE-AVX transition slows us down by 10 cycles.

Because of this, the hardware saves the contents of the upper 128 bits of the YMM registers when transitioning from 256-bit Intel® AVX to legacy Intel® SSE, and then restores these values when transitioning back from Intel® SSE to Intel® AVX (256-bit or 128-bit). The save and restore operations both cause a penalty that amounts to several tens of clock cycles for each operation.

1. Introduction to AVX-SSE Transition Penalties

Jan 10 '19 01:01 0x72D0

No-no, I understand your concern. What I meant is that, as of now, all our SIMD functions are implemented in both AVX and SSE, so on a CPU with AVX support we should run only AVX code without switching to SSE. I agree regarding the compiler's SIMD optimisation. But is it worth doing this for 10 cycles per AVX-to-SSE switch in functions where we process millions of bytes?

Jan 10 '19 01:01 ihhub

@0x72D0 we could just modify code like this:

#ifdef PENGUINV_AVX_SET
#define AVX_CODE( code )          \
if ( simdType == avx_function ) { \
    code;                         \
    _mm256_zeroupper();           \
    return;                       \
}
#else
#define AVX_CODE( code )
#endif
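
A possible variation on the macro above, sketched under assumptions: a small scope guard that calls _mm256_zeroupper() from its destructor, so the instruction runs on every exit path of an AVX function rather than only before one explicit return. The names ZeroUpperGuard and scaleFloats are hypothetical and not part of penguinV; this swaps the macro edit for an RAII pattern and is not what the macro itself does.

```cpp
#include <cstddef>
#ifdef __AVX__
#include <immintrin.h>
#endif

// Hypothetical scope guard, not part of penguinV.
struct ZeroUpperGuard
{
    ZeroUpperGuard() {}
    ZeroUpperGuard( const ZeroUpperGuard & ) = delete;
    ZeroUpperGuard & operator=( const ZeroUpperGuard & ) = delete;

    ~ZeroUpperGuard()
    {
#ifdef __AVX__
        _mm256_zeroupper(); // runs when the enclosing scope is left
#endif
    }
};

// Hypothetical AVX kernel: multiplies every element by 'factor' in place.
void scaleFloats( float * data, std::size_t count, float factor )
{
    ZeroUpperGuard guard; // covers the early return and the normal return

    if ( count == 0 )
        return;

    std::size_t i = 0;
#ifdef __AVX__
    const __m256 vf = _mm256_set1_ps( factor );
    for ( ; i + 8 <= count; i += 8 )
        _mm256_storeu_ps( data + i, _mm256_mul_ps( _mm256_loadu_ps( data + i ), vf ) );
#endif
    // Scalar tail (or the whole array when AVX is not enabled).
    for ( ; i < count; ++i )
        data[i] *= factor;
}
```

The trade-off is that the guard also fires on paths that never touched a YMM register, which costs one extra instruction there.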

But what would the penalty be in the multithreaded case?

Jan 10 '19 01:01 ihhub

Yeah, maybe with multithreading we would have a significant performance loss. Also, I found this topic when searching for VZEROUPPER latencies:

When AVX was introduced with 256-bit vector registers, we were told to use the instruction VZEROUPPER to avoid a severe penalty when switching between VEX and non-VEX code. Four generations of Intel processors had such a penalty (Sandy Bridge, Ivy Bridge, Haswell, and Broadwell). AMD processors and later Intel processors (Skylake and Knights Landing) do not have such a state switch. They have no need for the VZEROUPPER.

https://www.agner.org/optimize/blog/read.php?i=789

So if Skylake is not affected, the state-switch penalties might just disappear with time.

Jan 10 '19 01:01 0x72D0

I added this issue to the WishList as it's not so urgent, but it's good to review in the future for sure.

Jan 10 '19 02:01 ihhub