Remove use of native MMX intrinsics
MMX is broken on modern compilers and should be avoided at all costs.
Reasons:
- ⚠️ Clang and GCC DO NOT automatically switch between MMX and x87, resulting in undefined behavior and potential x87 stack overflows. MMX expects the programmer to manually keep x87 and MMX separate with _mm_empty() in between, and in a library like this the user shouldn't be expected to do that (a sketch of this hazard follows the list).
- GCC converts MMX to SSE2 on x64.
- Clang does not know how to properly optimize MMX vectors and generates garbage, especially when it clashes with autovectorization
- GCC and Clang will never autovectorize MMX
- MMX registers are not supported by the Windows x64 calling convention; only context-switch safety is guaranteed
- MMX has fewer registers on x64
- When using NEON, half vectors are constantly converted back and forth, so conversions should be fast.
- Operating systems that require MMX without requiring SSE2 are relatively uncommon.
- SSE2 is always available on x64
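To make the first bullet concrete, here is a minimal hypothetical sketch (not SIMDe code) of how the x87/MMX aliasing hazard bites when _mm_empty() is forgotten:

#include <mmintrin.h>

// Any MMX operation marks the x87 register stack as in use. Doing x87
// floating-point math afterwards without _mm_empty() can overflow the x87
// stack and silently produce garbage on 32-bit targets that use x87 for
// doubles.
double scale_after_mmx(double x, __m64 v)
{
  __m64 t = _mm_add_pi16(v, v); // x87 tag word is now marked "in use"
  (void) t;
  // _mm_empty(); // required here before any x87 math, but easy to forget
  return x * 2.0; // on a 32-bit x87 build this multiply can now misbehave
}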
If anyone wants to try this out, define SIMDE_X86_MMX_NO_NATIVE and no native MMX calls will be made.
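Opting in is just a matter of defining the macro before including the header; something like this, assuming the usual simde/x86/mmx.h include path:

#define SIMDE_X86_MMX_NO_NATIVE // fall back to the portable (non-MMX) implementations
#include "simde/x86/mmx.h"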
Operating systems that require MMX without requiring SSE2 are relatively uncommon.
FYI: Debian i386 (really i686) includes neither MMX nor SSE
https://wiki.debian.org/ArchitectureSpecificsMemo#i386-1
So, primarily looking at NEON, what if we had
typedef struct {
  ...
} simde_uint8x8_t;

typedef struct {
  ...
  #ifdef SIMDE_X86_SSE2_NATIVE
  __m128i m128i;
  #endif
} simde_private_uint8x8_t;
simde_private_uint8x8_t simde_uint8x8_t_to_private(simde_uint8x8_t x) {
  simde_private_uint8x8_t r;
  #ifdef SIMDE_X86_SSE2_NATIVE
  r.m128i = _mm_loadl_epi64((const __m128i *) &x);
  #else
  ...
  #endif
  return r;
}

simde_uint8x8_t simde_uint8x8_t_from_private(simde_private_uint8x8_t x) {
  simde_uint8x8_t r;
  #ifdef SIMDE_X86_SSE2_NATIVE
  _mm_storel_epi64((__m128i *) &r, x.m128i);
  #else
  ...
  #endif
  return r;
}
As long as the compiler can eliminate those loads we are golden. If not, some convincing must be done. None of the __m64 intrinsics can be used on MSVC x64, so _mm_cvtpi64_epi64 is out of the question. Once that is done, everything can be done in clean SSE2. This should also be done for simde__m64.
Ah, I understand better now. You are proposing that on X86_SSE2_NATIVE, 128 bit intrinsics are used to implement smaller data structures in other APIs to avoid use of MMX. That sounds reasonable, though it probably should be controlled by a flag for those who may not want that approach.
Yes, but also I say no MMX allowed, period. It isn't something that is transparently handled by the compiler (unlike, say, vzeroupper), which makes it inappropriate for a library that attempts to transparently handle intrinsics written for another platform.
As a matter of fact, I think the mmintrin.h polyfill should use SSE2 always instead of the awkward MMX.
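As a rough illustration of what an SSE2-only polyfill could look like (a hedged sketch; the wrapper type and function names here are made up and are not SIMDe's API):

#include <emmintrin.h> // SSE2
#include <stdint.h>

typedef struct { int64_t i64; } polyfill_m64; // hypothetical 64-bit wrapper type

// An _mm_add_pi8-style operation done entirely in SSE2: widen both halves
// into XMM registers, add, and write only the low 64 bits back, never
// touching real MMX state.
static polyfill_m64 polyfill_add_pi8(polyfill_m64 a, polyfill_m64 b)
{
  __m128i a_ = _mm_loadl_epi64((const __m128i *) &a);
  __m128i b_ = _mm_loadl_epi64((const __m128i *) &b);
  __m128i r_ = _mm_add_epi8(a_, b_);
  polyfill_m64 r;
  _mm_storel_epi64((__m128i *) &r, r_);
  return r;
}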
Example of MMX breaking things
Messing with the optimization levels can result in differing values, even if you put _mm_empty() after each intrinsic (which would force the vector to memory each time, killing performance). And while this is already an odd footgun for hand-written MMX intrinsics, if it is hidden behind SIMDe intrinsics it could easily corrupt the floating-point stack as a very subtle bug.
This goes in the category of a very rare but specific scenario that won't usually cause an issue until it does and everything is set on fire while the programmer tries to figure out why their math library randomly stops working.
128 bit intrinsics are used to implement smaller data structures in other APIs to avoid use of MMX.
This should actually be kept in general, unless array indexing is used. Add helpers to convert things to and from a 128-bit vector with undefined upper 64 bits (e.g. splat, zero extend, __builtin_shufflevector(x, x, 0, -1)), and use the 128-bit intrinsic. That means that any complicated intrinsic functions only have to be done once.
simde_uint8x8_t
simde_vadd_u8(simde_uint8x8_t a, simde_uint8x8_t b)
{
  #if NEON
  return vadd_u8(a, b);
  #elif 128-bit vector
  simde_uint8x16_t a_ = simde_vcombine_u8_undef(a);
  simde_uint8x16_t b_ = simde_vcombine_u8_undef(b);
  simde_uint8x16_t r_ = simde_vaddq_u8(a_, b_);
  return simde_vget_low_u8(r_);
  #else
  // Loop
  #endif
}
NEON is the only instruction set that cheaply converts between half vectors and full vectors, so this should be a win as long as we can avoid excessive swapping.
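For what it's worth, the simde_vcombine_u8_undef helper used above could be shaped roughly like this on compilers with __builtin_shufflevector (Clang, GCC 12+); the vector typedefs are illustrative stand-ins, not SIMDe's real types:

#include <stdint.h>

typedef uint8_t u8x8 __attribute__((vector_size(8)));   // stand-in for simde_uint8x8_t
typedef uint8_t u8x16 __attribute__((vector_size(16))); // stand-in for simde_uint8x16_t

// Widen 8 lanes to 16, leaving the upper 8 lanes unspecified (-1) so the
// compiler is free not to materialize them.
static u8x16 vcombine_u8_undef(u8x8 x)
{
  return __builtin_shufflevector(x, x, 0, 1, 2, 3, 4, 5, 6, 7,
                                 -1, -1, -1, -1, -1, -1, -1, -1);
}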
Sure, please do an experiment and we can study the results.
Some deep dives into codegen brought me to some interesting things:
- The ABI for __attribute__((vector_size(8))) / __m64 is inconsistent on 32-bit x86:

| Compiler | Features | vector_size(8) | __m64 |
|---|---|---|---|
| GCC | no MMX | stack | N/A |
| GCC | MMX | MMX | MMX |
| Clang/ICX | always | stack | stack |
| ICC classic | no SSE | stack | error |
| ICC classic | SSE | SSE and MMX | MMX |
| MSVC | always | N/A | MMX |
- GCC i386 half vectors always live in MMX registers, but GCC doesn't know what to do with them there, so it bounces them through the stack into general-purpose registers. It is truly awful. For example, GCC compiles
typedef int m64 __attribute__((vector_size(8)));

m64 foo(m64 a, m64 b)
{
  return a + b;
}
to this if MMX is enabled (regardless of whether SSE2 is available or not):
foo:
push edi
push esi
push ebx
sub esp, 40
movq QWORD PTR [esp], mm0
mov ecx, DWORD PTR [esp]
mov ebx, DWORD PTR [esp+4]
movq QWORD PTR [esp], mm1
mov eax, DWORD PTR [esp]
mov edx, DWORD PTR [esp+4]
mov esi, ecx
mov DWORD PTR [esp+20], ebx
add esi, eax
mov DWORD PTR [esp+8], eax
mov DWORD PTR [esp+12], edx
mov eax, DWORD PTR [esp+12]
movd mm0, esi
add eax, DWORD PTR [esp+20]
mov DWORD PTR [esp+16], ecx
add esp, 40
pop ebx
movd mm2, eax
punpckldq mm0, mm2
pop esi
pop edi
ret
GCC and Clang x64 use SSE2 btw.
If only we didn't have to keep sizeof == 8, but we can't break direct pointer access. Maybe we should suck it up and use int64_t or double. Or we could do something like #pragma GCC target("no-mmx"), but that likely causes its own problems.
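If someone wants to experiment with the pragma idea, it would look roughly like this (an untested sketch; whether it actually sidesteps the ABI weirdness is exactly the open question):

// Disable MMX codegen for this region only, then restore the previous options.
#pragma GCC push_options
#pragma GCC target("no-mmx")

typedef int m64 __attribute__((vector_size(8)));

m64 add_m64(m64 a, m64 b)
{
  return a + b;
}

#pragma GCC pop_options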
NEON_2_SSE uses an array union. https://github.com/intel/ARM_NEON_2_x86_SSE/blob/master/NEON_2_SSE.h
I say that for best results we should use:
- int64_t for x86 (because double uses x87 and GCC vectors use MMX). We can use _mm_set1_epi64x or _mm_loadl_epi64 on MSVC and a partial memcpy in GCC and Clang.
- double for x64 (because of _mm_set_sd and already being an XMM register).
- struct for everything else.
However, I think we should have an explicit option to allow vector sizes to be larger, only promising the memory layout with load/store intrinsics, e.g. SIMDE_ALLOW_OVERSIZED_VECTOR_TYPES.
This means that, e.g., vcombine(x, y) can be _mm_unpacklo_epi64(x, y) and vget_high can be _mm_shuffle_epi32(x, _MM_SHUFFLE(0, 0, 3, 2)) or _mm_bsrli_si128(x, 8).
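Spelled out as code, those two become single SSE2 instructions once half vectors are allowed to live in full __m128i registers (names here are illustrative):

#include <emmintrin.h>

// vcombine: pack the low 64 bits of each oversized half vector into one
// __m128i (low half from lo, high half from hi).
static __m128i oversized_vcombine(__m128i lo, __m128i hi)
{
  return _mm_unpacklo_epi64(lo, hi);
}

// vget_high: move the upper 64 bits down into the defined lower half; the
// new upper 64 bits are don't-care.
static __m128i oversized_vget_high(__m128i x)
{
  return _mm_bsrli_si128(x, 8);
}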
Ok idea for a roadmap:
High priority:
- Remove MMX intrinsics
- Change half vector type to avoid dangerous+slow MMX vectors
Low priority:
- Convert most half NEON intrinsics to vget_low(intrinsicq(vcombine(x, 0))) on platforms with native vector types. This also has the benefit of reducing complexity.
- Research "quick combines"
- Implement oversized vectors?
I have the minimum cost half vector ABI and conversion figured out. It unfortunately needs a hack on GCC to avoid excess movqs though.
x86 ABI
typedef struct {
  double f64;
} halfvec;
Reasoning:
- While double would be good if we could use vectorcall, it isn't universally available.
- Passed on the stack like normal arguments
- No inconsistent use of MMX registers
- Most stable option for cdecl
- Most efficient return without SSE4.1:
  - int64_t returns in eax:edx, which is slow without pextrd
  - double returns in st(0), requiring two memory accesses since there is no direct x87<->SSE transfer
x64 ABI
typedef double halfvec;
- Passed and returned in XMM registers
- While struct { double } is passed in registers on the SysV ABI, the Windows ABI passes it on the stack
- Literally zero cost
Clang, MSVC, portable
Using _mm_set_sd and _mm_cvtsd_f64 works flawlessly on these compilers. It is also portable.
__m128i halfvec_to_m128i(halfvec x)
{
  return _mm_castpd_si128(_mm_set_sd(x.f64)); // or just x on x64
}

halfvec m128i_to_halfvec(__m128i x)
{
  halfvec r;
  r.f64 = _mm_cvtsd_f64(_mm_castsi128_pd(x)); // or just r on x64
  return r;
}
GCC hack
GCC insists on a movq between conversions. This isn't terrible, as it is a fast instruction, but it still wastes code size.
Things I have tried:
- _mm_set_sd
- _mm_loadl_epi64
- _mm_load_sd
- memcpy
- Directly assigning to vector elements
- _mm_cvtsi64_si128 (not available on 32-bit)
- __builtin_shufflevector
The only remaining option is a register reinterpret with an empty inline assembly statement, which works flawlessly.
__m128i halfvec_to_m128i(halfvec x)
{
  __m128i r;
  // register reinterpret
  __asm__("" : "=x" (r) : "0" (x.f64)); // or just x
  return r;
}

halfvec m128i_to_halfvec(__m128i x)
{
  halfvec r;
  // register reinterpret
  __asm__("" : "=x" (r.f64) : "0" (x)); // or just r
  return r;
}
Explanation of the ASM:
__asm__(
  ""      // empty, no explicit instructions
  :       // Outputs:
  "=x"    // Assign to an SSE vector...
  (r)     // ...in the variable r
  :       // Inputs:
  "0"     // In the same constraint as operand 0...
  (x.f64) // ...put x.f64
);
The other architectures should use a normal GCC vector_size(8) type and __builtin_shufflevector(x, x, 0, 1, -1, -1) if it is available.
Also it might be good to have a testing mode that DEADBEEFs the upper bits.
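Something along these lines could serve as that testing mode (a hedged sketch; HALFVEC_POISON_HIGH is a hypothetical switch, not an existing SIMDe macro):

#include <emmintrin.h>

// When the poison switch is on, fill the otherwise-undefined upper 64 bits
// with 0xDEADBEEF so anything that wrongly depends on them fails loudly.
static __m128i halfvec_to_m128i_checked(double x)
{
#if defined(HALFVEC_POISON_HIGH)
  const __m128i poison = _mm_set_epi32((int) 0xDEADBEEF, (int) 0xDEADBEEF, 0, 0);
  return _mm_or_si128(_mm_castpd_si128(_mm_set_sd(x)), poison);
#else
  return _mm_castpd_si128(_mm_set_sd(x));
#endif
}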
double for x64 (because of _mm_set_sd and already being an XMM register).
Why double instead of int64_t? I see the convenience of using _mm_set_sd, but is it really faster than any other call? It is marked Sequence with a warning, though that seems to be true for all the _mm_set_ functions.
Putting it in a double is extremely convenient because by default, doubles are already going to be in XMM registers.
All we need is for the compiler to eliminate the zero extension (which Clang and MSVC do; GCC doesn't, hence the hack) and all these _mm_set_sds disappear.
int64:
movq xmm0, rdi
movq xmm1, rsi
paddw xmm0, xmm1
movq rax, xmm0
ret
double:
paddw xmm0, xmm1
ret
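For context, those two snippets correspond roughly to C like the following (a sketch; the double version only collapses to a bare paddw once the compiler drops the redundant zero-extension and extraction):

#include <emmintrin.h>
#include <stdint.h>

// int64_t ABI: the halves arrive in rdi/rsi, so they have to be moved into
// XMM registers and the result moved back out (the extra movq instructions).
int64_t add_pi16_i64(int64_t a, int64_t b)
{
  __m128i r = _mm_add_epi16(_mm_cvtsi64_si128(a), _mm_cvtsi64_si128(b));
  return _mm_cvtsi128_si64(r);
}

// double ABI: the halves already arrive in xmm0/xmm1, so only the paddw
// remains.
double add_pi16_f64(double a, double b)
{
  __m128i r = _mm_add_epi16(_mm_castpd_si128(_mm_set_sd(a)),
                            _mm_castpd_si128(_mm_set_sd(b)));
  return _mm_cvtsd_f64(_mm_castsi128_pd(r));
}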
DOSBox Staging plans to use SIMDe, so please consider keeping MMX as an option.
DOSBox Staging plans to use SIMDe, so please consider keeping MMX as an option.
Hey @Torinde , that is great news! This issue is proposing that SIMDe not use MMX intrinsics in any of the implementations. It doesn't propose the removal of any MMX function definitions in SIMDe. So DOSBox Staging should be fine 🙂
Does SIMDe use x87 intrinsics?
Does SIMDe use x87 intrinsics?
Some optimized implementations do so, usually only when there is nothing better available.