
Remove use of native MMX intrinsics

Open easyaspi314 opened this issue 2 years ago • 25 comments

MMX is broken on modern compilers and should be avoided at all costs.

Reasons:

  • ⚠️ Clang and GCC DO NOT automatically switch between MMX and x87, resulting in undefined behavior and potential x87 stack overflows. MMX expects the programmer to manually keep x87 and MMX separate with _mm_empty() in between, and in a library like this the user shouldn't be expected to do that (see the sketch after this list).
  • GCC converts MMX to SSE2 on x64.
  • Clang does not know how to properly optimize MMX vectors and generates garbage, especially when it clashes with autovectorization.
  • GCC and Clang will never autovectorize MMX.
  • MMX is not supported by the Windows x64 calling convention; only context switch safety is guaranteed.
  • MMX has fewer registers than SSE2 on x64 (8 vs. 16).
  • When implementing NEON, half vectors are constantly converted back and forth, so those conversions need to be fast.
  • Operating systems that require MMX without requiring SSE2 are relatively uncommon.
  • SSE2 is always available on x64.
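
For illustration only (this is not SIMDe code), here is the kind of manual bookkeeping MMX forces on the programmer; without the _mm_empty() the subsequent x87 math operates on a register stack that is still in MMX mode:

#include <mmintrin.h>

/* Illustrative sketch for 32-bit x86, where doubles are typically evaluated on the x87 stack. */
double sum_and_scale(const __m64 *a, const __m64 *b, double scale) {
  __m64 v = _mm_add_pi16(*a, *b);  /* MMX: MM0-MM7 now alias the x87 register stack */
  int lo  = _mm_cvtsi64_si32(v);   /* extract the low 32 bits */
  _mm_empty();                     /* required before any x87 math, or the FP stack stays in MMX mode */
  return lo * scale;               /* x87 multiply on i386 */
}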

easyaspi314 avatar May 21 '23 19:05 easyaspi314

If anyone wants to try this out, define SIMDE_X86_MMX_NO_NATIVE and no native MMX calls will be made.
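
For example (the header path shown follows SIMDe's usual layout; adjust as needed):

/* Define before including any SIMDe header so that no native MMX calls are emitted. */
#define SIMDE_X86_MMX_NO_NATIVE
#include "simde/x86/mmx.h"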

mr-c avatar May 21 '23 20:05 mr-c

Operating systems that require MMX without requiring SSE2 are relatively uncommon.

FYI: Debian i386 (really i686) includes neither MMX nor SSE

https://wiki.debian.org/ArchitectureSpecificsMemo#i386-1

mr-c avatar May 21 '23 20:05 mr-c

So primarily looking at neon, what if we had

typedef struct {
   ...
} simde_uint8x8_t;
typedef struct {
   ...
#ifdef SIMDE_X86_SSE2_NATIVE
  __m128i m128i;
#endif
} simde_private_uint8x8_t;

simde_private_uint8x8_t simde_uint8x8_t_to_private(simde_uint8x8_t x) {
    simde_private_uint8x8_t r;
#ifdef SIMDE_X86_SSE2_NATIVE
    r.m128i = _mm_loadl_epi64((const __m128i *)&x); /* zero-extending 64-bit load */
#else
    ...
#endif
    return r;
}
simde_uint8x8_t simde_uint8x8_t_from_private(simde_private_uint8x8_t x) {
    simde_uint8x8_t r;
#ifdef SIMDE_X86_SSE2_NATIVE
    _mm_storel_epi64((__m128i *)&r, x.m128i); /* store only the low 64 bits */
#else
    ...
#endif
    return r;
}

As long as the compiler can eliminate those loads, we are golden. If not, some convincing will be needed. None of the __m64 intrinsics can be used on MSVC x64, so _mm_cvtpi64_epi64 is out of the question. Once that is done, everything can be implemented in clean SSE2. This should also be done for simde__m64.

easyaspi314 avatar May 23 '23 07:05 easyaspi314

Ah, I understand better now. You are proposing that on X86_SSE2_NATIVE, 128-bit intrinsics are used to implement smaller data structures in other APIs to avoid the use of MMX. That sounds reasonable, though it should probably be controlled by a flag for those who may not want that approach.

mr-c avatar May 23 '23 08:05 mr-c

Yes, but I also say no MMX allowed, period. It isn't something that is transparently handled by the compiler (unlike, say, vzeroupper), which makes it inappropriate for a library that attempts to transparently handle intrinsics written for another platform.

As a matter of fact, I think the mmintrin.h polyfill should use SSE2 always instead of the awkward MMX.

easyaspi314 avatar May 23 '23 12:05 easyaspi314

Example of MMX breaking things

Messing with the optimization levels can result in differing values, even if you put _mm_empty() after each intrinsic (which would force the vector to memory each time, killing performance). And while this is a particularly contrived scenario for hand-written MMX intrinsics, if it is hidden behind SIMDe intrinsics, it could easily corrupt the x87 floating point stack as a very subtle bug.

This goes in the category of a very rare but specific scenario that won't usually cause an issue until it does, at which point everything is set on fire while the programmer tries to figure out why their math library randomly stops working.

easyaspi314 avatar May 23 '23 17:05 easyaspi314

128-bit intrinsics are used to implement smaller data structures in other APIs to avoid the use of MMX.

This approach should actually be used in general, unless array indexing is involved. Add helpers to convert to and from a 128-bit vector with undefined upper 64 bits (e.g. splat, zero extend, or __builtin_shufflevector(x, x, 0, -1)), then call the 128-bit intrinsic. That way any complicated intrinsics only have to be implemented once.

simde_uint8x8_t
simde_vadd_u8(simde_uint8x8_t a, simde_uint8x8_t b)
{
  #if NEON
    return vadd_u8(a, b);
  #elif 128-bit vector
     simde_uint8x16_t a_ = simde_vcombine_u8_undef(a);
     simde_uint8x16_t b_ = simde_vcombine_u8_undef(b);
     simde_uint8x16_t r_ = simde_vaddq_u8(a_, b_);
     return simde_vget_low_u8(r_);
  #else
     // Loop
  #endif
}
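
For reference, a minimal sketch of the "combine with undefined upper half" helper using plain GCC/Clang vector extensions, where __builtin_shufflevector is available (the typedefs and the simde_vcombine_u8_undef name above are illustrative, not existing SIMDe API, and this ignores the 32-bit x86 ABI wrinkles discussed later in the thread):

#include <stdint.h>

typedef uint8_t u8x8  __attribute__((vector_size(8)));
typedef uint8_t u8x16 __attribute__((vector_size(16)));

/* Widen a 64-bit half vector to 128 bits; the -1 indices leave the upper lanes
 * unspecified, so the compiler is free to keep whatever is already in the register. */
static inline u8x16 combine_u8_undef(u8x8 x) {
  return __builtin_shufflevector(x, x,
                                 0, 1, 2, 3, 4, 5, 6, 7,
                                 -1, -1, -1, -1, -1, -1, -1, -1);
}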

NEON is the only instruction set that cheaply converts between half vectors and full vectors, so this should be a win as long as we can avoid excessive swapping.

easyaspi314 avatar May 23 '23 19:05 easyaspi314

Sure, please do an experiment and we can study the results.

mr-c avatar May 24 '23 07:05 mr-c

Some deep dives into codegen brought me to some interesting things:

  1. The ABI for __attribute__((vector_size(8)))/__m64 is inconsistent on 32-bit x86.

     Compiler      Features   vector_size(8)   __m64
     GCC           no MMX     stack            N/A
     GCC           MMX        MMX              MMX
     Clang/ICX     always     stack            stack
     ICC classic   no SSE     stack            error
     ICC classic   SSE        SSE and MMX      MMX
     MSVC          always     N/A              MMX

  2. GCC i386 half vectors are always MMX registers, but GCC doesn't know what to do with them, so it extracts them to general-purpose registers through the stack. It is truly awful. For example, GCC compiles this:
typedef int m64 __attribute__((vector_size(8)));

m64 foo(m64 a, m64 b)
{
    return a + b;
}

into this when MMX is enabled (regardless of whether SSE2 is available):

foo:
        push    edi
        push    esi
        push    ebx
        sub     esp, 40
        movq    QWORD PTR [esp], mm0
        mov     ecx, DWORD PTR [esp]
        mov     ebx, DWORD PTR [esp+4]
        movq    QWORD PTR [esp], mm1
        mov     eax, DWORD PTR [esp]
        mov     edx, DWORD PTR [esp+4]
        mov     esi, ecx
        mov     DWORD PTR [esp+20], ebx
        add     esi, eax
        mov     DWORD PTR [esp+8], eax
        mov     DWORD PTR [esp+12], edx
        mov     eax, DWORD PTR [esp+12]
        movd    mm0, esi
        add     eax, DWORD PTR [esp+20]
        mov     DWORD PTR [esp+16], ecx
        add     esp, 40
        pop     ebx
        movd    mm2, eax
        punpckldq       mm0, mm2
        pop     esi
        pop     edi
        ret

GCC and Clang on x64 use SSE2 for these, by the way.

If only we didn't have to keep sizeof == 8, but we can't break direct pointer access. Maybe we should suck it up and use int64_t or double. Or we could do something like #pragma GCC target("no-mmx"), but that likely causes its own problems.

easyaspi314 avatar May 25 '23 04:05 easyaspi314

NEON_2_SSE uses an array union. https://github.com/intel/ARM_NEON_2_x86_SSE/blob/master/NEON_2_SSE.h

I say that for best results we should use

  • int64_t for x86 (because double uses x87 and a GCC vector uses MMX). We can use _mm_set1_epi64x or _mm_loadl_epi64 on MSVC and a partial memcpy on GCC and Clang (see the sketch after this list).
  • double for x64 (because of _mm_set_sd and already being an xmm register).
  • struct for everything else.
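
For illustration only (the function name is made up, not SIMDe API), the int64_t-backed conversion might look something like this, modulo the load-elimination caveat discussed above:

#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch: widen an int64_t-backed half vector to __m128i without touching MMX. */
static inline __m128i halfvec_i64_to_m128i(int64_t x) {
#if defined(_MSC_VER)
  return _mm_loadl_epi64((const __m128i *)&x);  /* zero-extending 64-bit load */
#else
  __m128i r = _mm_setzero_si128();
  memcpy(&r, &x, sizeof(x));  /* "partial memcpy"; GCC and Clang usually fold this to a movq */
  return r;
#endif
}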

However, I think we should have an explicit option to allow vector sizes to be larger, only promising the memory layout with load/store intrinsics, e.g. SIMDE_ALLOW_OVERSIZED_VECTOR_TYPES.

This means that, e.g., vcombine(x, y) can be _mm_unpacklo_epi64(x, y) and vget_high(x) can be _mm_shuffle_epi32(x, _MM_SHUFFLE(0, 0, 3, 2)) or _mm_bsrli_si128(x, 8), as sketched below.
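
A rough sketch of those two mappings, assuming the oversized representation (a half vector stored in an __m128i whose upper 64 bits are undefined; the names here are illustrative, not SIMDe API):

#include <emmintrin.h>

typedef __m128i halfvec128;  /* illustrative: half vector with undefined upper 64 bits */

static inline __m128i oversized_vcombine(halfvec128 lo, halfvec128 hi) {
  return _mm_unpacklo_epi64(lo, hi);                     /* low qword of lo, then low qword of hi */
}

static inline halfvec128 oversized_vget_high(__m128i v) {
  return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 0, 3, 2));  /* move the high qword into the low half */
}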

easyaspi314 avatar May 25 '23 17:05 easyaspi314

Ok idea for a roadmap:

High priority:

  1. Remove MMX intrinsics
  2. Change half vector type to avoid dangerous+slow MMX vectors

Low priority:

  1. Convert most half NEON intrinsics to vget_low(intrinsicq(vcombine(x, 0))) on platforms with native vector types. This also has the benefit of reducing the complexity.
  2. Research "quick combines"
  3. Implement oversized vectors?

easyaspi314 avatar May 26 '23 14:05 easyaspi314

I have the minimum cost half vector ABI and conversion figured out. It unfortunately needs a hack on GCC to avoid excess movqs though.

x86 ABI

typedef struct {
    double f64;
} halfvec;

Reasoning:

  • While double would be good if we could use vectorcall, it isn't universally available.
  • Passed on stack like normal arguments
    • No inconsistent use of MMX regs
    • Most stable option for cdecl
  • Most efficient return without SSE4.1
    • int64_t returns in eax:edx which is slow without pextrd
    • double returns in st(0) requiring two memory accesses since there is no direct X87<->SSE transfer

x64 ABI

typedef double halfvec;

  • Passed and returned in XMM registers
    • While struct { double } is passed in registers on the SysV ABI, the Windows x64 ABI does not pass or return it in an XMM register
  • Literally zero cost

Clang, MSVC, portable

Using _mm_set_sd and _mm_cvtsd_f64 works flawlessly on these compilers. It is also portable.

__m128i halfvec_to_m128i(halfvec x)
{
    return _mm_castpd_si128(_mm_set_sd(x.f64)); // or just x on x64
}
halfvec m128i_to_halfvec(__m128i x)
{
    halfvec r;
    r.f64 = _mm_cvtsd_f64(_mm_castsi128_pd(x)); // or just r on x64
    return r;
}

GCC hack

GCC insists on a movq between conversions. This isn't terrible, as it is a fast instruction, but it still wastes code size.

Things I have tried:

  • _mm_set_sd
  • _mm_loadl_epi64
  • _mm_load_sd
  • memcpy
  • Directly assigning to vector elements
  • _mm_cvtsi64_si128 (not available on 32-bit)
  • __builtin_shufflevector

The only remaining option is a register reinterpret with an empty inline assembly statement, which works flawlessly.

__m128i halfvec_to_m128i(halfvec x)
{
    __m128i r;
    // register reinterpret
    __asm__("" : "=x" (r) : "0" (x.f64)); // or just x
    return r;
}
halfvec m128i_to_halfvec(__m128i x)
{
    halfvec r;
    // register reinterpret
    __asm__("" : "=x" (r.f64) : "0" (x)); // or just r
    return r;
}

Explanation of the ASM:

__asm__(
    ""        // empty, no explicit instructions
    :         // Outputs:
      "=x"    //   Assign to SSE vector..
      (r)     //   ...in the variable r
    :         // Inputs:
      "0"     //   In the same constraint as operand 0...
      (x.f64) //   ...put x.f64
);

The other architectures should use a normal GCC vector_size(8) type and __builtin_shufflevector(x, x, 0, 1, -1, -1) where it is available.

Also it might be good to have a testing mode that DEADBEEFs the upper bits.
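
A hedged sketch of what such a testing mode could look like on SSE2 (the helper name and the OR-in-a-poison-pattern approach are purely illustrative, not SIMDe's):

#include <emmintrin.h>

/* Testing aid: keep the meaningful low 64 bits and force a recognizable junk pattern above,
 * so code that accidentally relies on the "undefined" upper half fails loudly. */
static inline __m128i poison_upper_qword(__m128i v) {
  const __m128i poison = _mm_set_epi32((int)0xDEADBEEF, (int)0xDEADBEEF, 0, 0);
  return _mm_or_si128(_mm_move_epi64(v), poison);  /* _mm_move_epi64 zeroes the upper 64 bits */
}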

easyaspi314 avatar May 26 '23 21:05 easyaspi314

  • double for x64 (because of _mm_set_sd and already being an xmm register).

Why double instead of int64_t? I see the convenience of using _mm_set_sd, but is it really faster than any other call? It is marked Sequence with a warning, though that seems to be true for all the _mm_set_ functions.

mr-c avatar May 27 '23 10:05 mr-c

Putting it in a double is extremely convenient because by default, doubles are already going to be in XMM registers.

All we need is for the compiler to eliminate the zero extension (which Clang and MSVC do, but GCC doesn't, hence the hack) and all these _mm_set_sds disappear.

int64:
    movq    xmm0, rdi
    movq    xmm1, rsi
    paddw   xmm0, xmm1
    movq    rax, xmm0
    ret
double:
    paddw   xmm0, xmm1
    ret

easyaspi314 avatar May 27 '23 14:05 easyaspi314

DOSBox Staging plans to use SIMDe, so please consider keeping MMX as an option.

Torinde avatar Feb 25 '24 18:02 Torinde

DOSBox Staging plans to use SIMDe, so please consider keeping MMX as an option.

Hey @Torinde, that is great news! This issue proposes that SIMDe not use MMX intrinsics in any of its implementations. It doesn't propose the removal of any MMX function definitions in SIMDe. So DOSBox Staging should be fine 🙂

mr-c avatar Feb 25 '24 20:02 mr-c

Does SIMDe use x87 intrinsics?

Torinde avatar Apr 13 '24 17:04 Torinde

Does SIMDe use x87 intrinsics?

Some optimized implementations do so, usually only when there is nothing better available.

mr-c avatar Apr 20 '24 18:04 mr-c