sse2neon MSVC support

MSVC support will be nice.

Also, is it worth the effort, if I make a pull request with some changes that will improve the compatibility with MSVC? (I know almost nothing about ARM, but I can try to fix some macros, etc, maybe small things, if no one mind about the idea)

Apr 13 '21 17:04 Xottab-DUTY

MSVC support will be nice. Also, is it worth the effort, if I make a pull request with some changes that will improve the compatibility with MSVC? (I know almost nothing about ARM, but I can try to fix some macros, etc, maybe small things, if no one mind about the idea)

Welcome! Can you enable GitHub Actions integration with Visual Studio? See https://devblogs.microsoft.com/visualstudio/whats-new-with-github-actions-tooling-in-visual-studio/

Apr 13 '21 17:04 jserv

When I try to compile sse2neon for ARM Windows using MSVC, it triggers the #error "Macro name collisions may happen with unsupported compiler." line in the below code, which fails the build:

https://github.com/DLTcollab/sse2neon/blob/master/sse2neon.h#L83

#if defined(__GNUC__) || defined(__clang__)
#pragma push_macro("FORCE_INLINE")
#pragma push_macro("ALIGN_STRUCT")
#define FORCE_INLINE static inline __attribute__((always_inline))
#define ALIGN_STRUCT(x) __attribute__((aligned(x)))
#ifndef likely
#define likely(x) __builtin_expect(!!(x), 1)
#endif
#ifndef unlikely
#define unlikely(x) __builtin_expect(!!(x), 0)
#endif
#else
#error "Macro name collisions may happen with unsupported compiler."
#ifdef FORCE_INLINE
#undef FORCE_INLINE
#endif
#define FORCE_INLINE static inline
#ifndef ALIGN_STRUCT
#define ALIGN_STRUCT(x) __declspec(align(x))
#endif
#endif
#ifndef likely
#define likely(x) (x)
#endif
#ifndef unlikely
#define unlikely(x) (x)
#endif

To fix this, sse2neon should add a section that starts with #elif defined(_MSC_VER) to check for MSVC and have code that works for MSVC. When I try to comment out the #error line it does not work, I get hundreds of other errors instead.

Dec 08 '21 19:12 aaronfranke

Line 83 is indeed #warning rather than #error. At present, SSE2NEON is built with clang and gcc. @aaronfranke, Can you remove the part from Line 105 to Line 120 and build with MSVC?

Dec 09 '21 02:12 jserv

@jserv When I comment out the error line and lines 105 to 120, I get hundreds of errors. Here are some of the first ones:

godot\thirdparty\embree\common\simd\arm\sse2neon.h(465): error C3861: '__builtin_prefetch': identifier not found
godot\thirdparty\embree\common\simd\arm\sse2neon.h(475): error C2065: '__asm__': undeclared identifier
godot\thirdparty\embree\common\simd\arm\sse2neon.h(475): error C2146: syntax error: missing ';' before identifier '__volatile__'
godot\thirdparty\embree\common\simd\arm\sse2neon.h(475): error C3861: '__volatile__': identifier not found
godot\thirdparty\embree\common\simd\arm\sse2neon.h(826): error C2440: 'type cast': cannot convert from '__m64' to 'int64_t'
godot\thirdparty\embree\common\simd\arm\sse2neon.h(826): note: No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
godot\thirdparty\embree\common\simd\arm\sse2neon.h(858): error C2440: 'type cast': cannot convert from '__m64' to 'int64_t'
godot\thirdparty\embree\common\simd\arm\sse2neon.h(858): note: No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
godot\thirdparty\embree\common\simd\arm\sse2neon.h(858): error C2440: 'type cast': cannot convert from '__m64' to 'int64_t'
godot\thirdparty\embree\common\simd\arm\sse2neon.h(858): note: No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
godot\thirdparty\embree\common\simd\arm\sse2neon.h(858): error C2660: '_mm_set_epi64x': function does not take 1 arguments
godot\thirdparty\embree\common\simd\arm\sse2neon.h(847): note: see declaration of '_mm_set_epi64x'

Dec 09 '21 04:12 aaronfranke

The biggest challenge so far is the result of this commit: 8c24b9458989e87776532ab209342723e2d568ed

MSVC does not support statement expressions, as they are a GNU C extension. Because intrinsics that take immediates require them to be constant expressions, we need to come up with some adequate substitute. So far, I can see only two options:

function templates that take the immediate as a template parameter (i.e. how the old implementation worked)
lambda functions that do not explicitly capture the immediate

Of course, both of these rely on C++. Is it acceptable to require compiling as C++ when using MSVC?

Oct 22 '22 12:10 invertego

I can see only two options:

function templates that take the immediate as a template parameter (i.e. how the old implementation worked)

lambda functions that do not explicitly capture the immediate

Of course, both of these rely on C++. Is it acceptable to require compiling as C++ when using MSVC?

Let's try to introduce some wrapper based on C++ function templates for MSCV support. @invertego, can you show some proposed changes?

Oct 22 '22 14:10 jserv

Here's _mm_shuffle_ps_default as an example. I have included both template and lambda implementations for demonstration purposes, but we only need to pick one. Assume SSE2NEON_RETURN is defined as return or nothing depending on whether statement expressions are in use.

#define _mm_shuffle_ps_default_body(a, b, imm)                             \
        float32x4_t ret;                                                   \
        ret = vmovq_n_f32(                                                 \
            vgetq_lane_f32(vreinterpretq_f32_m128(a), (imm) & (0x3)));     \
        ret = vsetq_lane_f32(                                              \
            vgetq_lane_f32(vreinterpretq_f32_m128(a), ((imm) >> 2) & 0x3), \
            ret, 1);                                                       \
        ret = vsetq_lane_f32(                                              \
            vgetq_lane_f32(vreinterpretq_f32_m128(b), ((imm) >> 4) & 0x3), \
            ret, 2);                                                       \
        ret = vsetq_lane_f32(                                              \
            vgetq_lane_f32(vreinterpretq_f32_m128(b), ((imm) >> 6) & 0x3), \
            ret, 3);                                                       \
        SSE2NEON_RETURN vreinterpretq_m128_f32(ret);

#if defined(SSE2NEON_TEMPLATE_IMPL)

template<int imm>
FORCE_INLINE __m128 _mm_shuffle_ps_default_impl(__m128 a, __m128 b)
{
    _mm_shuffle_ps_default_body(a, b, imm);
}
#define _mm_shuffle_ps_default(a, b, imm) _mm_shuffle_ps_default_impl<imm>(a, b)

#elif defined(SSE2NEON_LAMBDA_IMPL)

#define _mm_shuffle_ps_default(a, b, imm) \
    [](__m128 _a, __m128 _b) { _mm_shuffle_ps_default_body(_a, _b, imm) }(a, b)

#elif defined(SSE2NEON_STATEMENT_EXPRESSION_IMPL)

#define _mm_shuffle_ps_default(a, b, imm) ({ _mm_shuffle_ps_default_body(a, b, imm) })

#endif

Oct 22 '22 22:10 invertego

Here's _mm_shuffle_ps_default as an example. I have included both template and lambda implementations for demonstration purposes, but we only need to pick one. Assume SSE2NEON_RETURN is defined as return or nothing depending on whether statement expressions are in use.

@AymenQ, Can you offer any recommendations for the aforementioned template and lambda-based implementations?

Oct 23 '22 04:10 jserv

Hi, sure. I am not at all familiar with MSVC, but I can give my thoughts:

I think the lambda-based approach is probably least disruptive and easiest to introduce. You can capture the macro arguments in the lambda via a capture-default instead, then you can maybe do something like this:

#if defined(__GNUC__) || defined(__clang__)
#define _sse2neon_define(body) __extension__({body})
#define _sse2neon_return(ret) (ret)
#else
#define _sse2neon_define(body) [=](){body}()
#define _sse2neon_return(ret) return ret
#endif

#define _mm_srli_si128(a, imm)                                       \
    _sse2neon_define(                                                \
        int8x16_t ret;                                               \
        if (_sse2neon_unlikely((imm) & ~15))                         \
            ret = vdupq_n_s8(0);                                     \
        else                                                         \
            ret = vextq_s8(vreinterpretq_s8_m128i(a), vdupq_n_s8(0), \
                           (imm > 15 ? 0 : imm));                    \
       _sse2neon_return(vreinterpretq_m128i_s8(ret));                \
    )

i.e. replace __extension__({body}) with _sse2neon_define(body) everywhere and introduce _sse2neon_return as suggested, or even possibly supply the return value as a second macro argument instead of using _sse2neon_return.

Compiles to this: https://godbolt.org/z/vjvzc4rsW

There are these unnecessary (given that they are always inlined) lambda definitions that are not eliminated at compile time, even at /O2. These get eliminated at link time anyway but it's perhaps worth considering the template-based option if this is an issue.

As for inlining itself, it could be worth also adding [[msvc::forceinline]] to the lambda, though I'm not sure whether MSVC actually respects this (latest MSVC warns that forceinline requires /std:c++20) and it probably isn't necessary anyway.

Oct 24 '22 18:10 AymenQ

I implemented something very similar to begin with. Sadly, default captures do not work if the immediate expression includes variables, which is too restrictive. Try modifying your example above to see what I mean:

__m128i foo(__m128i a) {
    const int i = 11;
    return _mm_srli_si128(a, i);
}

To make lambdas work, any variables used in the immediate expression cannot be captured.

Of course, it's still possible to use macros to keep things tidy. Here is an example that conditionally supports templates, lambdas, and statement expressions at compile time.

Helpers:

#if defined(SSE2NEON_TEMPLATE_IMPL)
#define SSE2NEON_CALL_IMPL2(type, name, a, b, imm) name##_impl<imm>(a, b)
#define SSE2NEON_DECL_IMPL2(type, name) \
    template<int imm> \
    FORCE_INLINE type name##_impl(type a, type b) { name##_body(a, b, imm) }
#elif defined(SSE2NEON_LAMBDA_IMPL)
#define SSE2NEON_CALL_IMPL2(type, name, a, b, imm) [](type a_, type b_) { name##_body(a_, b_, imm) }(a, b)
#define SSE2NEON_DECL_IMPL2(type, name)
#elif defined(SSE2NEON_STATEMENT_EXPRESSION_IMPL)
#define SSE2NEON_CALL_IMPL2(type, name, a, b, imm) __extension__({ name##_body(a, b, imm) })
#define SSE2NEON_DECL_IMPL2(type, name)
#endif

Usage:

#define _mm_insert_ps_body(a, b, imm8)                                         \
        float32x4_t tmp1 =                                                     \
            vsetq_lane_f32(vgetq_lane_f32(b, (imm8 >> 6) & 0x3),               \
                           vreinterpretq_f32_m128(a), 0);                      \
        float32x4_t tmp2 =                                                     \
            vsetq_lane_f32(vgetq_lane_f32(tmp1, 0), vreinterpretq_f32_m128(a), \
                           ((imm8 >> 4) & 0x3));                               \
        const uint32_t data[4] = {((imm8) & (1 << 0)) ? UINT32_MAX : 0,        \
                                  ((imm8) & (1 << 1)) ? UINT32_MAX : 0,        \
                                  ((imm8) & (1 << 2)) ? UINT32_MAX : 0,        \
                                  ((imm8) & (1 << 3)) ? UINT32_MAX : 0};       \
        uint32x4_t mask = vld1q_u32(data);                                     \
        float32x4_t all_zeros = vdupq_n_f32(0);                                \
                                                                               \
        SSE2NEON_RETURN vreinterpretq_m128_f32(                                \
            vbslq_f32(mask, all_zeros, vreinterpretq_f32_m128(tmp2)));

#define _mm_insert_ps(a, b, imm) \
    SSE2NEON_CALL_IMPL2(__m128, _mm_insert_ps, a, b, imm)

SSE2NEON_DECL_IMPL2(__m128, _mm_insert_ps)

Oct 24 '22 19:10 invertego

Notably, the DECL macro above is only required by the template implementation. It is possible to get rid of the CALL macro too with the lambda implementation, but it requires the statement expression implementation to declare some additional variables. Also, passing the body of the function as a macro argument requires that you be extra careful with commas in certain places (hence the _sse2neon_init macro below).

I'm not sure these tradeoffs are worth it just to get rid of an extra #define per function.

#if defined(__GNUC__) || defined(__clang__)
#define _sse2neon_define2(type, a, b, body) __extension__({type _a = (a), _b = (b); body})
#define _sse2neon_return(ret) (ret)
#else
#define _sse2neon_define2(type, a, b, body) [](type _a, type _b){body}((a), (b))
#define _sse2neon_return(ret) return ret
#endif

#define _sse2neon_init(...) { __VA_ARGS__ }

#define _mm_insert_ps(a, b, imm8)                                              \
    _sse2neon_define2(__m128, a, b,                                            \
        float32x4_t tmp1 =                                                     \
            vsetq_lane_f32(vgetq_lane_f32(_b, (imm8 >> 6) & 0x3),              \
                           vreinterpretq_f32_m128(_a), 0);                     \
        float32x4_t tmp2 =                                                     \
            vsetq_lane_f32(vgetq_lane_f32(tmp1, 0), vreinterpretq_f32_m128(_a),\
                           ((imm8 >> 4) & 0x3));                               \
        const uint32_t data[4] = _sse2neon_init(                               \
                                   ((imm8) & (1 << 0)) ? UINT32_MAX : 0,       \
                                   ((imm8) & (1 << 1)) ? UINT32_MAX : 0,       \
                                   ((imm8) & (1 << 2)) ? UINT32_MAX : 0,       \
                                   ((imm8) & (1 << 3)) ? UINT32_MAX : 0);      \
        uint32x4_t mask = vld1q_u32(data);                                     \
        float32x4_t all_zeros = vdupq_n_f32(0);                                \
                                                                               \
        _sse2neon_return(vreinterpretq_m128_f32(                               \
            vbslq_f32(mask, all_zeros, vreinterpretq_f32_m128(tmp2))));        \
    )

Oct 25 '22 08:10 invertego

Notably, the DECL macro above is only required by the template implementation. It is possible to get rid of the CALL macro too with the lambda implementation, but it requires the statement expression implementation to declare some additional variables. Also, passing the body of the function as a macro argument requires that you be extra careful with commas in certain places (hence the _sse2neon_init macro below). I'm not sure these tradeoffs are worth it just to get rid of an extra #define per function.

For MSVC integration, can we make use of recursive macros (such as SSE2NEON_EVAL) to generalize the form of macro _sse2neon_define? Therefore, we don't have to specify the number of arguments which are expected to be passed to these function-like macros. I think the proposed _sse2neon_define2 is pretty well at first glance, and my only concern is the suffix 2.

Oct 30 '22 14:10 jserv

Hi All,

I am looking into blender for MSVC/WoA targets, and it uses sse2neon under the hood in some places, which would give reasonable performence gains - was there any movement on this?

Jan 10 '23 13:01 anthony-linaro

@anthony-linaro I implemented what I described above in this branch: https://github.com/invertego/sse2neon/tree/msvc

It needs a rebase on master and currently lacks support for the pcmpstr intrinsics. They are implemented using a bunch of preprocessor black magic that doesn't work with MSVC.

I'm unlikely to revisit this soon, so if you or someone else wants to run with it, feel free.

Jan 11 '23 04:01 invertego

For those with a vested interest in this thread, I have made a pull request here: https://github.com/DLTcollab/sse2neon/pull/596

May 17 '23 14:05 anthony-linaro

Yaaaaay!

May 23 '23 13:05 Xottab-DUTY

sse2neon sse2neon copied to clipboard

MSVC support

sse2neon
sse2neon copied to clipboard