sse2neon
sse2neon copied to clipboard
MSVC support
MSVC support will be nice.
Also, is it worth the effort, if I make a pull request with some changes that will improve the compatibility with MSVC? (I know almost nothing about ARM, but I can try to fix some macros, etc, maybe small things, if no one mind about the idea)
MSVC support will be nice. Also, is it worth the effort, if I make a pull request with some changes that will improve the compatibility with MSVC? (I know almost nothing about ARM, but I can try to fix some macros, etc, maybe small things, if no one mind about the idea)
Welcome! Can you enable GitHub Actions integration with Visual Studio? See https://devblogs.microsoft.com/visualstudio/whats-new-with-github-actions-tooling-in-visual-studio/
When I try to compile sse2neon for ARM Windows using MSVC, it triggers the #error "Macro name collisions may happen with unsupported compiler."
line in the below code, which fails the build:
https://github.com/DLTcollab/sse2neon/blob/master/sse2neon.h#L83
#if defined(__GNUC__) || defined(__clang__)
#pragma push_macro("FORCE_INLINE")
#pragma push_macro("ALIGN_STRUCT")
#define FORCE_INLINE static inline __attribute__((always_inline))
#define ALIGN_STRUCT(x) __attribute__((aligned(x)))
#ifndef likely
#define likely(x) __builtin_expect(!!(x), 1)
#endif
#ifndef unlikely
#define unlikely(x) __builtin_expect(!!(x), 0)
#endif
#else
#error "Macro name collisions may happen with unsupported compiler."
#ifdef FORCE_INLINE
#undef FORCE_INLINE
#endif
#define FORCE_INLINE static inline
#ifndef ALIGN_STRUCT
#define ALIGN_STRUCT(x) __declspec(align(x))
#endif
#endif
#ifndef likely
#define likely(x) (x)
#endif
#ifndef unlikely
#define unlikely(x) (x)
#endif
To fix this, sse2neon should add a section that starts with #elif defined(_MSC_VER)
to check for MSVC and have code that works for MSVC. When I try to comment out the #error
line it does not work, I get hundreds of other errors instead.
Line 83 is indeed #warning
rather than #error
. At present, SSE2NEON is built with clang and gcc.
@aaronfranke, Can you remove the part from Line 105 to Line 120 and build with MSVC?
@jserv When I comment out the error line and lines 105 to 120, I get hundreds of errors. Here are some of the first ones:
godot\thirdparty\embree\common\simd\arm\sse2neon.h(465): error C3861: '__builtin_prefetch': identifier not found
godot\thirdparty\embree\common\simd\arm\sse2neon.h(475): error C2065: '__asm__': undeclared identifier
godot\thirdparty\embree\common\simd\arm\sse2neon.h(475): error C2146: syntax error: missing ';' before identifier '__volatile__'
godot\thirdparty\embree\common\simd\arm\sse2neon.h(475): error C3861: '__volatile__': identifier not found
godot\thirdparty\embree\common\simd\arm\sse2neon.h(826): error C2440: 'type cast': cannot convert from '__m64' to 'int64_t'
godot\thirdparty\embree\common\simd\arm\sse2neon.h(826): note: No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
godot\thirdparty\embree\common\simd\arm\sse2neon.h(858): error C2440: 'type cast': cannot convert from '__m64' to 'int64_t'
godot\thirdparty\embree\common\simd\arm\sse2neon.h(858): note: No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
godot\thirdparty\embree\common\simd\arm\sse2neon.h(858): error C2440: 'type cast': cannot convert from '__m64' to 'int64_t'
godot\thirdparty\embree\common\simd\arm\sse2neon.h(858): note: No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
godot\thirdparty\embree\common\simd\arm\sse2neon.h(858): error C2660: '_mm_set_epi64x': function does not take 1 arguments
godot\thirdparty\embree\common\simd\arm\sse2neon.h(847): note: see declaration of '_mm_set_epi64x'
The biggest challenge so far is the result of this commit: 8c24b9458989e87776532ab209342723e2d568ed
MSVC does not support statement expressions, as they are a GNU C extension. Because intrinsics that take immediates require them to be constant expressions, we need to come up with some adequate substitute. So far, I can see only two options:
- function templates that take the immediate as a template parameter (i.e. how the old implementation worked)
- lambda functions that do not explicitly capture the immediate
Of course, both of these rely on C++. Is it acceptable to require compiling as C++ when using MSVC?
I can see only two options:
- function templates that take the immediate as a template parameter (i.e. how the old implementation worked)
- lambda functions that do not explicitly capture the immediate
Of course, both of these rely on C++. Is it acceptable to require compiling as C++ when using MSVC?
Let's try to introduce some wrapper based on C++ function templates for MSCV support. @invertego, can you show some proposed changes?
Here's _mm_shuffle_ps_default
as an example. I have included both template and lambda implementations for demonstration purposes, but we only need to pick one. Assume SSE2NEON_RETURN
is defined as return
or nothing depending on whether statement expressions are in use.
#define _mm_shuffle_ps_default_body(a, b, imm) \
float32x4_t ret; \
ret = vmovq_n_f32( \
vgetq_lane_f32(vreinterpretq_f32_m128(a), (imm) & (0x3))); \
ret = vsetq_lane_f32( \
vgetq_lane_f32(vreinterpretq_f32_m128(a), ((imm) >> 2) & 0x3), \
ret, 1); \
ret = vsetq_lane_f32( \
vgetq_lane_f32(vreinterpretq_f32_m128(b), ((imm) >> 4) & 0x3), \
ret, 2); \
ret = vsetq_lane_f32( \
vgetq_lane_f32(vreinterpretq_f32_m128(b), ((imm) >> 6) & 0x3), \
ret, 3); \
SSE2NEON_RETURN vreinterpretq_m128_f32(ret);
#if defined(SSE2NEON_TEMPLATE_IMPL)
template<int imm>
FORCE_INLINE __m128 _mm_shuffle_ps_default_impl(__m128 a, __m128 b)
{
_mm_shuffle_ps_default_body(a, b, imm);
}
#define _mm_shuffle_ps_default(a, b, imm) _mm_shuffle_ps_default_impl<imm>(a, b)
#elif defined(SSE2NEON_LAMBDA_IMPL)
#define _mm_shuffle_ps_default(a, b, imm) \
[](__m128 _a, __m128 _b) { _mm_shuffle_ps_default_body(_a, _b, imm) }(a, b)
#elif defined(SSE2NEON_STATEMENT_EXPRESSION_IMPL)
#define _mm_shuffle_ps_default(a, b, imm) ({ _mm_shuffle_ps_default_body(a, b, imm) })
#endif
Here's
_mm_shuffle_ps_default
as an example. I have included both template and lambda implementations for demonstration purposes, but we only need to pick one. AssumeSSE2NEON_RETURN
is defined asreturn
or nothing depending on whether statement expressions are in use.
@AymenQ, Can you offer any recommendations for the aforementioned template and lambda-based implementations?
Hi, sure. I am not at all familiar with MSVC, but I can give my thoughts:
I think the lambda-based approach is probably least disruptive and easiest to introduce. You can capture the macro arguments in the lambda via a capture-default instead, then you can maybe do something like this:
#if defined(__GNUC__) || defined(__clang__)
#define _sse2neon_define(body) __extension__({body})
#define _sse2neon_return(ret) (ret)
#else
#define _sse2neon_define(body) [=](){body}()
#define _sse2neon_return(ret) return ret
#endif
#define _mm_srli_si128(a, imm) \
_sse2neon_define( \
int8x16_t ret; \
if (_sse2neon_unlikely((imm) & ~15)) \
ret = vdupq_n_s8(0); \
else \
ret = vextq_s8(vreinterpretq_s8_m128i(a), vdupq_n_s8(0), \
(imm > 15 ? 0 : imm)); \
_sse2neon_return(vreinterpretq_m128i_s8(ret)); \
)
i.e. replace __extension__({body})
with _sse2neon_define(body)
everywhere and introduce
_sse2neon_return
as suggested, or even possibly supply the return value as a second
macro argument instead of using _sse2neon_return
.
Compiles to this: https://godbolt.org/z/vjvzc4rsW
There are these unnecessary (given that they are always inlined) lambda definitions that are not eliminated at compile time, even at /O2. These get eliminated at link time anyway but it's perhaps worth considering the template-based option if this is an issue.
As for inlining itself, it could be worth also adding [[msvc::forceinline]]
to the
lambda, though I'm not sure whether MSVC actually respects this (latest MSVC
warns that forceinline
requires /std:c++20
) and it probably isn't necessary
anyway.
I implemented something very similar to begin with. Sadly, default captures do not work if the immediate expression includes variables, which is too restrictive. Try modifying your example above to see what I mean:
__m128i foo(__m128i a) {
const int i = 11;
return _mm_srli_si128(a, i);
}
To make lambdas work, any variables used in the immediate expression cannot be captured.
Of course, it's still possible to use macros to keep things tidy. Here is an example that conditionally supports templates, lambdas, and statement expressions at compile time.
Helpers:
#if defined(SSE2NEON_TEMPLATE_IMPL)
#define SSE2NEON_CALL_IMPL2(type, name, a, b, imm) name##_impl<imm>(a, b)
#define SSE2NEON_DECL_IMPL2(type, name) \
template<int imm> \
FORCE_INLINE type name##_impl(type a, type b) { name##_body(a, b, imm) }
#elif defined(SSE2NEON_LAMBDA_IMPL)
#define SSE2NEON_CALL_IMPL2(type, name, a, b, imm) [](type a_, type b_) { name##_body(a_, b_, imm) }(a, b)
#define SSE2NEON_DECL_IMPL2(type, name)
#elif defined(SSE2NEON_STATEMENT_EXPRESSION_IMPL)
#define SSE2NEON_CALL_IMPL2(type, name, a, b, imm) __extension__({ name##_body(a, b, imm) })
#define SSE2NEON_DECL_IMPL2(type, name)
#endif
Usage:
#define _mm_insert_ps_body(a, b, imm8) \
float32x4_t tmp1 = \
vsetq_lane_f32(vgetq_lane_f32(b, (imm8 >> 6) & 0x3), \
vreinterpretq_f32_m128(a), 0); \
float32x4_t tmp2 = \
vsetq_lane_f32(vgetq_lane_f32(tmp1, 0), vreinterpretq_f32_m128(a), \
((imm8 >> 4) & 0x3)); \
const uint32_t data[4] = {((imm8) & (1 << 0)) ? UINT32_MAX : 0, \
((imm8) & (1 << 1)) ? UINT32_MAX : 0, \
((imm8) & (1 << 2)) ? UINT32_MAX : 0, \
((imm8) & (1 << 3)) ? UINT32_MAX : 0}; \
uint32x4_t mask = vld1q_u32(data); \
float32x4_t all_zeros = vdupq_n_f32(0); \
\
SSE2NEON_RETURN vreinterpretq_m128_f32( \
vbslq_f32(mask, all_zeros, vreinterpretq_f32_m128(tmp2)));
#define _mm_insert_ps(a, b, imm) \
SSE2NEON_CALL_IMPL2(__m128, _mm_insert_ps, a, b, imm)
SSE2NEON_DECL_IMPL2(__m128, _mm_insert_ps)
Notably, the DECL macro above is only required by the template implementation. It is possible to get rid of the CALL macro too with the lambda implementation, but it requires the statement expression implementation to declare some additional variables. Also, passing the body of the function as a macro argument requires that you be extra careful with commas in certain places (hence the _sse2neon_init macro below).
I'm not sure these tradeoffs are worth it just to get rid of an extra #define per function.
#if defined(__GNUC__) || defined(__clang__)
#define _sse2neon_define2(type, a, b, body) __extension__({type _a = (a), _b = (b); body})
#define _sse2neon_return(ret) (ret)
#else
#define _sse2neon_define2(type, a, b, body) [](type _a, type _b){body}((a), (b))
#define _sse2neon_return(ret) return ret
#endif
#define _sse2neon_init(...) { __VA_ARGS__ }
#define _mm_insert_ps(a, b, imm8) \
_sse2neon_define2(__m128, a, b, \
float32x4_t tmp1 = \
vsetq_lane_f32(vgetq_lane_f32(_b, (imm8 >> 6) & 0x3), \
vreinterpretq_f32_m128(_a), 0); \
float32x4_t tmp2 = \
vsetq_lane_f32(vgetq_lane_f32(tmp1, 0), vreinterpretq_f32_m128(_a),\
((imm8 >> 4) & 0x3)); \
const uint32_t data[4] = _sse2neon_init( \
((imm8) & (1 << 0)) ? UINT32_MAX : 0, \
((imm8) & (1 << 1)) ? UINT32_MAX : 0, \
((imm8) & (1 << 2)) ? UINT32_MAX : 0, \
((imm8) & (1 << 3)) ? UINT32_MAX : 0); \
uint32x4_t mask = vld1q_u32(data); \
float32x4_t all_zeros = vdupq_n_f32(0); \
\
_sse2neon_return(vreinterpretq_m128_f32( \
vbslq_f32(mask, all_zeros, vreinterpretq_f32_m128(tmp2)))); \
)
Notably, the DECL macro above is only required by the template implementation. It is possible to get rid of the CALL macro too with the lambda implementation, but it requires the statement expression implementation to declare some additional variables. Also, passing the body of the function as a macro argument requires that you be extra careful with commas in certain places (hence the _sse2neon_init macro below). I'm not sure these tradeoffs are worth it just to get rid of an extra #define per function.
For MSVC integration, can we make use of recursive macros (such as SSE2NEON_EVAL
) to generalize the form of macro _sse2neon_define
? Therefore, we don't have to specify the number of arguments which are expected to be passed to these function-like macros. I think the proposed _sse2neon_define2
is pretty well at first glance, and my only concern is the suffix 2
.
Hi All,
I am looking into blender for MSVC/WoA targets, and it uses sse2neon under the hood in some places, which would give reasonable performence gains - was there any movement on this?
@anthony-linaro I implemented what I described above in this branch: https://github.com/invertego/sse2neon/tree/msvc
It needs a rebase on master and currently lacks support for the pcmpstr intrinsics. They are implemented using a bunch of preprocessor black magic that doesn't work with MSVC.
I'm unlikely to revisit this soon, so if you or someone else wants to run with it, feel free.
For those with a vested interest in this thread, I have made a pull request here: https://github.com/DLTcollab/sse2neon/pull/596
Yaaaaay!