simde icon indicating copy to clipboard operation
simde copied to clipboard

enable SVML for VS2019

Open xinsun4atlas opened this issue 2 years ago • 2 comments

Hello, SIMDe only enables SVML when using intel compiler. Recent VS2019 comes with SVML as well, so it will be good to enable on that too. Thanks!

xinsun4atlas avatar Aug 26 '21 21:08 xinsun4atlas

Although I can enable SVML on VS2019 by predefine SIMDE_ARCH_X86_SVML, however, it will automatically enable AVX512 intrinsics in svml.h. The compile will fail if AVX512 are not available. It will be great if SVML intrinsics can be partially enabled according to the SSE/AVX settings, that only supported intrinsics are used.

xinsun4atlas avatar Sep 01 '21 17:09 xinsun4atlas

I ran into this too. I don't know why SIMDE_X86_SVML_NATIVE is hardcoded to imply SIMDE_X86_AVX512F_NATIVE, since there are plenty of library functions in SVML that don't depend on any AVX code.

Additionally, even if you comment out the implicit AVX512F_NATIVE and you are targeting something without AVX, then you'll need to do this as well, because the _NATIVE feature tests for these two functions are wrong:

index 81509e96..d10dcf3f 100644
--- a/simde/x86/svml.h
+++ b/simde/x86/svml.h
@@ -8825,7 +8825,7 @@ simde_mm_clog_ps (simde__m128 a) {
 SIMDE_FUNCTION_ATTRIBUTES
 simde__m256
 simde_mm256_clog_ps (simde__m256 a) {
-  #if defined(SIMDE_X86_SVML_NATIVE) && defined(SIMDE_X86_SSE_NATIVE)
+  #if defined(SIMDE_X86_SVML_NATIVE) && defined(SIMDE_X86_AVX_NATIVE)
     return _mm256_clog_ps(a);
   #else
     simde__m256_private
@@ -8880,7 +8880,7 @@ simde_mm_csqrt_ps (simde__m128 a) {
 SIMDE_FUNCTION_ATTRIBUTES
 simde__m256
 simde_mm256_csqrt_ps (simde__m256 a) {
-  #if defined(SIMDE_X86_SVML_NATIVE) && defined(SIMDE_X86_SSE_NATIVE)
+  #if defined(SIMDE_X86_SVML_NATIVE) && defined(SIMDE_X86_AVX_NATIVE)
     return _mm256_csqrt_ps(a);
   #else
     simde__m256_private

tycho avatar Sep 14 '21 15:09 tycho

I would happily receive a PR to fix this

mr-c avatar Feb 18 '23 10:02 mr-c

FYI, I tried this in https://github.com/simd-everywhere/simde/commit/7720b1de544c2900c4266027bb3b927ec177d4f6 and 208 tests failed: https://ci.appveyor.com/project/nemequ/simde/builds/46916083/job/82oh98b4w8ae98fm#L6413

mr-c avatar Apr 28 '23 19:04 mr-c

FYI, I tried this in 7720b1d and 208 tests failed: https://ci.appveyor.com/project/nemequ/simde/builds/46916083/job/82oh98b4w8ae98fm#L6413

@tycho @xinsun4atlas It feels like the reasons the tests fail is due to implicit enabling of AVX512 (which is not supported by the appveyor runners); can you test my branch https://github.com/simd-everywhere/simde/tree/ci/appveyor/svml on your systems?

mr-c avatar Oct 19 '23 10:10 mr-c

Surely there must be a way to mark these as "skipped" tests on hardware that doesn't support AVX512? The test runner does have counters for those types of results, after all:

Ok:                 1552
Expected Fail:      0
Fail:               222
Unexpected Pass:    0
Skipped:            0
Timeout:            0

Anyway, I don't have any desktop machines with AVX512 support. I had to resort to a recent AMD Zen 4-based laptop which did support it instead:

Ok:                 1774
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            0
Timeout:            0

That felt weird, even a recent 13th Gen Intel laptop doesn't have AVX512...

tycho avatar Oct 20 '23 02:10 tycho

@tycho Which /arch: flags did you build with in each scenario?

If you didn't use any, then that would confirm that there seems to be AVX512 opcodes emitted by the MSVC compiler when SVML intrinsics are being used. (Or the SVML library it links against is using AVX512, see the discussion in the comments at https://stackoverflow.com/a/69418832 )

mr-c avatar Oct 20 '23 07:10 mr-c

MSVC is a bit dumb and allows you to use intrinsics from any x86 feature set without any special /arch: flags.

tycho avatar Oct 20 '23 08:10 tycho

Surely there must be a way to mark these as "skipped" tests on hardware that doesn't support AVX512?

Correct, we skip running the tests completely in CI when we use /arch:AVX or higher:

https://github.com/simd-everywhere/simde/blob/6ce6030ae3b8decadac030b88a8a1868e809c9e0/.appveyor.yml#L98-L105

But if a user of SIMDe is not setting /arch:AVX512 then SIMDe shouldn't be inadvertently causing AVX512 opcodes to be used

mr-c avatar Oct 20 '23 08:10 mr-c

It will be great if SVML intrinsics can be partially enabled according to the SSE/AVX settings, that only supported intrinsics are used.

It looks like this is not possible for MSVC

mr-c avatar Oct 20 '23 08:10 mr-c

Dear @xinsun4atlas

Hello, SIMDe only enables SVML when using intel compiler. Recent VS2019 comes with SVML as well, so it will be good to enable on that too. Thanks!

Well, we enable it unconditionally for the Intel Compiler, as it used to be unique to them

https://github.com/simd-everywhere/simde/blob/6ce6030ae3b8decadac030b88a8a1868e809c9e0/simde/simde-features.h#L298-L302

If your system can support the MSVC bundled SVML library, you can define SIMDE_ARCH_X86_SVML to indicate this. However it seems to use some advanced AVX intrinsics, so the resulting application/library might not be very portable.

I can't find any documentation and as you can see from the conversation here we aren't having much luck with it.

mr-c avatar Oct 20 '23 08:10 mr-c

MSVC is a bit dumb and allows you to use intrinsics from any x86 feature set without any special /arch: flags.

True, but SIMDe won't use native intrinsics just because immintrin.h was included. The test errors come from outside the SVML code, like x86/avx512/shuffle even though we don't set /arch:AVX512 so something strange is really going on

mr-c avatar Oct 20 '23 08:10 mr-c

The faulting instruction in e.g. shuffle-native-c.exe is:

SIMDE_FUNCTION_ATTRIBUTES
simde__m512i
simde__m512i_from_private(simde__m512i_private v) {
00007FF76A1224F0  mov         qword ptr [rsp+8],rcx  
00007FF76A1224F5  push        rbp  
00007FF76A1224F6  sub         rsp,0A0h  
00007FF76A1224FD  lea         rbp,[rsp+60h]  
00007FF76A122502  and         rbp,0FFFFFFFFFFFFFFC0h  
  simde__m512i r;
  simde_memcpy(&r, &v, sizeof(r));
00007FF76A122506  mov         r8d,40h  
00007FF76A12250C  mov         rdx,qword ptr [&v]  
00007FF76A122514  lea         rcx,[rbp]  
00007FF76A122518  call        memcpy (07FF76A16EE56h)  
  return r;
00007FF76A12251D  vmovdqu32   zmm0,zmmword ptr [rbp]  <----
}
00007FF76A122524  add         rsp,0A0h  
00007FF76A12252B  pop         rbp  
00007FF76A12252C  ret  

So it looks like it's generating AVX512 instructions because of the simde__m512i return type. Kind of annoying.

tycho avatar Oct 20 '23 08:10 tycho

The last comment before this typedef is informative: https://github.com/simd-everywhere/simde/blob/6ce6030ae3b8decadac030b88a8a1868e809c9e0/simde/x86/avx512/types.h#L506-L537

It seems like the !defined(HEDLEY_INTEL_VERSION) check is rather suspect. Since MSVC != Intel compiler, it enables it regardless of SIMDE_X86_AVX512F_NATIVE. I think that condition should just be:

#if (defined(_MM_CMPINT_GE) || defined(_MM_CMPINT_NLT)) && defined(SIMDE_X86_AVX512F_NATIVE)

Or something like that?

tycho avatar Oct 20 '23 08:10 tycho

To add on to that, this change does indeed make the tests pass on non-AVX512 hardware (and without additional /arch: flags) with MSVC 2019:

diff --git a/simde/x86/avx512/types.h b/simde/x86/avx512/types.h
index d4e5618a..09e8b21e 100644
--- a/simde/x86/avx512/types.h
+++ b/simde/x86/avx512/types.h
@@ -527,7 +527,7 @@ typedef union {
  *
  * As for the ICC check, unlike other compilers, merely using the
  * AVX-512 types causes ICC to generate AVX-512 instructions. */
-#if (defined(_MM_CMPINT_GE) || defined(_MM_CMPINT_NLT)) && (defined(SIMDE_X86_AVX512F_NATIVE) || !defined(HEDLEY_INTEL_VERSION))
+#if (defined(_MM_CMPINT_GE) || defined(_MM_CMPINT_NLT)) && defined(SIMDE_X86_AVX512F_NATIVE)
   typedef __m512 simde__m512;
   typedef __m512i simde__m512i;
   typedef __m512d simde__m512d;

tycho avatar Oct 20 '23 08:10 tycho

@tycho Can you write a test program to use `#include "simde/x86/avx512/types.h" from https://github.com/simd-everywhere/simde/tree/ci/appveyor/svml on MSVC without /arch and print the following information?

value of _MM_CMPINT_GE value of _MM_CMPINT_NLT value of SIMDE_X86_AVX512F_NATIVE value of HEDLEY_INTEL_VERSION

type of simde__m512i

value of _AVX512BF16INTRIN_H_INCLUDED value of __AVX512BF16INTRIN_H value of SIMDE_X86_AVX512BF16_NATIVE

type of simde__m256bh

value of __AVX__ value of __AVX2__ value of _MSC_VER value of __AVX512F__ value of SIMDE_ARCH_X86_AVX512F value of SIMDE_X86_AVX512F_NATIVE

mr-c avatar Oct 20 '23 08:10 mr-c

The last comment before this typedef is informative:

https://github.com/simd-everywhere/simde/blob/6ce6030ae3b8decadac030b88a8a1868e809c9e0/simde/x86/avx512/types.h#L506-L537

It seems like the !defined(HEDLEY_INTEL_VERSION) check is rather suspect. Since MSVC != Intel compiler, it enables it regardless of SIMDE_X86_AVX512F_NATIVE. I think that condition should just be:

#if (defined(_MM_CMPINT_GE) || defined(_MM_CMPINT_NLT)) && defined(SIMDE_X86_AVX512F_NATIVE)

Or something like that?

Ah, I think it was meant to be && !defined(HEDLEY_INTEL_VERSION)) , AND not OR.

mr-c avatar Oct 20 '23 08:10 mr-c

hmm.. maybe not; then ICC would never get the AVX512 types..

mr-c avatar Oct 20 '23 08:10 mr-c

Yeah I don't know what the condition was trying to do. I think it should just be the condition I put in the diff, because the Windows version of the Intel compiler matches MSVC behavior.

tycho avatar Oct 20 '23 08:10 tycho

Here's the original commit: https://github.com/simd-everywhere/simde/commit/4637a303d

mr-c avatar Oct 20 '23 08:10 mr-c

So the original thinking was that checking for defined(_MM_CMPINT_GE) || defined(_MM_CMPINT_NLT) was enough to detect AVX512F being available.

But Intel would still define those and emit AVX512 opcodes when any AVX512 types were used. It seems that MSVC compilers have the same problem. So maybe the answer is

diff --git a/simde/x86/avx512/types.h b/simde/x86/avx512/types.h
index d4e5618a..2df43909 100644
--- a/simde/x86/avx512/types.h
+++ b/simde/x86/avx512/types.h
@@ -527,7 +527,7 @@ typedef union {
  *
  * As for the ICC check, unlike other compilers, merely using the
  * AVX-512 types causes ICC to generate AVX-512 instructions. */
-#if (defined(_MM_CMPINT_GE) || defined(_MM_CMPINT_NLT)) && (defined(SIMDE_X86_AVX512F_NATIVE) || !defined(HEDLEY_INTEL_VERSION))
+#if (defined(_MM_CMPINT_GE) || defined(_MM_CMPINT_NLT)) && (defined(SIMDE_X86_AVX512F_NATIVE) || !(defined(HEDLEY_INTEL_VERSION) || defined(HEDLEY_MSVC_VERSION)))
   typedef __m512 simde__m512;
   typedef __m512i simde__m512i;
   typedef __m512d simde__m512d;

mr-c avatar Oct 20 '23 08:10 mr-c

I think that would be wrong with clang-cl though, because that one behaves more sanely with regards to code generation for instruction sets not enabled in compiler flags.

(clang-cl calls itself MSVC, so you'd need something like !(defined(INTEL) || ((defined(MSVC) && !defined(CLANG)) or whatever it would be with the HEDLEY macros.)

tycho avatar Oct 20 '23 08:10 tycho

Thanks for the tip, @tycho !

Fingers crossed: ~~https://ci.appveyor.com/project/nemequ/simde/builds/48327075/job/3lnyo84d655gnsd8~~ ~~https://ci.appveyor.com/project/nemequ/simde/build/job/oxi8lp9bcy2l3n5u~~ https://ci.appveyor.com/project/nemequ/simde/build/job/p4w61kdgj3n9mrc3

mr-c avatar Oct 20 '23 09:10 mr-c

Success!

mr-c avatar Oct 20 '23 10:10 mr-c

@mr-c By the way, if you're curious, here's what clang (and clang-cl) would do on Windows with __m512 arguments and return type:

https://gcc.godbolt.org/z/qv4eMqWWa

Trying to use an AVX512 intrinsic is only allowed by clang/clang-cl if the appropriate feature is enabled too:

https://gcc.godbolt.org/z/a37vzjPa3

If only this behavior was standard for MSVC as well...

tycho avatar Oct 20 '23 17:10 tycho

Great to know, thank you!

If you have a bit of time, it would be helpful to add ClangCL to our testing setup. Your choice of CI provider 😊🙏

mr-c avatar Oct 20 '23 17:10 mr-c

Thanks for the fix! For some reasons we deprecated the project which was using this feature that I cannot test on my side. Appreciate the work and definitely will take a try if we get a chance in our new projects.

xinsun4atlas avatar Oct 20 '23 18:10 xinsun4atlas