openexr icon indicating copy to clipboard operation
openexr copied to clipboard

Enable SSE4 support on Windows #2

Open darbyjohnston opened this issue 3 years ago • 11 comments

OpenEXR currently uses the macro __SSE4_1__ to determine whether to enable SSE4.1 support, for example in: https://github.com/AcademySoftwareFoundation/openexr/blob/main/src/lib/OpenEXR/ImfSimd.h

Unfortunately it looks like this macro is not defined by Visual Studio and the function __cpuid() should be used instead: https://docs.microsoft.com/en-us/cpp/intrinsics/cpuid-cpuidex?view=msvc-170

This PR adds a CMake test using __cpuid() to determine if SSE4.1 support can be enabled on Windows.

darbyjohnston avatar Sep 04 '22 01:09 darbyjohnston

I think the problem is a little different.

For all platforms, the __SSE4_1__ macro at compile time will only be defined if you've compiled with the right flags to enable that ISA. (I think this is also true for Visual Studio.)

With gcc or clang, it takes the -msse4.1 (or other later architectures) to tell the compiler to define __SSE4_1__.

The -m flag doesn't work for MSVS, it's called /arch:... there, but the real kicker is that there is no way to specify for SSE4.1 only. You can say SSE2 or AVX. So I think the original code is fine, as long as the build system is changed to have an option that will correctly set /arch:AVX. Though this also potentially turns on AVX instructions, so it should only be used if you are pretty sure you'll be on a machine with AVX ISA.

Your approach in this patch involves interrogating the hardware that is doing the build, and compiling SSE4.1 instructions into the library if the hardware it's running on at that moment supports those intrinsics. I think this has some additional drawbacks:

  • You may not want SSE4 support in the library because although it's supported on the machine doing the build, it may not be on the machines that will run software linked against the library.
  • Or vice versa -- you may want to prepare the library with SSE4 support despite building on a machine without it.

lgritz avatar Sep 05 '22 06:09 lgritz

Upon looking again, I may have characterized your patch incorrectly. Looks like you use the runtime (on the build machine) availability of that ISA on the CPU only to set the default for whether or not to use SSE4.1 instructions. So you can build to run on a different type of machine than you are building on, if you know to set this option.

But I still think that having the default set in this way is likely to make behaviors that could surprise users.

In any case, I think this would be a good topic to discuss as a group, maybe most efficiently in this week's OpenEXR TSC meeting. But here is also fine if everybody would rather type.

Excuse me if I'm the one totally misunderstanding the whole issue. I'm a pretty lousy Windows programmer.

lgritz avatar Sep 05 '22 06:09 lgritz

Hi and thanks for the feedback. I'm definitely not an expert in Windows development either, this is just what I found after some searching.

For all platforms, the SSE4_1 macro at compile time will only be defined if you've compiled with the right flags to enable that ISA. (I think this is also true for Visual Studio.)

As far as I can tell the __SSE4_1__ macro is not available on Windows at all, you have to use the __cpuid() function at runtime.

With gcc or clang, it takes the -msse4.1 (or other later architectures) to tell the compiler to define SSE4_1.

How is that specified, are you passing it on the CMake command line? I did a quick search though the CMake files and didn't see that flag.

darbyjohnston avatar Sep 05 '22 21:09 darbyjohnston

How is that specified, are you passing it on the CMake command line? I did a quick search though the CMake files and didn't see that flag.

Yeah, that's exactly my point. A bigger solve is needed than just this patch.

As an example of trying to jump through the hoops fully, here's a relevant section of OSL's build scripts: https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/cmake/compiler.cmake#L300 There's a cmake cache variable USE_SIMD that at build time (e.g., cmake -DUSE_SIMD=sse4.1) that is used to specify which instruction set to use. And you can see (circa line 319) that it turns the SIMD feature requests into -m for gcc and clang, or /arch: for MSVS. (Aside: I think this could be beefed up, as it doesn't understand that only some of these feature names are recognized by MS.)

The -m or /arch in turn tell the compilers to enable certain instruction sets, and also cause the compiler to define the appropriate symbols (like __SSE4_1__) so that user code knows which intrinsics are available.

So here are my claims:

  1. Without any of this logic, the ONLY reason we currently get any SIMD is that we lucked out: Turns out that all x86_64 chips support a minimum of SSE2, so __SSE2__ is always defined when building on 64 bit Intel/AMD machines.

  2. As far as I can tell (though I may be wrong), we lack a way to ask the build system to use SSE4 or beyond, and so the parts guarded by the macros for SSE4 or AVX are currently never used.

  3. Even if we had them, it's really shady on Windows using MSVS as the compiler, because the only way to request SSE4 is to actually ask for AVX, which may or may not be what you want. But we may just have to live with that, and as a practical matter, most people interested in performance are probably on AVX2 or more, so maybe we shouldn't worry about the sliver of people with machines that are SSE4 -- beyond SSE2 baseline, but still not at AVX (those must be at least 10 year old computers).

  4. You can tell if the chip you are running on supports sse4 by using cpuid(), yes, but that's meant for runtime checks and I think all you are doing is testing the capabilities of the build machine, and that can get people into trouble if they are building on a more capable machine than the hardware that will try to run the software they build.

lgritz avatar Sep 05 '22 22:09 lgritz

I like the idea of using CMake options to explicitly enable the support. It's simple, documents what's available, and avoids the issue you bring up of compiling for different machines. I would be happy to redo this PR similar to the OSL approach or attend one of the TSC meetings if you want to discuss it further.

darbyjohnston avatar Sep 05 '22 22:09 darbyjohnston

OK, I've made the changes we talked about at the TSC meeting:

  • Use CMake options for enabling SIMD support, defaulted to OFF
  • Add a few SIMD enabled runs to the CI

I also added a small test that prints out which SIMD options were enabled and what is detected at runtime. Interestingly there was an existing CpuId class in ImfSystemSpecific.h/.cpp that looks like it's being used for runtime detection of SIMD support in the DWA compression code. Maybe a follow up PR could also use this class to do runtime detection for the SSE support in the ZIP code. I also noticed there was some SIMD code in OpenEXRCore but I wasn't sure if that should be updated as well?

darbyjohnston avatar Sep 10 '22 02:09 darbyjohnston

Here are a couple performance comparisons on Windows 10 with a Ryzen 3900X compiled with VS 2022. The utility exrmakepreview was run three times and the results averaged.

SIMD options disabled:

  • 4096x2160 RGB file, ZIP compression: 0.6149007 seconds
  • 4096x2160 RGB file, DWAA compression: 0.28937083 seconds

SIMD options enabled:

  • 4096x2160 RGB file, ZIP compression: 0.4292118 seconds
  • 4096x2160 RGB file, DWAA compression: 0.2162931 seconds

darbyjohnston avatar Sep 11 '22 21:09 darbyjohnston

Nice! This is definitely a big boost for Windows users who, we now think, had never been experiencing the SIMD speedup that had been enabled by default on Linux & OSX when running on x86_64 compatible CPUs.

Are you further able to see any measurable difference between sse2 and sse4.1?

lgritz avatar Sep 12 '22 02:09 lgritz

Sorry, SSE2 support was previously enabled by default on Windows but not SSE4. Those numbers were showing the difference between SIMD being completely disabled and enabled.

Here are the numbers with just SSE2 enabled:

4096x2160 RGB file, ZIP compression: 0.5266201 seconds
4096x2160 RGB file, DWAA compression: 0.2167863 seconds

The only SSE4 code is in the ZIP decompressor so it just provides a small speedup for reading ZIP files on Windows.

Maybe I misunderstood, did we want the SIMD options to be completely off by default for portability? SSE2 and SSE4 both seem old enough they are safe assumptions, and with a bit more work I could add runtime checking just in case.

darbyjohnston avatar Sep 13 '22 01:09 darbyjohnston

Hi, sorry for the churn on this PR, I just updated it with more changes. I added runtime CPU detection for SSE2 and SSE4.1 in the ZIP code, similar to how the DWA code works. I also enabled SSE4.1 when the compiler is Visual Studio and the processor is x86 or x64 (which is how SSE2 currently works). I think with these changes the CMake options aren't needed? The SSE code will be compiled when possible on the host machine and enabled/disabled at runtime for the client machines, so everything happens automatically.

If these changes seem reasonable I can also add the runtime CPU detection for the only other place I think SSE is currently used, in ImfScanLineInputFile.cpp.

darbyjohnston avatar Sep 19 '22 00:09 darbyjohnston

Thanks for taking the time to review the changes, is there anything else I need to do for the code to be merged?

darbyjohnston avatar Oct 03 '22 18:10 darbyjohnston

Apologies for the delay, and thanks for the contribution! Merging now.

cary-ilm avatar Oct 11 '22 22:10 cary-ilm