Use SSE 4.2 as a baseline when compiling Godot
This lets the compiler perform more optimizations, leading to increased performance in demanding CPU tasks. This should benefit occlusion culling rasterization, physics, and more.
This change only affects x86 platforms.
This is considered a breaking change, as very old CPUs will not be able to run official Godot binaries in releases made after this is merged. On the Intel side, SSE4.2 has been supported for a long time now (since Nehalem, released in Q4 2008). However, AMD CPUs have only supported SSE4.2 since 2011 (Bulldozer, excluding APUs). It's unlikely that CPUs this old are paired with GPUs that support Vulkan or OpenGL 3.3 well anyway. This is particularly true for old AMD CPUs, which haven't aged well due to their lower single-core performance compared to Intel's at the time.
This closes https://github.com/godotengine/godot-proposals/issues/3932.
elfx86exts reports for old and new release export templates:
Instructions in the binary
Current
❯ elfx86exts godot.linuxbsd.opt.64 | sort
CPU Generation: Haswell
AES (aesenc)
BMI2 (shlx)
BMI (tzcnt)
CMOV (cmovle)
MMX (movq)
MODE64 (call)
PCLMUL (pclmulqdq)
SSE1 (movups)
SSE2 (movdqu)
With the above branch
❯ elfx86exts godot.linuxbsd.opt.64.sse4.2 | sort
CPU Generation: Unknown
CMOV (cmovs)
MODE64 (ret)
SSE1 (movss)
SSE2 (pxor)
SSE3 (lddqu)
SSE41 (roundss)
SSSE3 (pshufb)
Binary sizes are almost identical, with the SSE4.2-enabled export template being 4 KB smaller when comparing the size of both binaries stripped.
Benchmark
The testing project instantiates 500 RigidDynamicBody3D nodes and quits as quickly as possible: test_sse4.2.zip
❯ hyperfine -iw1 "bin/godot.linuxbsd.opt.64.stripped --path ~/Documents/Godot/test_sse4.2 --quit" "bin/godot.linuxbsd.opt.64.sse4.2.stripped --path ~/Documents/Godot/test_sse4.2 --quit"
Benchmark #1: bin/godot.linuxbsd.opt.64.stripped --path ~/Documents/Godot/test_sse4.2 --quit
Time (mean ± σ): 2.394 s ± 0.282 s [User: 1.508 s, System: 0.165 s]
Range (min … max): 1.605 s … 2.546 s 10 runs
Benchmark #2: bin/godot.linuxbsd.opt.64.sse4.2.stripped --path ~/Documents/Godot/test_sse4.2 --quit
Time (mean ± σ): 2.199 s ± 0.429 s [User: 1.499 s, System: 0.169 s]
Range (min … max): 1.578 s … 2.544 s 10 runs
Summary
'bin/godot.linuxbsd.opt.64.sse4.2.stripped --path ~/Documents/Godot/test_sse4.2 --quit' ran
1.09 ± 0.25 times faster than 'bin/godot.linuxbsd.opt.64.stripped --path ~/Documents/Godot/test_sse4.2 --quit'
Can we enable SSE4+ on MSVC?
Maybe we need env.Append(CCFLAGS=["/arch:AVX", "/arch:AVX2", "/arch:AVX512", "/d2archSSE42"]) for MSVC, and
env.Append(CPPDEFINES=["__SSE4_1__"]) for thirdparty/etcpak.
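As a rough illustration, the suggestion above could look something like this in SConstruct terms (a hypothetical sketch, not the PR's actual diff; MSVC's /arch options are mutually exclusive, so only the SSE4.2-related pieces are shown):

```python
# Hypothetical sketch: raise the x86_64 baseline to SSE4.2 per compiler.
# Placement in Godot's platform detect.py scripts may differ.
if env["arch"] == "x86_64":
    if env.msvc:
        # MSVC's documented x64 /arch options start at AVX, so the
        # backend switch /d2archSSE42 is used to target SSE4.2.
        env.Append(CCFLAGS=["/d2archSSE42"])
        # MSVC doesn't predefine __SSE4_1__, but thirdparty/etcpak
        # gates its SSE4.1 code paths on it.
        env.Append(CPPDEFINES=["__SSE4_1__"])
    else:
        # GCC/Clang: -msse4.2 implies SSE4.1, SSSE3, SSE3, etc.
        env.Append(CCFLAGS=["-msse4.2"])
```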
> Can we enable SSE4+ on MSVC?
This should be doable, but we decided to put this PR on hold for now to explore dynamic branching at run-time instead. This will also allow using AVX and AVX2 on supported CPUs for further performance gains.
My attempt at building with AVX2 and AVX512 did not work out because of thirdparty/embree and thirdparty/etcpak.
The Magnum developers published an article on dynamic dispatch which is worth a read: https://blog.magnum.graphics/backstage/cpu-feature-detection-dispatch/
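To make the run-time dispatch pattern concrete, here's a minimal sketch in Python (Godot itself would do this in C++; process_sse42/process_generic are invented names, and the feature detection assumes a Linux /proc/cpuinfo):

```python
# Minimal sketch of startup-time dispatch: detect CPU features once,
# then route all calls through the chosen implementation.

def cpu_flags():
    """Return the CPU feature flags reported by the Linux kernel."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

def process_generic(values):
    # Portable fallback path.
    return sum(values)

def process_sse42(values):
    # Stand-in for a SIMD-accelerated implementation.
    return sum(values)

# Resolve once at startup; later calls go straight to the selected
# implementation with no per-call feature check.
process = process_sse42 if "sse4_2" in cpu_flags() else process_generic

print(process([1, 2, 3]))
```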
EDIT: these were broken because it didn't check the binary executed. One moment.
Windows results:
- CPU: AMD Ryzen 5 4600 6-Core Processor
- OS: 64-bit Windows 11
hyperfine -iw1 "godot.windows.template_release.x86_64.exe --path test_sse42/ --quit" "godot.windows.template_release.x86_64.master.exe --path test_sse42/ --quit"
Benchmark 1: godot.windows.template_release.x86_64.exe --path test_sse42/ --quit
Time (mean ± σ): 1.8 ms ± 0.2 ms [User: 0.3 ms, System: 1.2 ms]
Range (min … max): 1.3 ms … 2.6 ms 364 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Warning: Ignoring non-zero exit code.
Benchmark 2: godot.windows.template_release.x86_64.master.exe --path test_sse42/ --quit
Time (mean ± σ): 1.733 s ± 0.035 s [User: 0.994 s, System: 0.080 s]
Range (min … max): 1.710 s … 1.822 s 10 runs
Warning: Ignoring non-zero exit code.
Warning: The first benchmarking run for this command was significantly slower than the rest (1.822 s). This could be caused by (filesystem) caches that were not filled until after the first run. You are already using the '--warmup' option which helps to fill these caches before the actual benchmark. You can either try to increase the warmup count further or re-run this benchmark on a quiet system in case it was a random outlier. Alternatively, consider using the '--prepare' option to clear the caches before each timing run.
Summary
godot.windows.template_release.x86_64.exe --path test_sse42/ --quit ran
938.53 ± 118.13 times faster than godot.windows.template_release.x86_64.master.exe --path test_sse42/ --quit
Windows results:
- CPU: AMD Ryzen 5 4600 6-Core Processor
- OS: 64-bit Windows 11
Command executed:
hyperfine -iw1 "godot.windows.template_release.x86_64.faster.exe --path test_sse42/ --quit" "godot.windows.template_release.x86_64.master.exe --path test_sse42/ --quit"
Results (SSE is slower; I guess it might not be auto-vectorizing?).
Tested again in Windows PowerShell, and master also ran faster.
I moved the camera and generated the .godot/ folder in the benchmark project; this results in a performance benefit on Windows.
test_sse42.zip (I recommend using this, as you then don't need to open the editor, and I modified the camera direction)
I spent a bit of extra time and made a patch for testing out AVX2, and the results were OK-ish.
I believe the benefits will show up more in a real game. Adding a box won't tax things very much, but writing a lot of data like a game does will. I can test this out on The Mirror tomorrow. I spent about 4 hours trying a few different things to improve it.
What are the considerations for enabling SSE 4.2 in Embree in Godot? Are you guys still working on it?
> What are the considerations for enabling SSE 4.2 in Embree in Godot? Are you guys still working on it?
It has the same concerns as mentioned above – even if you enable SSE 4.2 only for Embree, it impacts all users of the engine even if they don't use occlusion culling or LOD generation.
Rebased and tested again, it works as expected.
Isn't it easier and more semantically correct to just use the x86-64 psABI -march=x86-64-v2, which is available from GCC 11 and LLVM 12?
If the compiler is older, there is no difficulty in just using the -msse4.1 fallback as currently done in the PR, but future readers will more easily understand why only these instructions are selected. When using psABI levels, those comments with links would be unnecessary too. And just as a TODO: after this PR, the minimum requirements in the docs should be updated.
P.S. I fully support this change. It is hard to find a processor which doesn't even support x86-64-v2, which this PR targets.
I've just thought... I fully agree that plain x86-64 is very old and we need to target x86-64-v2, but there is a "compatibility" problem. Yeah, it is unlikely that such old processors will be part of a system running a Godot 4 game, but recently in chat we discussed that Godot can also be used for different kinds of apps. For example GodSVG; I really don't think that forcing x86-64-v2 is optimal in such cases. I assume there can be veeery old devices with only x86-64 but with OpenGL 3.3 which would run GodSVG without many problems... I thought about a solution to keep "compatibility", and I think I've found one. Currently, the optimize option allows a custom value with which SCons doesn't add any optimization flags, giving the user full control over optimization. All these instruction-set-specific options (-msse4.2 etc.) could be seen as "optimization flags". So I propose wrapping all these flags in a check: if env["optimize"] != "custom" (see the sketch below).
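In build-script terms, the proposal would amount to something like this (a hypothetical sketch; the actual flag lists would live in the platform detect.py files):

```python
# Hypothetical sketch: only force the SSE4.2 baseline when SCons is
# managing optimization flags; optimize="custom" leaves flags untouched,
# so users with old hardware can still build with full control.
if env["optimize"] != "custom":
    if env.msvc:
        env.Append(CCFLAGS=["/d2archSSE42"])
    else:
        env.Append(CCFLAGS=["-msse4.2"])
```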
> I assume there can be veeery old devices with only x86-64 but with OpenGL 3.3 which would run GodSVG without many problems...
Not really, unless you pair a pre-Nehalem/Bulldozer CPU with a somewhat modern GPU (which is very unlikely). I've never seen anyone with such a setup, as you'd be CPU-limited in every game with that kind of setup.
We could add a SCons option to control the intrinsics flags used in a cross-platform manner, but this should be done in a separate PR.
> Not really, unless you pair a pre-Nehalem/Bulldozer CPU with a somewhat modern GPU (which is very unlikely). I've never seen anyone with such a setup, as you'd be CPU-limited in every game with that kind of setup.
Doesn't their embedded graphics support GL 3.3? If there was embedded graphics...
> Doesn't their embedded graphics support GL 3.3? If there was embedded graphics...
No, only Direct3D 9 and OpenGL 1.x most of the time (and AMD didn't have any integrated graphics back then). Their performance was also really low (they made Intel HD Graphics look fast in comparison).
> No, only Direct3D 9 and OpenGL 1.x most of the time (and AMD didn't have any integrated graphics back then). Their performance was also really low (they made Intel HD Graphics look fast in comparison).
Yeah, I've googled and found that I overestimated those processors, but I don't understand what you mean by "modern GPU". Vulkan isn't required when you run with the Compatibility renderer. Nowadays a modern GPU means GL 4.6 and at least Vulkan 1.0, usually 1.1 or 1.2. But if we are talking only about GL 3.3, it was supported on old GPUs.
Look: Intel Core (pre-Nehalem) supports more or less everything from x86-64-v2 except SSE4.1 and SSE4.2 (2006), and the ATI Radeon HD 2000 supports OpenGL 3.3 (2007). Godot 4 should work on this setup. I personally want this PR to be merged (sooner is better), but I see a use case where it will break compatibility. I fully agree that by default it should be used even in release builds for redistribution, but I don't understand what you mean by modern GPU + old CPU.
I ran across https://github.com/godotengine/godot/issues/58463; it seems there are still people with pre-SSE4.2 CPUs around (Core 2 supports just -msse3 and -mssse3).
diff <(g++ -Q -march=x86-64-v2 --help=target) <(g++ -Q -march=nehalem --help=target)
32c32
< -march= x86-64-v2
---
> -march= nehalem
39,40c39,40
< -mavx256-split-unaligned-load [disabled]
< -mavx256-split-unaligned-store [disabled]
---
> -mavx256-split-unaligned-load [enabled]
> -mavx256-split-unaligned-store [enabled]
217c217
< -mtune= generic
---
> -mtune= nehalem
-march=nehalem is basically the same as x86-64-v2 (which I proposed), but it additionally allows some unaligned operations and, most importantly, sets -mtune, which isn't set if we just pass -march=x86-64-v2 or -msse4.2. Plus, it is supported even on older compilers (compared to x86-64-v2).
So I fully support using -march=nehalem instead of -msse4.2.
> I ran across #58463; it seems there are still people with pre-SSE4.2 CPUs around (Core 2 supports just -msse3 and -mssse3).
We could just not pass any flags for specific instruction sets when the user passes scons optimize="custom", to allow users with old hardware to compile Godot themselves.
> -march=nehalem is basically the same as x86-64-v2 (which I proposed), but it additionally allows some unaligned operations and, most importantly, sets -mtune, which isn't set if we just pass -march=x86-64-v2 or -msse4.2. Plus, it is supported even on older compilers (compared to x86-64-v2).
Regarding mtune, I ran across a short explainer implying that tuning specifically for one architecture can decrease performance on other systems, such as when tuning for core2. Therefore, I wonder whether the 'generic' tune might be more appropriate for us. But I can't really make a final recommendation, since I don't know that much about GCC flags.
Regarding mavx256-split-unaligned-load / mavx256-split-unaligned-store, it seems to be better to have them disabled on most systems, according to this source (and this one). Note that Apple's compiler disables them by default:
❯ g++ -Q --help=target | grep mavx256-split-unaligned
-mavx256-split-unaligned-load [disabled]
-mavx256-split-unaligned-store [disabled]
So perhaps your suggestion of x86-64-v2 is better after all. But yeah, again, I'm not confident in making a final recommendation for this.
Note that using x86-64-v2 will prevent compiling Godot on Ubuntu 20.04 unless you use custom repositories to get a more up-to-date GCC/Clang. This isn't too much of a problem nowadays, considering 20.04 is going EOL in April 2025.
> Note that using x86-64-v2 will prevent compiling Godot on Ubuntu 20.04 unless you use custom repositories to get a more up-to-date GCC/Clang. This isn't too much of a problem nowadays, considering 20.04 is going EOL in April 2025.
Yeah, old compiler support is the biggest problem with x86-64-v2 and higher. Even more of a problem is that we still support …
-march=nehalem is available everywhere, and it differs from x86-64-v2 only by -mtune and those probably-broken -mavx256-split-unaligned-load/store options. (One note: SSE4.2 only supports 128-bit registers :thinking: so where does the nehalem option get AVX-256 from...)
I think the best solution would be to just emulate x86-64-v2. So my proposal is to use nehalem with the following args: -march=nehalem -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -mtune=generic
diff <(g++ -Q -march=x86-64-v2 --help=target) <(g++ -Q -march=nehalem -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -mtune=generic --help=target)
32c32
< -march= x86-64-v2
---
> -march= nehalem
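In SConstruct terms, that proposed emulation would look roughly like this (a hypothetical sketch, GCC/Clang only):

```python
# Hypothetical sketch: emulate x86-64-v2 via -march=nehalem while keeping
# generic tuning and disabling the AVX split-unaligned load/store quirk.
if not env.msvc:
    env.Append(CCFLAGS=[
        "-march=nehalem",
        "-mno-avx256-split-unaligned-load",
        "-mno-avx256-split-unaligned-store",
        "-mtune=generic",
    ])
```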
I think SSE3 is better for 4.4 for now: it's the minimum for web browsers, it has similar coverage to SSE2, and SSE2-only CPUs are very old for Godot 4.x.
Also, Windows 11's minimum is SSE4.2, so switch to that once Windows 10 is actually well past EOL. macOS SSE2-only machines are also rare.
> macOS SSE2-only machines are also rare
SSE2-only macOS machines do not exist; Apple entered the Intel era with SSE3 chips, and that has been the default minimum build target ever since.
Since Jolt has been merged by now, it may be interesting to benchmark it with an SSE4.2 build (vs baseline master). They mention both SSE4.1 and SSE4.2 support on GitHub.
> Since Jolt has been merged by now, it may be interesting to benchmark it with an SSE4.2 build (vs baseline master).
Just to give an idea of potential gains, here are the results of Jolt's own PerformanceTest application, which runs a scene with a bunch of ragdolls and records how many physics steps per second it can do.
This is compiled with GCC 13, running on an AMD Ryzen 9 7940HS, Linux Mint 22.1, kernel 6.8.0-55-generic, and I also use Jolt's CROSS_PLATFORM_DETERMINISTIC=ON CMake option, in order to ensure that the exact same simulation is what's being compared (i.e. same final hash).
tl;dr: a 5-10% bump between SSE2 and SSE4.2, depending on the thread count, so not a world of difference, but "free performance" nonetheless.
(Discrete is generally the more relevant result. LinearCast is with all dynamic bodies using CCD.)
SSE2
Single precision x86 64-bit with instructions: SSE2 (Cross Platform Deterministic) (16-bit ObjectLayer) (ObjectStream)
Running scene: Ragdoll
Motion Quality, Thread Count, Steps / Second, Hash
Discrete, 1, 65.881708, 0x4c312b4745789d62
Discrete, 2, 120.174217, 0x4c312b4745789d62
Discrete, 3, 172.042531, 0x4c312b4745789d62
Discrete, 4, 222.941747, 0x4c312b4745789d62
Discrete, 5, 272.351953, 0x4c312b4745789d62
Discrete, 6, 318.623601, 0x4c312b4745789d62
Discrete, 7, 359.066919, 0x4c312b4745789d62
Discrete, 8, 397.309243, 0x4c312b4745789d62
Discrete, 9, 409.699378, 0x4c312b4745789d62
Discrete, 10, 413.474930, 0x4c312b4745789d62
Discrete, 11, 423.319247, 0x4c312b4745789d62
Discrete, 12, 437.915902, 0x4c312b4745789d62
Discrete, 13, 443.396281, 0x4c312b4745789d62
Discrete, 14, 456.935729, 0x4c312b4745789d62
Discrete, 15, 463.867629, 0x4c312b4745789d62
Discrete, 16, 469.225474, 0x4c312b4745789d62
LinearCast, 1, 60.955793, 0xb6979cd9fc00610b
LinearCast, 2, 111.709579, 0xb6979cd9fc00610b
LinearCast, 3, 161.171314, 0xb6979cd9fc00610b
LinearCast, 4, 206.203765, 0xb6979cd9fc00610b
LinearCast, 5, 248.990019, 0xb6979cd9fc00610b
LinearCast, 6, 289.745637, 0xb6979cd9fc00610b
LinearCast, 7, 322.724832, 0xb6979cd9fc00610b
LinearCast, 8, 356.100314, 0xb6979cd9fc00610b
LinearCast, 9, 367.306808, 0xb6979cd9fc00610b
LinearCast, 10, 378.609721, 0xb6979cd9fc00610b
LinearCast, 11, 387.975389, 0xb6979cd9fc00610b
LinearCast, 12, 394.422775, 0xb6979cd9fc00610b
LinearCast, 13, 407.201932, 0xb6979cd9fc00610b
LinearCast, 14, 416.069317, 0xb6979cd9fc00610b
LinearCast, 15, 426.329187, 0xb6979cd9fc00610b
LinearCast, 16, 429.793067, 0xb6979cd9fc00610b
SSE4.2
Single precision x86 64-bit with instructions: SSE2 SSE4.1 SSE4.2 (Cross Platform Deterministic) (16-bit ObjectLayer) (ObjectStream)
Running scene: Ragdoll
Motion Quality, Thread Count, Steps / Second, Hash
Discrete, 1, 72.569562, 0x4c312b4745789d62
Discrete, 2, 131.878210, 0x4c312b4745789d62
Discrete, 3, 187.021467, 0x4c312b4745789d62
Discrete, 4, 242.919995, 0x4c312b4745789d62
Discrete, 5, 295.726138, 0x4c312b4745789d62
Discrete, 6, 344.419772, 0x4c312b4745789d62
Discrete, 7, 390.927667, 0x4c312b4745789d62
Discrete, 8, 431.382724, 0x4c312b4745789d62
Discrete, 9, 438.616877, 0x4c312b4745789d62
Discrete, 10, 448.214651, 0x4c312b4745789d62
Discrete, 11, 459.467392, 0x4c312b4745789d62
Discrete, 12, 461.936360, 0x4c312b4745789d62
Discrete, 13, 474.144970, 0x4c312b4745789d62
Discrete, 14, 483.284967, 0x4c312b4745789d62
Discrete, 15, 495.812566, 0x4c312b4745789d62
Discrete, 16, 498.158671, 0x4c312b4745789d62
LinearCast, 1, 68.208308, 0xb6979cd9fc00610b
LinearCast, 2, 124.335433, 0xb6979cd9fc00610b
LinearCast, 3, 177.414498, 0xb6979cd9fc00610b
LinearCast, 4, 225.489790, 0xb6979cd9fc00610b
LinearCast, 5, 273.307982, 0xb6979cd9fc00610b
LinearCast, 6, 318.163062, 0xb6979cd9fc00610b
LinearCast, 7, 359.391264, 0xb6979cd9fc00610b
LinearCast, 8, 388.463431, 0xb6979cd9fc00610b
LinearCast, 9, 402.590608, 0xb6979cd9fc00610b
LinearCast, 10, 414.372667, 0xb6979cd9fc00610b
LinearCast, 11, 419.937163, 0xb6979cd9fc00610b
LinearCast, 12, 423.344092, 0xb6979cd9fc00610b
LinearCast, 13, 435.714761, 0xb6979cd9fc00610b
LinearCast, 14, 443.494645, 0xb6979cd9fc00610b
LinearCast, 15, 454.158931, 0xb6979cd9fc00610b
LinearCast, 16, 458.854454, 0xb6979cd9fc00610b
AVX
Single precision x86 64-bit with instructions: SSE2 SSE4.1 SSE4.2 AVX (Cross Platform Deterministic) (16-bit ObjectLayer) (ObjectStream)
Running scene: Ragdoll
Motion Quality, Thread Count, Steps / Second, Hash
Discrete, 1, 71.606223, 0x4c312b4745789d62
Discrete, 2, 130.041938, 0x4c312b4745789d62
Discrete, 3, 187.110595, 0x4c312b4745789d62
Discrete, 4, 240.950944, 0x4c312b4745789d62
Discrete, 5, 289.811308, 0x4c312b4745789d62
Discrete, 6, 336.406596, 0x4c312b4745789d62
Discrete, 7, 385.084843, 0x4c312b4745789d62
Discrete, 8, 430.523532, 0x4c312b4745789d62
Discrete, 9, 438.699406, 0x4c312b4745789d62
Discrete, 10, 450.416379, 0x4c312b4745789d62
Discrete, 11, 456.337385, 0x4c312b4745789d62
Discrete, 12, 467.463723, 0x4c312b4745789d62
Discrete, 13, 479.653266, 0x4c312b4745789d62
Discrete, 14, 489.137341, 0x4c312b4745789d62
Discrete, 15, 504.744689, 0x4c312b4745789d62
Discrete, 16, 503.544247, 0x4c312b4745789d62
LinearCast, 1, 67.801893, 0xb6979cd9fc00610b
LinearCast, 2, 124.757811, 0xb6979cd9fc00610b
LinearCast, 3, 176.110285, 0xb6979cd9fc00610b
LinearCast, 4, 227.316202, 0xb6979cd9fc00610b
LinearCast, 5, 272.823418, 0xb6979cd9fc00610b
LinearCast, 6, 318.684445, 0xb6979cd9fc00610b
LinearCast, 7, 357.785135, 0xb6979cd9fc00610b
LinearCast, 8, 388.939872, 0xb6979cd9fc00610b
LinearCast, 9, 405.994464, 0xb6979cd9fc00610b
LinearCast, 10, 415.091326, 0xb6979cd9fc00610b
LinearCast, 11, 418.342299, 0xb6979cd9fc00610b
LinearCast, 12, 430.325636, 0xb6979cd9fc00610b
LinearCast, 13, 442.675201, 0xb6979cd9fc00610b
LinearCast, 14, 453.804812, 0xb6979cd9fc00610b
LinearCast, 15, 460.400990, 0xb6979cd9fc00610b
LinearCast, 16, 466.102463, 0xb6979cd9fc00610b
AVX2
Single precision x86 64-bit with instructions: SSE2 SSE4.1 SSE4.2 AVX AVX2 F16C LZCNT TZCNT (Cross Platform Deterministic) (16-bit ObjectLayer) (ObjectStream)
Running scene: Ragdoll
Motion Quality, Thread Count, Steps / Second, Hash
Discrete, 1, 71.820937, 0x4c312b4745789d62
Discrete, 2, 131.268782, 0x4c312b4745789d62
Discrete, 3, 190.551501, 0x4c312b4745789d62
Discrete, 4, 243.609363, 0x4c312b4745789d62
Discrete, 5, 297.379175, 0x4c312b4745789d62
Discrete, 6, 350.025668, 0x4c312b4745789d62
Discrete, 7, 396.596517, 0x4c312b4745789d62
Discrete, 8, 434.581874, 0x4c312b4745789d62
Discrete, 9, 450.403601, 0x4c312b4745789d62
Discrete, 10, 458.216008, 0x4c312b4745789d62
Discrete, 11, 463.123647, 0x4c312b4745789d62
Discrete, 12, 481.347406, 0x4c312b4745789d62
Discrete, 13, 489.699345, 0x4c312b4745789d62
Discrete, 14, 505.566897, 0x4c312b4745789d62
Discrete, 15, 513.356889, 0x4c312b4745789d62
Discrete, 16, 517.098920, 0x4c312b4745789d62
LinearCast, 1, 68.810300, 0xb6979cd9fc00610b
LinearCast, 2, 126.412183, 0xb6979cd9fc00610b
LinearCast, 3, 180.955656, 0xb6979cd9fc00610b
LinearCast, 4, 231.855997, 0xb6979cd9fc00610b
LinearCast, 5, 280.253207, 0xb6979cd9fc00610b
LinearCast, 6, 325.147733, 0xb6979cd9fc00610b
LinearCast, 7, 368.339933, 0xb6979cd9fc00610b
LinearCast, 8, 408.488352, 0xb6979cd9fc00610b
LinearCast, 9, 415.251001, 0xb6979cd9fc00610b
LinearCast, 10, 423.046404, 0xb6979cd9fc00610b
LinearCast, 11, 432.250642, 0xb6979cd9fc00610b
LinearCast, 12, 444.356689, 0xb6979cd9fc00610b
LinearCast, 13, 455.247807, 0xb6979cd9fc00610b
LinearCast, 14, 464.913840, 0xb6979cd9fc00610b
LinearCast, 15, 475.344255, 0xb6979cd9fc00610b
LinearCast, 16, 479.274878, 0xb6979cd9fc00610b
According to the Steam Hardware & Software Survey for April 2025, hardware support for the instruction sets is as follows:
| Set | % |
|---|---|
| SSE2 | 100.00% |
| SSE3 | 100.00% |
| SSSE3 | 99.89% |
| SSE4.1 | 99.84% |
| SSE4.2 | 99.78% |
| AVX | 97.31% |
| AVX2 | 94.66% |
Just thought this would be relevant info for this context.
Thanks!
On MSVC, is there a reason /d2archSSE42 is being used instead of /arch:SSE4.2? The binaries produced are *slightly* different, but /arch:SSE4.2 is officially documented for 64-bit builds (at least since December 2024 according to the Wayback Machine): https://learn.microsoft.com/en-us/cpp/build/reference/arch-x64?view=msvc-160
Visual Studio's GUI doesn't expose /arch:SSE4.2, so it's possible /d2archSSE42 has received more testing due to being around longer.
> According to the Steam Hardware & Software Survey for April 2025, hardware support for the instruction sets is as follows:
>
> | Set | % |
> |---|---|
> | SSE2 | 100.00% |
> | SSE3 | 100.00% |
> | SSSE3 | 99.89% |
> | SSE4.1 | 99.84% |
> | SSE4.2 | 99.78% |
> | AVX | 97.31% |
> | AVX2 | 94.66% |
>
> Just thought this would be relevant info for this context.
Shearing off users for a negligible difference. Why was this merged?