macros config SIMD

Added including of a compiler-specific intrinsics library.

#if !defined(IMGUI_DISABLE_SIMD)
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) || defined(__clang__)
#include <x86intrin.h>
#endif
#endif

Updated macros for configuring a build from SIMD to x86-64.

IMGUI_DISABLE_SIMD
IMGUI_DISABLE_SSE
IMGUI_DISABLE_SSE4_2
IMGUI_DISABLE_AVX
IMGUI_DISABLE_AVX2
IMGUI_ENABLE_SSE
IMGUI_ENABLE_SSE4_2
IMGUI_ENABLE_AVX
IMGUI_ENABLE_AVX2

#if (defined __x86_64__ || defined _M_X64) && !defined(IMGUI_DISABLE_SIMD)
#if (defined __SSE__  || (defined(_M_IX86_FP) && (_M_IX86_FP >= 1))) && !defined(IMGUI_DISABLE_SSE)
#define IMGUI_ENABLE_SSE
#endif
#if defined (__SSE4_2__) && !defined(IMGUI_DISABLE_SSE4_2)
#define IMGUI_ENABLE_SSE4_2
#endif
#if (defined __AVX__) && !defined(IMGUI_DISABLE_AVX)
#define IMGUI_ENABLE_AVX
#endif
#if (defined __AVX2__) && !defined(IMGUI_DISABLE_AVX2)
#define IMGUI_ENABLE_AVX2
#endif
#endif

SIMD ImStrlen and ImMemchr

Created optimized ImStrlen and ImMemchr functions on SSE and AVX2. Replaced using strlen and memchr with ImStrlen and ImMemchr.

Benchmark

System Specifications

CPU: Intel Core i9-10980XE

Cores Frequency: 4600 MHz (All Cores)
Uncore Frequency: 3200 Mhz
Cores/Threads: 18 Cores / 36 Threads

RAM: 128GB (4x32 GB)

Memory Frequency: 4000 MHz
Timings: 18-22-22-42
Channels: Quad-channel

BIOS settings

Power saving disabled
All frequencies and voltages are fixed
AVX offset 400 MHz

ImStrlen benchmark

Description

Search null terminator \0, in a std::string buffer filled with random ASCII characters. Buffer sizes range from 1 MB to 1 GB. Various memchr implementations using SSE and AVX2 are tested for performance.

Google benchmark results

ImStrlen benchmark

Run on (36 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x18)
  L1 Instruction 32 KiB (x18)
  L2 Unified 1024 KiB (x18)
  L3 Unified 25344 KiB (x1)
--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
ImStrlen_AVX2_PREFETCH/1048576         0.010 ms        0.010 ms        74667 bytes_per_second=93.3337Gi/s
ImStrlen_AVX2_PREFETCH/2097152         0.052 ms        0.053 ms        10000 bytes_per_second=36.7647Gi/s
ImStrlen_AVX2_PREFETCH/16777216        0.478 ms        0.487 ms         1445 bytes_per_second=32.1111Gi/s
ImStrlen_AVX2_PREFETCH/134217728        6.75 ms         6.70 ms          112 bytes_per_second=18.6667Gi/s
ImStrlen_AVX2_PREFETCH/1073741824       56.6 ms         56.8 ms           11 bytes_per_second=17.6Gi/s
ImStrlen_SSE_PREFETCH/1048576          0.020 ms        0.020 ms        37333 bytes_per_second=49.6449Gi/s
ImStrlen_SSE_PREFETCH/2097152          0.058 ms        0.059 ms        11200 bytes_per_second=33.3333Gi/s
ImStrlen_SSE_PREFETCH/16777216         0.516 ms        0.516 ms         1000 bytes_per_second=30.303Gi/s
ImStrlen_SSE_PREFETCH/134217728         6.60 ms         6.70 ms          112 bytes_per_second=18.6667Gi/s
ImStrlen_SSE_PREFETCH/1073741824        55.9 ms         56.8 ms           11 bytes_per_second=17.6Gi/s
ImStrlen_CSTD/1048576                  0.059 ms        0.059 ms        10000 bytes_per_second=16.4474Gi/s
ImStrlen_CSTD/2097152                  0.118 ms        0.117 ms         5600 bytes_per_second=16.6667Gi/s
ImStrlen_CSTD/16777216                 0.984 ms        0.983 ms          747 bytes_per_second=15.8936Gi/s
ImStrlen_CSTD/134217728                 9.93 ms         10.0 ms           75 bytes_per_second=12.5Gi/s
ImStrlen_CSTD/1073741824                79.9 ms         79.9 ms            9 bytes_per_second=12.5217Gi/s

ImMemchr benchmark

Description

Search for all lines of length 131, ending with \n, in a std::string buffer filled with random ASCII characters. Buffer sizes range from 1 MB to 1 GB. Various memchr implementations using SSE and AVX2 are tested for performance.

Google benchmark results

ImMemchr benchmark

Run on (36 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x18)
  L1 Instruction 32 KiB (x18)
  L2 Unified 1024 KiB (x18)
  L3 Unified 25344 KiB (x1)
--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
ImMemchr_AVX2_PREFETCH/1048576         0.048 ms        0.048 ms        14452 bytes_per_second=20.5284Gi/s
ImMemchr_AVX2_PREFETCH/2097152         0.096 ms        0.096 ms         7467 bytes_per_second=20.2908Gi/s
ImMemchr_AVX2_PREFETCH/16777216        0.799 ms        0.802 ms          896 bytes_per_second=19.4783Gi/s
ImMemchr_AVX2_PREFETCH/134217728        7.28 ms         7.29 ms           90 bytes_per_second=17.1429Gi/s
ImMemchr_AVX2_PREFETCH/1073741824       58.2 ms         58.2 ms           11 bytes_per_second=17.1707Gi/s
ImMemchr_AVX2/1048576                  0.053 ms        0.053 ms        10000 bytes_per_second=18.3824Gi/s
ImMemchr_AVX2/2097152                  0.106 ms        0.107 ms         6400 bytes_per_second=18.1818Gi/s
ImMemchr_AVX2/16777216                 0.924 ms        0.941 ms          747 bytes_per_second=16.6Gi/s
ImMemchr_AVX2/134217728                 9.97 ms         9.79 ms           75 bytes_per_second=12.766Gi/s
ImMemchr_AVX2/1073741824                79.8 ms         78.1 ms            7 bytes_per_second=12.8Gi/s
ImMemchr_SSE_PREFETCH/1048576          0.056 ms        0.056 ms        11200 bytes_per_second=17.5Gi/s
ImMemchr_SSE_PREFETCH/2097152          0.112 ms        0.112 ms         5600 bytes_per_second=17.5Gi/s
ImMemchr_SSE_PREFETCH/16777216         0.918 ms        0.920 ms          747 bytes_per_second=16.9773Gi/s
ImMemchr_SSE_PREFETCH/134217728         8.25 ms         8.12 ms           75 bytes_per_second=15.3846Gi/s
ImMemchr_SSE_PREFETCH/1073741824        65.9 ms         65.3 ms           11 bytes_per_second=15.3043Gi/s
ImMemchr_SSE/1048576                   0.059 ms        0.059 ms        11200 bytes_per_second=16.6667Gi/s
ImMemchr_SSE/2097152                   0.117 ms        0.117 ms         6400 bytes_per_second=16.6667Gi/s
ImMemchr_SSE/16777216                   1.02 ms         1.00 ms          640 bytes_per_second=15.6098Gi/s
ImMemchr_SSE/134217728                  10.4 ms         10.4 ms           75 bytes_per_second=12Gi/s
ImMemchr_SSE/1073741824                 83.7 ms         83.3 ms            9 bytes_per_second=12Gi/s
ImMemchr_CSTD/1048576                  0.068 ms        0.068 ms        11200 bytes_per_second=14.2857Gi/s
ImMemchr_CSTD/2097152                  0.134 ms        0.134 ms         5600 bytes_per_second=14.5833Gi/s
ImMemchr_CSTD/16777216                  1.17 ms         1.17 ms          640 bytes_per_second=13.3333Gi/s
ImMemchr_CSTD/134217728                 11.5 ms         11.5 ms           64 bytes_per_second=10.8936Gi/s
ImMemchr_CSTD/1073741824                89.9 ms         91.5 ms            7 bytes_per_second=10.9268Gi/s

Feb 21 '25 15:02 Kraionix

Hello,

Thanks for the PR! Out of curiosity, can you describe what prompted you to undergo this optimization?

Where does the code for the 2 implementations come from? As the problem is relatively well defined and "simple", I am wondering if they may be well known implementations? Do you have links that may describe the approach (I'm not sure I understand all the code, so any further comment may be useful to facilitate possible future maintenance)

It's probably well valuable for fast-forwarding during text display, I'll be benchmarking in that specific scenario.

Feb 21 '25 15:02 ocornut

Hello. I am learning low-level optimizations using SIMD on x86-64. I wrote the code from scratch for the sake of practice. I optimized the SSE and AVX2 implementations of ImMemchr to minimize clock cycles as much as possible. I took into account loading unaligned data, preloading data into the cache, I also took security into account. I optimized iteratively until I got to the current code, with maximum optimization. This is the benchmark source code and binary link.

Feb 21 '25 16:02 Kraionix

I also tested ImGui::TextUnformatted with a 16 MB std::string buffer, in which every 131 characters is \n, the performance on AVX2 ImMemchr increased by about 60%, compared to regular memchr. ImGui::TextUnformatted call was 1.6 ms per frame, became 0.98 ms.

Feb 21 '25 17:02 Kraionix

I added an optimized ImStrlen with implementations on SSE and AVX2.

Mar 01 '25 08:03 Kraionix

I'm very skeptical that we should replace the std functions such as strlen() or strchr().

If they can be optimized, then they should be optimized in the C library, not in the ImGui library.

Mar 01 '25 22:03 DominiquePelle-TomTom

Standard functions may not always be able to take advantage of CPU-features. We typically have to reinvent many small wheels for projects like this.

But empirically, I noted an issue that sometimes affected such attempts: std lib functions we call are almost always compiled optimized, while our own functions are compiled by the project. For certain functions, the potential loss of project compiling imgui in debug mode may be meaningful or problematic (this is why, e.g. I haven't provided a qsort replacement, which is quite a frustrating hurdle to trip on, because libc qsort is annoyingly not allowing a user data pointer).

Either way, this should be tested in a typical real world scenario + some edge cases to see if the gain may be meaningful and if it's worth going forward with this.

Mar 07 '25 10:03 ocornut

I have pushed 1ec99f4 into master : "added ImStrlen/ImMemchr #define to facilitate experimenting with variations"
This is effectively allows removing lots of the "noise" from this PR, by reducing it to the core changes.
I have rebased and force-pushed your branch over latest master.

e.g. before https://github.com/ocornut/imgui/commit/c73f83538d4f6d3b21a573bc0909f8bf42789497 after https://github.com/ocornut/imgui/commit/c67c2026fdf282ce60c254bbf4e7fad988924871

So this is much neater now and will be unlikely to conflict. If I add #ifndef statements in imgui_internal.h then it become possible to provide this sorts of optimization entirely externally.

Mar 07 '25 10:03 ocornut

55% strlen and memchr optimization with SIMD on x86-64 | Macros config SIMD

macros config SIMD

SIMD ImStrlen and ImMemchr

Benchmark

System Specifications

CPU: Intel Core i9-10980XE

RAM: 128GB (4x32 GB)

BIOS settings

ImStrlen benchmark

Description

Google benchmark results

ImMemchr benchmark

Description

Google benchmark results