55% strlen and memchr optimization with SIMD on x86-64 | Macros config SIMD
macros config SIMD
Added including of a compiler-specific intrinsics library.
#if !defined(IMGUI_DISABLE_SIMD)
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) || defined(__clang__)
#include <x86intrin.h>
#endif
#endif
Updated macros for configuring a build from SIMD to x86-64.
- IMGUI_DISABLE_SIMD
- IMGUI_DISABLE_SSE
- IMGUI_DISABLE_SSE4_2
- IMGUI_DISABLE_AVX
- IMGUI_DISABLE_AVX2
- IMGUI_ENABLE_SSE
- IMGUI_ENABLE_SSE4_2
- IMGUI_ENABLE_AVX
- IMGUI_ENABLE_AVX2
#if (defined __x86_64__ || defined _M_X64) && !defined(IMGUI_DISABLE_SIMD)
#if (defined __SSE__ || (defined(_M_IX86_FP) && (_M_IX86_FP >= 1))) && !defined(IMGUI_DISABLE_SSE)
#define IMGUI_ENABLE_SSE
#endif
#if defined (__SSE4_2__) && !defined(IMGUI_DISABLE_SSE4_2)
#define IMGUI_ENABLE_SSE4_2
#endif
#if (defined __AVX__) && !defined(IMGUI_DISABLE_AVX)
#define IMGUI_ENABLE_AVX
#endif
#if (defined __AVX2__) && !defined(IMGUI_DISABLE_AVX2)
#define IMGUI_ENABLE_AVX2
#endif
#endif
SIMD ImStrlen and ImMemchr
Created optimized ImStrlen and ImMemchr functions on SSE and AVX2. Replaced using strlen and memchr with ImStrlen and ImMemchr.
Benchmark
Benchmark
System Specifications
CPU: Intel Core i9-10980XE
- Cores Frequency: 4600 MHz (All Cores)
- Uncore Frequency: 3200 Mhz
- Cores/Threads: 18 Cores / 36 Threads
RAM: 128GB (4x32 GB)
- Memory Frequency: 4000 MHz
- Timings: 18-22-22-42
- Channels: Quad-channel
BIOS settings
- Power saving disabled
- All frequencies and voltages are fixed
- AVX offset 400 MHz
ImStrlen benchmark
Description
Search null terminator \0, in a std::string buffer filled with random ASCII characters. Buffer sizes range from 1 MB to 1 GB. Various memchr implementations using SSE and AVX2 are tested for performance.
Google benchmark results
ImStrlen benchmark
Run on (36 X 3000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x18)
L1 Instruction 32 KiB (x18)
L2 Unified 1024 KiB (x18)
L3 Unified 25344 KiB (x1)
--------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
ImStrlen_AVX2_PREFETCH/1048576 0.010 ms 0.010 ms 74667 bytes_per_second=93.3337Gi/s
ImStrlen_AVX2_PREFETCH/2097152 0.052 ms 0.053 ms 10000 bytes_per_second=36.7647Gi/s
ImStrlen_AVX2_PREFETCH/16777216 0.478 ms 0.487 ms 1445 bytes_per_second=32.1111Gi/s
ImStrlen_AVX2_PREFETCH/134217728 6.75 ms 6.70 ms 112 bytes_per_second=18.6667Gi/s
ImStrlen_AVX2_PREFETCH/1073741824 56.6 ms 56.8 ms 11 bytes_per_second=17.6Gi/s
ImStrlen_SSE_PREFETCH/1048576 0.020 ms 0.020 ms 37333 bytes_per_second=49.6449Gi/s
ImStrlen_SSE_PREFETCH/2097152 0.058 ms 0.059 ms 11200 bytes_per_second=33.3333Gi/s
ImStrlen_SSE_PREFETCH/16777216 0.516 ms 0.516 ms 1000 bytes_per_second=30.303Gi/s
ImStrlen_SSE_PREFETCH/134217728 6.60 ms 6.70 ms 112 bytes_per_second=18.6667Gi/s
ImStrlen_SSE_PREFETCH/1073741824 55.9 ms 56.8 ms 11 bytes_per_second=17.6Gi/s
ImStrlen_CSTD/1048576 0.059 ms 0.059 ms 10000 bytes_per_second=16.4474Gi/s
ImStrlen_CSTD/2097152 0.118 ms 0.117 ms 5600 bytes_per_second=16.6667Gi/s
ImStrlen_CSTD/16777216 0.984 ms 0.983 ms 747 bytes_per_second=15.8936Gi/s
ImStrlen_CSTD/134217728 9.93 ms 10.0 ms 75 bytes_per_second=12.5Gi/s
ImStrlen_CSTD/1073741824 79.9 ms 79.9 ms 9 bytes_per_second=12.5217Gi/s
ImMemchr benchmark
Description
Search for all lines of length 131, ending with \n, in a std::string buffer filled with random ASCII characters. Buffer sizes range from 1 MB to 1 GB. Various memchr implementations using SSE and AVX2 are tested for performance.
Google benchmark results
ImMemchr benchmark
Run on (36 X 3000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x18)
L1 Instruction 32 KiB (x18)
L2 Unified 1024 KiB (x18)
L3 Unified 25344 KiB (x1)
--------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
ImMemchr_AVX2_PREFETCH/1048576 0.048 ms 0.048 ms 14452 bytes_per_second=20.5284Gi/s
ImMemchr_AVX2_PREFETCH/2097152 0.096 ms 0.096 ms 7467 bytes_per_second=20.2908Gi/s
ImMemchr_AVX2_PREFETCH/16777216 0.799 ms 0.802 ms 896 bytes_per_second=19.4783Gi/s
ImMemchr_AVX2_PREFETCH/134217728 7.28 ms 7.29 ms 90 bytes_per_second=17.1429Gi/s
ImMemchr_AVX2_PREFETCH/1073741824 58.2 ms 58.2 ms 11 bytes_per_second=17.1707Gi/s
ImMemchr_AVX2/1048576 0.053 ms 0.053 ms 10000 bytes_per_second=18.3824Gi/s
ImMemchr_AVX2/2097152 0.106 ms 0.107 ms 6400 bytes_per_second=18.1818Gi/s
ImMemchr_AVX2/16777216 0.924 ms 0.941 ms 747 bytes_per_second=16.6Gi/s
ImMemchr_AVX2/134217728 9.97 ms 9.79 ms 75 bytes_per_second=12.766Gi/s
ImMemchr_AVX2/1073741824 79.8 ms 78.1 ms 7 bytes_per_second=12.8Gi/s
ImMemchr_SSE_PREFETCH/1048576 0.056 ms 0.056 ms 11200 bytes_per_second=17.5Gi/s
ImMemchr_SSE_PREFETCH/2097152 0.112 ms 0.112 ms 5600 bytes_per_second=17.5Gi/s
ImMemchr_SSE_PREFETCH/16777216 0.918 ms 0.920 ms 747 bytes_per_second=16.9773Gi/s
ImMemchr_SSE_PREFETCH/134217728 8.25 ms 8.12 ms 75 bytes_per_second=15.3846Gi/s
ImMemchr_SSE_PREFETCH/1073741824 65.9 ms 65.3 ms 11 bytes_per_second=15.3043Gi/s
ImMemchr_SSE/1048576 0.059 ms 0.059 ms 11200 bytes_per_second=16.6667Gi/s
ImMemchr_SSE/2097152 0.117 ms 0.117 ms 6400 bytes_per_second=16.6667Gi/s
ImMemchr_SSE/16777216 1.02 ms 1.00 ms 640 bytes_per_second=15.6098Gi/s
ImMemchr_SSE/134217728 10.4 ms 10.4 ms 75 bytes_per_second=12Gi/s
ImMemchr_SSE/1073741824 83.7 ms 83.3 ms 9 bytes_per_second=12Gi/s
ImMemchr_CSTD/1048576 0.068 ms 0.068 ms 11200 bytes_per_second=14.2857Gi/s
ImMemchr_CSTD/2097152 0.134 ms 0.134 ms 5600 bytes_per_second=14.5833Gi/s
ImMemchr_CSTD/16777216 1.17 ms 1.17 ms 640 bytes_per_second=13.3333Gi/s
ImMemchr_CSTD/134217728 11.5 ms 11.5 ms 64 bytes_per_second=10.8936Gi/s
ImMemchr_CSTD/1073741824 89.9 ms 91.5 ms 7 bytes_per_second=10.9268Gi/s
Hello,
Thanks for the PR! Out of curiosity, can you describe what prompted you to undergo this optimization?
Where does the code for the 2 implementations come from? As the problem is relatively well defined and "simple", I am wondering if they may be well known implementations? Do you have links that may describe the approach (I'm not sure I understand all the code, so any further comment may be useful to facilitate possible future maintenance)
It's probably well valuable for fast-forwarding during text display, I'll be benchmarking in that specific scenario.
Hello. I am learning low-level optimizations using SIMD on x86-64. I wrote the code from scratch for the sake of practice. I optimized the SSE and AVX2 implementations of ImMemchr to minimize clock cycles as much as possible. I took into account loading unaligned data, preloading data into the cache, I also took security into account. I optimized iteratively until I got to the current code, with maximum optimization. This is the benchmark source code and binary link.
I also tested ImGui::TextUnformatted with a 16 MB std::string buffer, in which every 131 characters is \n, the performance on AVX2 ImMemchr increased by about 60%, compared to regular memchr. ImGui::TextUnformatted call was 1.6 ms per frame, became 0.98 ms.
I added an optimized ImStrlen with implementations on SSE and AVX2.
I'm very skeptical that we should replace the std functions such as strlen() or strchr().
If they can be optimized, then they should be optimized in the C library, not in the ImGui library.
Standard functions may not always be able to take advantage of CPU-features. We typically have to reinvent many small wheels for projects like this.
But empirically, I noted an issue that sometimes affected such attempts: std lib functions we call are almost always compiled optimized, while our own functions are compiled by the project. For certain functions, the potential loss of project compiling imgui in debug mode may be meaningful or problematic (this is why, e.g. I haven't provided a qsort replacement, which is quite a frustrating hurdle to trip on, because libc qsort is annoyingly not allowing a user data pointer).
Either way, this should be tested in a typical real world scenario + some edge cases to see if the gain may be meaningful and if it's worth going forward with this.
- I have pushed 1ec99f4 into
master: "added ImStrlen/ImMemchr #define to facilitate experimenting with variations" - This is effectively allows removing lots of the "noise" from this PR, by reducing it to the core changes.
- I have rebased and force-pushed your branch over latest master.
e.g. before https://github.com/ocornut/imgui/commit/c73f83538d4f6d3b21a573bc0909f8bf42789497 after https://github.com/ocornut/imgui/commit/c67c2026fdf282ce60c254bbf4e7fad988924871
So this is much neater now and will be unlikely to conflict.
If I add #ifndef statements in imgui_internal.h then it become possible to provide this sorts of optimization entirely externally.