command-t icon indicating copy to clipboard operation
command-t copied to clipboard

Explore how SIMD could be used to eek out more perf

Open wincent opened this issue 1 year ago • 2 comments

Not sure if there's a "there" there, but I should do some research to see if anybody has used SIMD to do fuzzy text matching.

Update: Found some prior art:

wincent avatar Sep 24 '24 07:09 wincent

Here's a not-useful SIMD change (no effect, because it doesn't run in a loop):

diff --git a/lua/wincent/commandt/lib/matcher.c b/lua/wincent/commandt/lib/matcher.c
index e75fa39..0db3772 100644
--- a/lua/wincent/commandt/lib/matcher.c
+++ b/lua/wincent/commandt/lib/matcher.c
@@ -5,6 +5,7 @@
 
 #include "matcher.h"
 
+#include <arm_neon.h>
 #include <assert.h> /* for assert */
 #include <pthread.h> /* for pthread_create, pthread_join etc */
 #include <stdatomic.h> /* for atomic_fetch_add(), atomic_load() */
@@ -264,13 +265,52 @@ void commandt_result_free(result_t *result) {
 
 static long calculate_bitmask(const char *str, unsigned long length) {
     long mask = 0;
-    for (unsigned long i = 0; i < length; i++) {
-        if (str[i] >= 'a' && str[i] <= 'z') {
-            mask |= (1 << (str[i] - 'a'));
-        } else if (str[i] >= 'A' && str[i] <= 'Z') {
-            mask |= (1 << (str[i] - 'A'));
+
+    // Process 16 characters at a time
+    while (length >= 16) {
+        uint8x16_t chars = vld1q_u8((const uint8_t *)str);
+
+        // Create masks for a-z and A-Z ranges
+        uint8x16_t lower_range = vandq_u8(
+            vcgeq_u8(chars, vdupq_n_u8('a')), vcleq_u8(chars, vdupq_n_u8('z'))
+        );
+
+        uint8x16_t upper_range = vandq_u8(
+            vcgeq_u8(chars, vdupq_n_u8('A')), vcleq_u8(chars, vdupq_n_u8('Z'))
+        );
+
+        // Extract results and update mask
+        uint8_t results[16];
+        vst1q_u8(results, chars);
+
+        uint8_t ranges[16];
+        vst1q_u8(ranges, vorrq_u8(lower_range, upper_range));
+
+        for (int i = 0; i < 16; i++) {
+            if (ranges[i]) {
+                char c = results[i];
+                if (c >= 'a' && c <= 'z') {
+                    mask |= (1L << (c - 'a'));
+                } else if (c >= 'A' && c <= 'Z') {
+                    mask |= (1L << (c - 'A'));
+                }
+            }
+        }
+
+        str += 16;
+        length -= 16;
+    }
+
+    // Handle remaining characters
+    while (length--) {
+        char c = *str++;
+        if (c >= 'a' && c <= 'z') {
+            mask |= (1L << (c - 'a'));
+        } else if (c >= 'A' && c <= 'Z') {
+            mask |= (1L << (c - 'A'));
         }
     }
+
     return mask;
 }
 

Obviously not something I'd ship in any case, as it is ARM-specific, but I wanted to at least try some SIMD code before getting in too deep.

wincent avatar Feb 12 '25 23:02 wincent

Perhaps some interesting ideas in this blog post, "How we made JSON.stringify more than twice as fast". Specifically they are using SIMD to search for characters that need to be escaped, and they have two paths: one for long strings, and another for short ones:

  • For longer strings, we switch to dedicated hardware SIMD instructions (e.g., ARM64 Neon). This allows us to load a much larger chunk of the string into a wide SIMD register and check multiple bytes for any escapable characters at once in just a few instructions. (source)
  • For shorter strings, where the setup cost of hardware instructions would be too high, we use a technique called SWAR (SIMD Within A Register). This approach uses clever bitwise logic on standard general-purpose registers to process multiple characters at once with very low overhead. (source)

In V8, the threshold where the SIMD approach kicks in is 32 characters and above. The core of the "SWAR" code that actually identifies whether escaping is needed looks like this:

constexpr bool NeedsEscape(uint32_t input) {
  constexpr uint32_t mask_0x20 = 0x20202020u;
  constexpr uint32_t mask_0x22 = 0x22222222u;
  constexpr uint32_t mask_0x5c = 0x5C5C5C5Cu;
  constexpr uint32_t mask_0x01 = 0x01010101u;
  constexpr uint32_t mask_msb = 0x80808080u;
  // Escape control characters (< 0x20).
  const uint32_t has_lt_0x20 = input - mask_0x20;
  // Escape double quotation mark (0x22).
  const uint32_t has_0x22 = (input ^ mask_0x22) - mask_0x01;
  // Escape backslash (0x5C).
  const uint32_t has_0x5c = (input ^ mask_0x5c) - mask_0x01;
  // Chars >= 0x7F don't need escaping.
  const uint32_t result_mask = ~input & mask_msb;
  const uint32_t result = ((has_lt_0x20 | has_0x22 | has_0x5c) & result_mask);
  return result != 0;
}

As always with these things, the question will be whether the set-up overhead justifies itself by allowing us to scan quickly and far enough ahead to find the next character that we're looking for.

wincent avatar Aug 05 '25 10:08 wincent