valkey icon indicating copy to clipboard operation
valkey copied to clipboard

Using intrinsics to optimize counting HyperLogLog trailing bits

Open mapleFU opened this issue 1 year ago • 21 comments

Godbolt link: https://godbolt.org/z/3YPvxsr5s

__builtin_ctz would generate shorter code than hand-written loop.

mapleFU avatar Jul 30 '24 16:07 mapleFU

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 70.44%. Comparing base (08aaeea) to head (ea6505c). Report is 22 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #846      +/-   ##
============================================
- Coverage     70.59%   70.44%   -0.16%     
============================================
  Files           112      114       +2     
  Lines         61512    61725     +213     
============================================
+ Hits          43427    43481      +54     
- Misses        18085    18244     +159     
Files Coverage Δ
src/hyperloglog.c 91.17% <100.00%> (-0.05%) :arrow_down:
src/intrinsics.h 100.00% <100.00%> (ø)

... and 26 files with indirect coverage changes

codecov[bot] avatar Jul 30 '24 16:07 codecov[bot]

I don't have end-to-end benchmark

#include <random>
#include <cstdint>
#include <bit>

constexpr int n = 1000;

static void LoopBased(benchmark::State& state) {
    std::mt19937 rng(0);
    std::uniform_int_distribution<int64_t> uniform_dist(1, 1125899906842624);
    std::vector<int64_t> value;
    for (int i = 0; i < n; ++i) {
       value.push_back(uniform_dist(rng));
    }
    int cnt = 0;
    for (auto _ : state) {
       int count = 1;
       int64_t hash = value[cnt % n];
       int bit = 1;
       while ((hash & bit) == 0) {
         count++;
         bit <<= 1;
       }
       ::benchmark::DoNotOptimize(count);
       ++cnt;
    }
}
// Register the function as a benchmark
BENCHMARK(LoopBased);

static void CtzBased(benchmark::State& state) {
    std::mt19937 rng(0);
    std::uniform_int_distribution<int64_t> uniform_dist(1, 1125899906842624);
    std::vector<int64_t> value;
    for (int i = 0; i < n; ++i) {
       value.push_back(uniform_dist(rng));
    }
    int cnt = 0;
    for (auto _ : state) {
       int count = 1;
       int64_t hash = value[cnt % n];
       count += std::countl_zero(static_cast<uint64_t>(hash));
       ::benchmark::DoNotOptimize(count);
       ++cnt;
    }
}
BENCHMARK(CtzBased);

Quickbench has some error message and I run it on my x86 3800x cpu with gcc12 and Release -O2

-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
LoopBased        1.36 ns         1.36 ns    474427671
CtzBased        0.479 ns        0.479 ns   1000000000

mapleFU avatar Aug 02 '24 17:08 mapleFU

Also this is quickbench in x86 platform: https://quick-bench.com/q/7dlzmaBJDz9Xnfzhsk-1Iw7TP0I I believe the loop based impl is slower

mapleFU avatar Aug 02 '24 17:08 mapleFU

LGTM, @zuiderkwast @PingXie any other ideas before the merge?

enjoy-binbin avatar Aug 03 '24 15:08 enjoy-binbin

It might make sense to have a #define for __builtin_ctzll incase it's not supported in the compiler.

madolson avatar Aug 07 '24 20:08 madolson

It might make sense to have a #define for __builtin_ctzll incase it's not supported in the compiler.

#if defined(__clang__) || defined(__GNUC__)
  return static_cast<int>(__builtin_clzll(value));
#elif defined(_MSC_VER)
  unsigned long index;
  i_BitScanReverse64(&index, value);
  return 63 - static_cast<int>(index);
#else
  int bitpos = 0;
  while (value != 0) {
    value >>= 1;
    ++bitpos;
  }
  return 64 - bitpos;
#endif

I do this because server uses __builtin_clz ( see: https://github.com/valkey-io/valkey/blob/7424620ca06c3172ee21af6384080cd6d6463c53/src/dict.c#L1624 , https://github.com/valkey-io/valkey/blob/7424620ca06c3172ee21af6384080cd6d6463c53/src/server.c#L836 )

I also find some code checks that: https://github.com/valkey-io/valkey/blob/7424620ca06c3172ee21af6384080cd6d6463c53/deps/hdr_histogram/hdr_histogram.c#L164

So I don't know the idiom here :-(

@madolson do you think something like this is ok? ( borrowed from https://github.com/apache/arrow/blob/b33f040640c7ccb3e6a8406e4d3158608c597025/cpp/src/arrow/util/bit_util.h#L198 )

mapleFU avatar Aug 09 '24 03:08 mapleFU

Yeah, using __builtin_ctz is good.

So I don't know the idiom here :-(

We need an indirection in config.h. Here is an example

158 #if __GNUC__ >= 3
159 #define likely(x) __builtin_expect(!!(x), 1)
160 #define unlikely(x) __builtin_expect(!!(x), 0)
161 #else
162 #define likely(x) (x)
163 #define unlikely(x) (x)
164 #endif

PingXie avatar Aug 12 '24 02:08 PingXie

@PingXie I've add an ad-hoc counting_trailing_bits. I didn't add it in config.h because this doesn't like likely, which is an empty operation when compiler doesn't have the hint. Instead, it should fall into loop here.

mapleFU avatar Aug 12 '24 05:08 mapleFU

@PingXie I've add an ad-hoc counting_trailing_bits. I didn't add it in config.h because this doesn't like likely, which is an empty operation when compiler doesn't have the hint. Instead, it should fall into loop here.

You should be able to move this function to config.h. Make sure you keep inline so that you won't upset the linker with multiple copies of the same function, which also hurts performance, even if the linker doesn't complain (as in the MSVC case).

I would highly recommend moving any intrinsic function wrappers or their "emulation" to config.h. This maximizes their re-usability.

PingXie avatar Aug 13 '24 07:08 PingXie

BTW, +1 on dropping Windows flavor. Windows support will be a major decision. We shouldn't create precedence before going through the process.

PingXie avatar Aug 13 '24 07:08 PingXie

IMO, this might be a seperate issue, which we can add a bit_util.h or something else? My main concerns is listed below:

  1. There're already some adhoc __builtin_clz and __builtin_clzll used 🤔, I think I can draft another pr which extract them all
  2. Still config.h seems to defines enviroment rather than operations

Anyway I don't have strong intention, migrate to config.h in this patch or add a new patch is all ok to me. Other usage is here: https://github.com/valkey-io/valkey/pull/846#issuecomment-2277079195

mapleFU avatar Aug 13 '24 07:08 mapleFU

diff --git a/src/config.h b/src/config.h
index 201e42197..9866b9165 100644
--- a/src/config.h
+++ b/src/config.h
@@ -348,4 +348,48 @@ void setcpuaffinity(const char *cpulist);
 #endif
 #endif
 
+
+static inline int32_t count_leading_zeros_32(int32_t value)
+{
+#if defined(__clang__) || defined(__GNUC__)
+    if (value == 0) return 32;
+    return __builtin_clz(value); /* smallest power of 2 containing value */
+#else
+    int bitpos = 0;
+    while (value != 0) {
+        value >>= 1;
+        ++bitpos;
+    }
+    return 32 - bitpos;
+#endif
+}
+
+static inline int32_t count_leading_zeros_64(int64_t value)
+{
+#if defined(__clang__) || defined(__GNUC__)
+    if (value == 0) return 64;
+    return __builtin_clzll(value); /* smallest power of 2 containing value */
+#else
+    int bitpos = 0;
+    while (value != 0) {
+        value >>= 1;
+        ++bitpos;
+    }
+    return 64 - bitpos;
+#endif
+}
+
+static inline int32_t count_trailing_zeros_64(uint64_t value) {
+#if defined(__clang__) || defined(__GNUC__)
+    return __builtin_ctzll(value);
+#else
+    int bitpos = 0;
+    while (value & 1 == 0) {
+        value >>= 1;
+        ++bitpos;
+    }
+    return bitpos;
+#endif
+}
+
 #endif
diff --git a/src/dict.c b/src/dict.c
index 2eb3dd386..cc49cc855 100644
--- a/src/dict.c
+++ b/src/dict.c
@@ -44,6 +44,7 @@
 #include <limits.h>
 #include <sys/time.h>
 
+#include "config.h"
 #include "dict.h"
 #include "zmalloc.h"
 #include "serverassert.h"
@@ -1621,7 +1622,7 @@ static signed char _dictNextExp(unsigned long size) {
     if (size <= DICT_HT_INITIAL_SIZE) return DICT_HT_INITIAL_EXP;
     if (size >= LONG_MAX) return (8 * sizeof(long) - 1);
 
-    return 8 * sizeof(long) - __builtin_clzl(size - 1);
+    return 8 * sizeof(long) - count_leading_zeros_64(size - 1);
 }
 
 /* Finds and returns the position within the dict where the provided key should
diff --git a/src/server.c b/src/server.c
index 6979d0981..adb3edae3 100644
--- a/src/server.c
+++ b/src/server.c
@@ -833,7 +833,7 @@ int clientsCronTrackExpansiveClients(client *c, int time_idx) {
  * For more details see CLIENT_MEM_USAGE_BUCKETS documentation in server.h. */
 static inline clientMemUsageBucket *getMemUsageBucket(size_t mem) {
     int size_in_bits = 8 * (int)sizeof(mem);
-    int clz = mem > 0 ? __builtin_clzl(mem) : size_in_bits;
+    int clz = mem > 0 ? count_leading_zeros_64(mem) : size_in_bits;
     int bucket_idx = size_in_bits - clz;
     if (bucket_idx > CLIENT_MEM_USAGE_BUCKET_MAX_LOG)
         bucket_idx = CLIENT_MEM_USAGE_BUCKET_MAX_LOG;

I can also add patch like this 🤔 Waiting for your ideas...

mapleFU avatar Aug 14 '24 03:08 mapleFU

Thanks for identifying other intrinsic function usage, @mapleFU.

I am fine with addressing the other uses of intrinsic functions separately but I am hoping that we could at least lay some good foundation for the future code hygiene improvement work. I think you have a good point on config.h being a bit too stretched to host these intrinsic functions. What do you think about introducing a new intrinsic.c file and moving your new function there? Also I would like to use a name that is more relatable to the original intrinsic function so rather than calling the new function count_leading_zeros_64, how about builtin_clzl, i.e., without __? Even though intrinsic functions like __builtin_clzl are not part of the C standard, they are quite established among the popular compilers such as GCC/ICC/clang (with MSVC being the notable exception) so it would be good if we don't have to learn a new concept.

PingXie avatar Aug 14 '24 07:08 PingXie

@PingXie

I think you have a good point on config.h being a bit too stretched to host these intrinsic functions. What do you think about introducing a new intrinsic.c file and moving your new function there?

This way looks good to me

how about builtin_clzl, i.e., without __? Even though intrinsic functions like __builtin_clzl are not part of the C standard, they are quite established among the popular compilers such as GCC/ICC/clang (with MSVC being the notable exception) so it would be good if we don't have to learn a new concept.

This also LGTM

mapleFU avatar Aug 14 '24 07:08 mapleFU

Why not put these functions in util.h?

zuiderkwast avatar Aug 14 '24 10:08 zuiderkwast

Why not put these functions in util.h?

These all LGTM. Should I put them in this patch or draft another one after this is merged?

mapleFU avatar Aug 14 '24 10:08 mapleFU

Why not put these functions in util.h?

Different update cadence/stability.

'Intrinsic' are not typical helper functions but more focused on performance. They rarely need to be updated once written. There are many of them and we could start using more.

So I would like to tug them all in the same place that I can easily skip.

PingXie avatar Aug 14 '24 14:08 PingXie

So, from the discussion above, should I extract them in current patch or we can move forward?

mapleFU avatar Aug 15 '24 15:08 mapleFU

So, from the discussion above, should I extract them in current patch or we can move forward?

@zuiderkwast do you have a strong opinion against intrinsic.c?

From my end, it would be good to go if you could move this new function into a new intrinsic.c file.

PingXie avatar Aug 16 '24 06:08 PingXie

@zuiderkwast do you have a strong opinion against intrinsic.c?

No, I don't mind.

zuiderkwast avatar Aug 16 '24 08:08 zuiderkwast

So, from the discussion here, I need add a intrinsics.h and put CountingTrailingZeros in intrinsics.h

My remaining questions:

  1. Like https://github.com/valkey-io/valkey/pull/846#issuecomment-2287786840 , should I put other intrinsics here?

mapleFU avatar Aug 17 '24 10:08 mapleFU

@PingXie I've added the intrinsics.h now

mapleFU avatar Aug 21 '24 13:08 mapleFU

@PingXie Comment resolved, mind take a look again?

mapleFU avatar Aug 23 '24 12:08 mapleFU

@enjoy-binbin @madolson could we move forward and check in this? Or there're more things I need to do?

mapleFU avatar Aug 27 '24 15:08 mapleFU

I think @pingxie will merge it, although there are a bunch of test failures I am looking through. (Seem like some old issues, let me re-trigger them)

madolson avatar Aug 27 '24 16:08 madolson

There are still some test failures but I don't think they are related.

PingXie avatar Aug 28 '24 03:08 PingXie

Thanks all!

mapleFU avatar Aug 28 '24 03:08 mapleFU