valkey
valkey copied to clipboard
Using intrinsics to optimize counting HyperLogLog trailing bits
Godbolt link: https://godbolt.org/z/3YPvxsr5s
__builtin_ctz would generate shorter code than hand-written loop.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 70.44%. Comparing base (
08aaeea) to head (ea6505c). Report is 22 commits behind head on unstable.
Additional details and impacted files
@@ Coverage Diff @@
## unstable #846 +/- ##
============================================
- Coverage 70.59% 70.44% -0.16%
============================================
Files 112 114 +2
Lines 61512 61725 +213
============================================
+ Hits 43427 43481 +54
- Misses 18085 18244 +159
| Files | Coverage Δ | |
|---|---|---|
| src/hyperloglog.c | 91.17% <100.00%> (-0.05%) |
:arrow_down: |
| src/intrinsics.h | 100.00% <100.00%> (ø) |
I don't have end-to-end benchmark
#include <random>
#include <cstdint>
#include <bit>
constexpr int n = 1000;
static void LoopBased(benchmark::State& state) {
std::mt19937 rng(0);
std::uniform_int_distribution<int64_t> uniform_dist(1, 1125899906842624);
std::vector<int64_t> value;
for (int i = 0; i < n; ++i) {
value.push_back(uniform_dist(rng));
}
int cnt = 0;
for (auto _ : state) {
int count = 1;
int64_t hash = value[cnt % n];
int bit = 1;
while ((hash & bit) == 0) {
count++;
bit <<= 1;
}
::benchmark::DoNotOptimize(count);
++cnt;
}
}
// Register the function as a benchmark
BENCHMARK(LoopBased);
static void CtzBased(benchmark::State& state) {
std::mt19937 rng(0);
std::uniform_int_distribution<int64_t> uniform_dist(1, 1125899906842624);
std::vector<int64_t> value;
for (int i = 0; i < n; ++i) {
value.push_back(uniform_dist(rng));
}
int cnt = 0;
for (auto _ : state) {
int count = 1;
int64_t hash = value[cnt % n];
count += std::countl_zero(static_cast<uint64_t>(hash));
::benchmark::DoNotOptimize(count);
++cnt;
}
}
BENCHMARK(CtzBased);
Quickbench has some error message and I run it on my x86 3800x cpu with gcc12 and Release -O2
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
LoopBased 1.36 ns 1.36 ns 474427671
CtzBased 0.479 ns 0.479 ns 1000000000
Also this is quickbench in x86 platform: https://quick-bench.com/q/7dlzmaBJDz9Xnfzhsk-1Iw7TP0I I believe the loop based impl is slower
LGTM, @zuiderkwast @PingXie any other ideas before the merge?
It might make sense to have a #define for __builtin_ctzll incase it's not supported in the compiler.
It might make sense to have a #define for __builtin_ctzll incase it's not supported in the compiler.
#if defined(__clang__) || defined(__GNUC__)
return static_cast<int>(__builtin_clzll(value));
#elif defined(_MSC_VER)
unsigned long index;
i_BitScanReverse64(&index, value);
return 63 - static_cast<int>(index);
#else
int bitpos = 0;
while (value != 0) {
value >>= 1;
++bitpos;
}
return 64 - bitpos;
#endif
I do this because server uses __builtin_clz ( see: https://github.com/valkey-io/valkey/blob/7424620ca06c3172ee21af6384080cd6d6463c53/src/dict.c#L1624 , https://github.com/valkey-io/valkey/blob/7424620ca06c3172ee21af6384080cd6d6463c53/src/server.c#L836 )
I also find some code checks that: https://github.com/valkey-io/valkey/blob/7424620ca06c3172ee21af6384080cd6d6463c53/deps/hdr_histogram/hdr_histogram.c#L164
So I don't know the idiom here :-(
@madolson do you think something like this is ok? ( borrowed from https://github.com/apache/arrow/blob/b33f040640c7ccb3e6a8406e4d3158608c597025/cpp/src/arrow/util/bit_util.h#L198 )
Yeah, using __builtin_ctz is good.
So I don't know the idiom here :-(
We need an indirection in config.h. Here is an example
158 #if __GNUC__ >= 3
159 #define likely(x) __builtin_expect(!!(x), 1)
160 #define unlikely(x) __builtin_expect(!!(x), 0)
161 #else
162 #define likely(x) (x)
163 #define unlikely(x) (x)
164 #endif
@PingXie I've add an ad-hoc counting_trailing_bits. I didn't add it in config.h because this doesn't like likely, which is an empty operation when compiler doesn't have the hint. Instead, it should fall into loop here.
@PingXie I've add an ad-hoc
counting_trailing_bits. I didn't add it inconfig.hbecause this doesn't like likely, which is an empty operation when compiler doesn't have the hint. Instead, it should fall into loop here.
You should be able to move this function to config.h. Make sure you keep inline so that you won't upset the linker with multiple copies of the same function, which also hurts performance, even if the linker doesn't complain (as in the MSVC case).
I would highly recommend moving any intrinsic function wrappers or their "emulation" to config.h. This maximizes their re-usability.
BTW, +1 on dropping Windows flavor. Windows support will be a major decision. We shouldn't create precedence before going through the process.
IMO, this might be a seperate issue, which we can add a bit_util.h or something else? My main concerns is listed below:
- There're already some adhoc
__builtin_clzand__builtin_clzllused 🤔, I think I can draft another pr which extract them all - Still
config.hseems to defines enviroment rather than operations
Anyway I don't have strong intention, migrate to config.h in this patch or add a new patch is all ok to me. Other usage is here: https://github.com/valkey-io/valkey/pull/846#issuecomment-2277079195
diff --git a/src/config.h b/src/config.h
index 201e42197..9866b9165 100644
--- a/src/config.h
+++ b/src/config.h
@@ -348,4 +348,48 @@ void setcpuaffinity(const char *cpulist);
#endif
#endif
+
+static inline int32_t count_leading_zeros_32(int32_t value)
+{
+#if defined(__clang__) || defined(__GNUC__)
+ if (value == 0) return 32;
+ return __builtin_clz(value); /* smallest power of 2 containing value */
+#else
+ int bitpos = 0;
+ while (value != 0) {
+ value >>= 1;
+ ++bitpos;
+ }
+ return 32 - bitpos;
+#endif
+}
+
+static inline int32_t count_leading_zeros_64(int64_t value)
+{
+#if defined(__clang__) || defined(__GNUC__)
+ if (value == 0) return 64;
+ return __builtin_clzll(value); /* smallest power of 2 containing value */
+#else
+ int bitpos = 0;
+ while (value != 0) {
+ value >>= 1;
+ ++bitpos;
+ }
+ return 64 - bitpos;
+#endif
+}
+
+static inline int32_t count_trailing_zeros_64(uint64_t value) {
+#if defined(__clang__) || defined(__GNUC__)
+ return __builtin_ctzll(value);
+#else
+ int bitpos = 0;
+ while (value & 1 == 0) {
+ value >>= 1;
+ ++bitpos;
+ }
+ return bitpos;
+#endif
+}
+
#endif
diff --git a/src/dict.c b/src/dict.c
index 2eb3dd386..cc49cc855 100644
--- a/src/dict.c
+++ b/src/dict.c
@@ -44,6 +44,7 @@
#include <limits.h>
#include <sys/time.h>
+#include "config.h"
#include "dict.h"
#include "zmalloc.h"
#include "serverassert.h"
@@ -1621,7 +1622,7 @@ static signed char _dictNextExp(unsigned long size) {
if (size <= DICT_HT_INITIAL_SIZE) return DICT_HT_INITIAL_EXP;
if (size >= LONG_MAX) return (8 * sizeof(long) - 1);
- return 8 * sizeof(long) - __builtin_clzl(size - 1);
+ return 8 * sizeof(long) - count_leading_zeros_64(size - 1);
}
/* Finds and returns the position within the dict where the provided key should
diff --git a/src/server.c b/src/server.c
index 6979d0981..adb3edae3 100644
--- a/src/server.c
+++ b/src/server.c
@@ -833,7 +833,7 @@ int clientsCronTrackExpansiveClients(client *c, int time_idx) {
* For more details see CLIENT_MEM_USAGE_BUCKETS documentation in server.h. */
static inline clientMemUsageBucket *getMemUsageBucket(size_t mem) {
int size_in_bits = 8 * (int)sizeof(mem);
- int clz = mem > 0 ? __builtin_clzl(mem) : size_in_bits;
+ int clz = mem > 0 ? count_leading_zeros_64(mem) : size_in_bits;
int bucket_idx = size_in_bits - clz;
if (bucket_idx > CLIENT_MEM_USAGE_BUCKET_MAX_LOG)
bucket_idx = CLIENT_MEM_USAGE_BUCKET_MAX_LOG;
I can also add patch like this 🤔 Waiting for your ideas...
Thanks for identifying other intrinsic function usage, @mapleFU.
I am fine with addressing the other uses of intrinsic functions separately but I am hoping that we could at least lay some good foundation for the future code hygiene improvement work. I think you have a good point on config.h being a bit too stretched to host these intrinsic functions. What do you think about introducing a new intrinsic.c file and moving your new function there? Also I would like to use a name that is more relatable to the original intrinsic function so rather than calling the new function count_leading_zeros_64, how about builtin_clzl, i.e., without __? Even though intrinsic functions like __builtin_clzl are not part of the C standard, they are quite established among the popular compilers such as GCC/ICC/clang (with MSVC being the notable exception) so it would be good if we don't have to learn a new concept.
@PingXie
I think you have a good point on config.h being a bit too stretched to host these intrinsic functions. What do you think about introducing a new intrinsic.c file and moving your new function there?
This way looks good to me
how about builtin_clzl, i.e., without __? Even though intrinsic functions like __builtin_clzl are not part of the C standard, they are quite established among the popular compilers such as GCC/ICC/clang (with MSVC being the notable exception) so it would be good if we don't have to learn a new concept.
This also LGTM
Why not put these functions in util.h?
Why not put these functions in util.h?
These all LGTM. Should I put them in this patch or draft another one after this is merged?
Why not put these functions in util.h?
Different update cadence/stability.
'Intrinsic' are not typical helper functions but more focused on performance. They rarely need to be updated once written. There are many of them and we could start using more.
So I would like to tug them all in the same place that I can easily skip.
So, from the discussion above, should I extract them in current patch or we can move forward?
So, from the discussion above, should I extract them in current patch or we can move forward?
@zuiderkwast do you have a strong opinion against intrinsic.c?
From my end, it would be good to go if you could move this new function into a new intrinsic.c file.
@zuiderkwast do you have a strong opinion against
intrinsic.c?
No, I don't mind.
So, from the discussion here, I need add a intrinsics.h and put CountingTrailingZeros in intrinsics.h
My remaining questions:
- Like https://github.com/valkey-io/valkey/pull/846#issuecomment-2287786840 , should I put other intrinsics here?
@PingXie I've added the intrinsics.h now
@PingXie Comment resolved, mind take a look again?
@enjoy-binbin @madolson could we move forward and check in this? Or there're more things I need to do?
I think @pingxie will merge it, although there are a bunch of test failures I am looking through. (Seem like some old issues, let me re-trigger them)
There are still some test failures but I don't think they are related.
Thanks all!