Introduce a thread-local storage variable to replace the atomic when updating the used_memory metric, to reduce contention.
Description
This patch introduces a thread-local storage (TLS) variable to replace the atomic counter in zmalloc, reducing unnecessary contention.
Problem Statement
The zmalloc- and zfree-related functions update used_memory on every operation, and they are called very frequently. In the benchmark memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml, the cycle share of zmalloc and zfree is high; they are thin wrappers around the underlying allocator library and should not take so many cycles. Most of those cycles are contributed by lock add and lock sub, which are expensive instructions. From the profiling, the metric updates mainly come from the main thread, so using a TLS variable removes a lot of the contention.
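To illustrate the idea, here is a minimal sketch (not the actual patch; the name used_memory_thread and the 1MB flush threshold are assumptions taken from the later discussion): each thread accumulates a local delta and only applies it to the shared atomic occasionally, instead of issuing a lock add/lock sub per allocation.

```c
#include <stdatomic.h>

static _Atomic long long used_memory = 0;          /* shared counter reported as used_memory */
static __thread long long used_memory_thread = 0;  /* per-thread pending delta, no contention */

#define USED_MEMORY_FLUSH_THRESHOLD (1024 * 1024)  /* apply the delta roughly every 1MB of drift */

static inline void update_used_memory(long long delta) {
    used_memory_thread += delta;
    /* Only touch the contended atomic when the local delta grows large,
     * and clear the local delta after applying it. */
    if (used_memory_thread >= USED_MEMORY_FLUSH_THRESHOLD ||
        used_memory_thread <= -USED_MEMORY_FLUSH_THRESHOLD) {
        atomic_fetch_add_explicit(&used_memory, used_memory_thread, memory_order_relaxed);
        used_memory_thread = 0;
    }
}
```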
Performance Impact
Test Env
- OS: CentOS Stream 8
- Kernel: 6.2.0
- Platform: Intel Xeon Platinum 8380
- Server and Client in same socket
Start Server
taskset -c 0 ~/valkey/src/valkey-server /tmp/valkey_1.conf

Contents of /tmp/valkey_1.conf:
port 9001
bind * -::*
daemonize yes
protected-mode no
save ""
Using the benchmark memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml:
memtier_benchmark -s 127.0.0.1 -p 9001 "--pipeline" "10" "--data-size" "100" --command "XADD __key__ MAXLEN ~ 1 * field __data__" --command-key-pattern="P" --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 50 -t 4 --hide-histogram
We can observe a QPS gain of more than 6%.
For the other benchmarks (SET/GET), using commands like:
taskset -c 6-9 ~/valkey/src/valkey-benchmark -p 9001 -t set,get -d 100 -r 1000000 -n 1000000 -c 50 --threads 4
No performance gain, and no regression either.
With pipelining enabled, I can observe a 4% performance gain with this test case:
taskset -c 4-7 memtier_benchmark -s 127.0.0.1 -p 9001 "--pipeline" "10" "--data-size" "100" --ratio 1:0 --key-pattern P:P --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 50 -t 4 --hide-histogram
What are the server configurations you are using for the test? I didn't see them listed. Can you also just do a simple test with get/set?
Just updated the server config and SET/GET results in the top comment.
@valkey-io/core-team, could you help to take a look at this patch?
Hi @enjoy-binbin, thanks for your comments on this patch. BTW, your blog on yuque helped me a lot in understanding redis/valkey.
I did not take a deep look; how do we handle module threads?
Sorry, I missed the module threads. Maybe I can remove the explicit call of zmalloc_register_thread_index in the start routine, initialize static __thread int thread_index = -1, and then add a check if (thread_index == -1) zmalloc_register_thread_index() in zmalloc/zfree?
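For illustration only, a minimal sketch of that lazy-registration idea (MAX_THREADS_NUM and the body of zmalloc are placeholders, not the actual patch):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_THREADS_NUM 16               /* hypothetical hard-coded limit (see discussion below) */

static _Atomic int total_threads = 0;
static __thread int thread_index = -1;   /* -1 means this thread is not registered yet */

static void zmalloc_register_thread_index(void) {
    thread_index = atomic_fetch_add(&total_threads, 1);
    assert(thread_index < MAX_THREADS_NUM);
}

void *zmalloc(size_t size) {
    /* Lazy registration: module threads never run an explicit start routine,
     * so the first allocation they make registers them here. */
    if (thread_index == -1) zmalloc_register_thread_index();
    /* ... per-thread used_memory accounting would go here ... */
    return malloc(size); /* placeholder for the real allocator call */
}
```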
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 70.19%. Comparing base (b546dd2) to head (aeb8159). Report is 12 commits behind head on unstable.
Additional details and impacted files
@@ Coverage Diff @@
## unstable #308 +/- ##
============================================
- Coverage 70.22% 70.19% -0.03%
============================================
Files 110 110
Lines 60039 60066 +27
============================================
+ Hits 42163 42165 +2
- Misses 17876 17901 +25
| Files | Coverage Δ | |
|---|---|---|
| src/evict.c | 97.34% <100.00%> (ø) | |
| src/zmalloc.c | 85.44% <100.00%> (+0.80%) | :arrow_up: |
@enjoy-binbin @PingXie @madolson Thoughts about this patch?
Kindly ping @valkey-io/core-team.
In the Valkey Contributor Summit, someone (who?) suggested that we remove this counter in zmalloc and instead rely on the stats from the allocator, like je_mallctl("stats.allocated"). I tend to agree with that. zmalloc is in a hot code path and it'd be good to avoid tracking this. The only difference seems to be that we can see how much was allocated using zmalloc vs allocated in other ways (by libs, etc.). Is there any good reason to see the difference between these?
The INFO fields are used_memory (from zmalloc counter) vs allocator_allocated (stats from allocator). I imagine we can simply let used_memory be identical to allocator_allocated.
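For reference, a rough sketch of reading that counter from jemalloc (this assumes the je_-prefixed jemalloc build and is not the final wiring; the "epoch" refresh is needed because jemalloc caches its stats):

```c
#include <stddef.h>
#include <stdint.h>
#include <jemalloc/jemalloc.h>

/* Hypothetical helper: refresh jemalloc's cached stats, then read the total
 * bytes currently allocated by the whole process (not only via zmalloc). */
static size_t read_allocator_allocated(void) {
    uint64_t epoch = 1;
    size_t sz = sizeof(epoch);
    je_mallctl("epoch", &epoch, &sz, &epoch, sz);      /* force a fresh stats snapshot */

    size_t allocated = 0;
    sz = sizeof(allocated);
    je_mallctl("stats.allocated", &allocated, &sz, NULL, 0);
    return allocated;
}
```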
I think there are still 2 comments to be addressed
- the hard-coded max thread number
- zmalloc_used_memory can be even more inaccurate now with the PR. I kind of like @lipzhu's proposal at https://github.com/valkey-io/valkey/pull/308#discussion_r1566641483 but I think there is a bug in the code snippet - shouldn't used_memory_thread be cleared after getting applied to used_memory? In other words, I think used_memory_thread should be used to track/cache the deltas that haven't been applied to the global counter.
je_mallctl("stats.allocated")
@lipzhu you mentioned that je_malloc_usable_size is very expensive. I am wondering where this call would stand?
I imagine we can simply let used_memory be identical to allocator_allocated.
@zuiderkwast I am not clear on the full history of used_memory, but I saw comments like the one below; is used_memory designed to be different from allocator_allocated? And it seems we only have allocator_allocated with jemalloc now.
/* Unlike zmalloc_used_memory, this matches the stats.resident by taking
* into account all allocations done by this process (not only zmalloc). */
je_mallctl("stats.allocated", allocated, &sz, NULL, 0);
But I noticed that the metrics used_memory and allocator_allocated are very close during the benchmark test.
info memory
used_memory:46356571552
allocator_allocated:46356604048
the hard-coded max thread number
@PingXie Yes, I am also a little confused here, per the suggestion in https://github.com/valkey-io/valkey/pull/308#discussion_r1564891132. But how do we handle modules? Is there any way to get the number of module threads to factor into the max thread number?
zmalloc_used_memory can be even more inaccurate now with the PR.
Can you be more specific?
I kind of like @lipzhu's proposal at https://github.com/valkey-io/valkey/pull/308#discussion_r1566641483 but I think there is a bug in the code snippet - shouldn't used_memory_thread be cleared after getting applied to used_memory? In other words, I think used_memory_thread should be used to track/cache the deltas that haven't been applied to the global counter.
I am OK with any proposal if the core team makes the final decision.
@lipzhu you mentioned that je_malloc_usable_size is very expensive. I am wondering where this call would stand?
Each call of zmalloc also calls zmalloc_size to get the allocated size, and I observed je_malloc_usable_size taking about 6% of CPU cycles.
Thanks for putting some focus on this performance bottleneck!
I commented in the issue #467.
I think we can use jemalloc's stats and fall back to counting in zmalloc only if another allocator is used.
Each call of zmalloc also calls zmalloc_size to get the allocated size, and I observed je_malloc_usable_size taking about 6% of CPU cycles.
Sorry, I meant to ask if you had a chance to evaluate the overhead of je_mallctl("stats.allocated").
I think we can use jemalloc's stats and fall back to counting in zmalloc only if another allocator is used.
I think we should get a perf reading of this change first.
Thanks for putting some focus on this performance bottleneck!
+1. Appreciate it, @lipzhu!
@zuiderkwast @PingXie Updated the patch as suggested in https://github.com/valkey-io/valkey/issues/467.
It seems we have a problem on macOS:
zmalloc.c:55:10: fatal error: 'threads.h' file not found
#include <threads.h>
^~~~~~~~~~~
See https://stackoverflow.com/questions/16244153/clang-c11-threads-h-not-found
Maybe we need some conditional compilation like
#if __STDC_NO_THREADS__
#define thread_local __thread
#else
#include <threads.h>
#endif
I have some concerns:
- Now the threshold is 1MB; if this threshold is not reached, used_memory would not be updated. In some special cases, for example lazyfree, bio frees only 900KB of memory and then no new jobs come, so used_memory will be inaccurate forever. I think we need a mechanism to compensate in order to achieve eventual consistency, not only through size thresholds but also time thresholds, such as updating the delta every 1 second.
- We have many threads: main thread, bio, io-threads, module threads. Too many threads may lead to larger errors; as in the situation in point 1, the error is the number of threads times 1MB. If the problem in point 1 can be resolved, then this would not be an issue either.
- also time thresholds, such as updating the delta every 1 second
@soloestoy Isn't checking time in the allocator function too slow? It is a very hot code path.
Some alternative ideas:
- Use 100KB or 10KB instead of 1M to make the error smaller? Most of the allocations are small like less than 1K anyway.
- Add some explicit call to flush this delta to the atomic variable, such as after each job in bio.c and IO threads.
Isn't checking time in the allocator function too slow? It is a very hot code path.
Yes, it would be best if every thread could register a timer to trigger updating the delta every second.
Use 100KB or 10KB instead of 1M to make the error smaller? Most of the allocations are small like less than 1K anyway.
maybe 10KB is an acceptable error.
Add some explicit call to flush this delta to the atomic variable, such as after each job in bio.c and IO threads.
Native threads can work well, but module threads are a problem.
it would be best if every thread can register a timer to trigger updating delta every second.
@soloestoy How to do this per thread in a simple way? Signal handler?
maybe 10KB is an acceptable error.
Let's do this then? I think it can provide much of the benefit already.
Native threads can work well, but module threads are a problem.
Simple solution: we can do an explicit flush to the atomic counter in ValkeyModule_Alloc and ValkeyModule_Free.
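A hedged sketch of what that could look like, building on the thread-local delta sketched earlier (the helper name zmalloc_flush_thread_delta and the wrapper bodies are assumptions for illustration, not the real ValkeyModule_Alloc implementation):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Counters from the earlier sketch: shared atomic plus per-thread pending delta. */
extern _Atomic long long used_memory;
extern __thread long long used_memory_thread;

void *zmalloc(size_t size);
void zfree(void *ptr);

/* Unconditionally apply the calling thread's pending delta to the global counter. */
static void zmalloc_flush_thread_delta(void) {
    if (used_memory_thread != 0) {
        atomic_fetch_add_explicit(&used_memory, used_memory_thread, memory_order_relaxed);
        used_memory_thread = 0;
    }
}

/* Module-facing wrappers flush immediately, so allocations made on module
 * threads never linger in a thread-local delta. */
void *ValkeyModule_Alloc(size_t bytes) {
    void *p = zmalloc(bytes);
    zmalloc_flush_thread_delta();
    return p;
}

void ValkeyModule_Free(void *ptr) {
    zfree(ptr);
    zmalloc_flush_thread_delta();
}
```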
@soloestoy The PR is based on the assumption that we can accept that used_memory is not perfectly accurate; the worst-case gap is, as you mentioned, threads_num * 1MB. You can refer to the discussion between @PingXie and @zuiderkwast in https://github.com/valkey-io/valkey/issues/467 for more details.
If we are not aligned on keeping used_memory exactly accurate, I want to recall another proposal (actually the initial version of this PR): use thread-local storage to maintain each thread's own used_memory and aggregate it into used_memory through zmalloc_used_memory(). It is a bit like the opposite of an rwlock: write in parallel, read exclusively. I created https://github.com/valkey-io/valkey/pull/674 to demonstrate the proposal for the core members to review and compare.
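For comparison, a minimal sketch of that alternative shape (the slot array, padding, and MAX_THREADS_NUM are illustrative assumptions, loosely following the description above rather than the actual code in #674): each thread writes only its own slot, and the rare reader sums all slots.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS_NUM 16   /* hypothetical hard-coded limit, one slot per registered thread */

/* Pad each slot to a cache line so writer threads do not false-share. */
typedef struct {
    _Atomic long long used;
    char pad[64 - sizeof(long long)];
} used_memory_slot;

static used_memory_slot used_memory_slots[MAX_THREADS_NUM];
static __thread int thread_index = 0;   /* assigned when the thread registers */

/* Writers: each thread touches only its own slot, so writes never contend. */
static inline void update_used_memory(long long delta) {
    atomic_fetch_add_explicit(&used_memory_slots[thread_index].used, delta,
                              memory_order_relaxed);
}

/* Reader: sums all slots. Reads are rare (INFO, maxmemory checks) compared to
 * the allocation fast path, hence "write in parallel, read exclusively". */
size_t zmalloc_used_memory(void) {
    long long sum = 0;
    for (int i = 0; i < MAX_THREADS_NUM; i++)
        sum += atomic_load_explicit(&used_memory_slots[i].used, memory_order_relaxed);
    return sum > 0 ? (size_t)sum : 0;
}
```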