Introduce a thread-local storage variable to replace the atomic when updating the used_memory metric, to reduce contention.
Description
This patch introduces a thread-local storage (TLS) variable to replace the atomic counter in zmalloc, reducing unnecessary contention.
Problem Statement
The zmalloc- and zfree-related functions update used_memory on every operation, and they are called very frequently. In the benchmark memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml, the cycle share of zmalloc and zfree is high; they are thin wrappers around the underlying allocator library and should not take so many cycles. Most of those cycles are contributed by lock add and lock sub, which are expensive instructions. From the profiling, the metric updates mainly come from the main thread, so using a TLS variable removes a lot of the contention.
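To illustrate the idea, here is a minimal sketch (not the actual patch; the name used_memory_thread and the 1MB flush threshold are assumptions taken from the later discussion): each thread accumulates a local delta and only applies it to the shared atomic occasionally, instead of issuing a lock add/lock sub per allocation.

```c
#include <stdatomic.h>

static _Atomic long long used_memory = 0;          /* shared counter reported as used_memory */
static __thread long long used_memory_thread = 0;  /* per-thread pending delta, no contention */

#define USED_MEMORY_FLUSH_THRESHOLD (1024 * 1024)  /* apply the delta roughly every 1MB of drift */

static inline void update_used_memory(long long delta) {
    used_memory_thread += delta;
    /* Only touch the contended atomic when the local delta grows large,
     * and clear the local delta after applying it. */
    if (used_memory_thread >= USED_MEMORY_FLUSH_THRESHOLD ||
        used_memory_thread <= -USED_MEMORY_FLUSH_THRESHOLD) {
        atomic_fetch_add_explicit(&used_memory, used_memory_thread, memory_order_relaxed);
        used_memory_thread = 0;
    }
}
```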
Performance Impact
Test Env
- OS: CentOS Stream 8
- Kernel: 6.2.0
- Platform: Intel Xeon Platinum 8380
- Server and Client in same socket
Start Server
taskset -c 0 ~/valkey/src/valkey-server /tmp/valkey_1.conf

Contents of /tmp/valkey_1.conf:
port 9001
bind * -::*
daemonize yes
protected-mode no
save ""
Using the benchmark memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml:
memtier_benchmark -s 127.0.0.1 -p 9001 "--pipeline" "10" "--data-size" "100" --command "XADD __key__ MAXLEN ~ 1 * field __data__" --command-key-pattern="P" --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 50 -t 4 --hide-histogram
We can observe a QPS gain of more than 6%.
For the other benchmarks (SET/GET), using commands like:
taskset -c 6-9 ~/valkey/src/valkey-benchmark -p 9001 -t set,get -d 100 -r 1000000 -n 1000000 -c 50 --threads 4
No performance gain, and no regression either.
With pipelining enabled, I can observe a 4% performance gain with this test case:
taskset -c 4-7 memtier_benchmark -s 127.0.0.1 -p 9001 "--pipeline" "10" "--data-size" "100" --ratio 1:0 --key-pattern P:P --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 50 -t 4 --hide-histogram
What are the server configurations you are using for the test? I didn't see them listed. Can you also just do a simple test with get/set?
Just updated the server config and SET/GET results in the top comment.
@valkey-io/core-team, could you help to take a look at this patch?
Hi @enjoy-binbin, thanks for your comments on this patch. BTW, your blog on yuque helped me a lot in understanding redis/valkey.
I did not take a deep look; how do we handle module threads?
Sorry, I missed the module threads. Maybe I can remove the explicit call of zmalloc_register_thread_index in the start routine, initialize static __thread int thread_index = -1, and then add a check if (thread_index == -1) zmalloc_register_thread_index() in zmalloc/zfree?
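For illustration only, a minimal sketch of that lazy-registration idea (MAX_THREADS_NUM and the body of zmalloc are placeholders, not the actual patch):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_THREADS_NUM 16               /* hypothetical hard-coded limit (see discussion below) */

static _Atomic int total_threads = 0;
static __thread int thread_index = -1;   /* -1 means this thread is not registered yet */

static void zmalloc_register_thread_index(void) {
    thread_index = atomic_fetch_add(&total_threads, 1);
    assert(thread_index < MAX_THREADS_NUM);
}

void *zmalloc(size_t size) {
    /* Lazy registration: module threads never run an explicit start routine,
     * so the first allocation they make registers them here. */
    if (thread_index == -1) zmalloc_register_thread_index();
    /* ... per-thread used_memory accounting would go here ... */
    return malloc(size); /* placeholder for the real allocator call */
}
```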
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 70.19%. Comparing base (b546dd2) to head (aeb8159). Report is 12 commits behind head on unstable.
Additional details and impacted files
@@ Coverage Diff @@
## unstable #308 +/- ##
============================================
- Coverage 70.22% 70.19% -0.03%
============================================
Files 110 110
Lines 60039 60066 +27
============================================
+ Hits 42163 42165 +2
- Misses 17876 17901 +25
| Files | Coverage Δ | |
|---|---|---|
| src/evict.c | 97.34% <100.00%> (ø) | |
| src/zmalloc.c | 85.44% <100.00%> (+0.80%) | :arrow_up: |
@enjoy-binbin @PingXie @madolson Thoughts about this patch?
Kindly ping @valkey-io/core-team.
In the Valkey Contributor Summit, someone (who?) suggested that we remove this counter in zmalloc and instead rely on the stats from the allocator, like je_mallctl("stats.allocated"). I tend to agree with that. zmalloc is in a hot code path and it'd be good to avoid tracking this. The only difference seems to be that we can see how much was allocated using zmalloc vs allocated in other ways (by libs, etc.). Is there any good reason to see the difference between these?
The INFO fields are used_memory (from zmalloc counter) vs allocator_allocated (stats from allocator). I imagine we can simply let used_memory be identical to allocator_allocated.
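For reference, a rough sketch of reading that counter from jemalloc (this assumes the je_-prefixed jemalloc build and is not the final wiring; the "epoch" refresh is needed because jemalloc caches its stats):

```c
#include <stddef.h>
#include <stdint.h>
#include <jemalloc/jemalloc.h>

/* Hypothetical helper: refresh jemalloc's cached stats, then read the total
 * bytes currently allocated by the whole process (not only via zmalloc). */
static size_t read_allocator_allocated(void) {
    uint64_t epoch = 1;
    size_t sz = sizeof(epoch);
    je_mallctl("epoch", &epoch, &sz, &epoch, sz);      /* force a fresh stats snapshot */

    size_t allocated = 0;
    sz = sizeof(allocated);
    je_mallctl("stats.allocated", &allocated, &sz, NULL, 0);
    return allocated;
}
```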
I think there are still 2 comments to be addressed
- the hard-coded max thread number
- zmalloc_used_memory can be even more inaccurate now with the PR. I kind of like @lipzhu's proposal at https://github.com/valkey-io/valkey/pull/308#discussion_r1566641483 but I think there is a bug in the code snippet - shouldn't used_memory_thread be cleared after getting applied to used_memory? In other words, I think used_memory_thread should be used to track/cache the deltas that haven't been applied to the global counter.
je_mallctl("stats.allocated")
@lipzhu you mentioned that je_malloc_usable_size is very expensive. I am wondering where this call would stand?
I imagine we can simply let used_memory be identical to allocator_allocated.
@zuiderkwast I am not clear on the full history of used_memory, but I saw comments like the one below; is used_memory designed to be different from allocator_allocated? And it seems we only have allocator_allocated with jemalloc now.
/* Unlike zmalloc_used_memory, this matches the stats.resident by taking
* into account all allocations done by this process (not only zmalloc). */
je_mallctl("stats.allocated", allocated, &sz, NULL, 0);
But I noticed that the metrics used_memory and allocator_allocated are very close during the benchmark test.
info memory
used_memory:46356571552
allocator_allocated:46356604048
the hard-coded max thread number
@PingXie Yes, I am also a little confused here, per the suggestion in https://github.com/valkey-io/valkey/pull/308#discussion_r1564891132. But how do we handle modules? Is there any way to get the number of module threads to factor into the max thread number?
zmalloc_used_memory can be even more inaccurate now with the PR.
Can you be more specific?
I kind of like @lipzhu's proposal at https://github.com/valkey-io/valkey/pull/308#discussion_r1566641483 but I think there is a bug in the code snippet - shouldn't used_memory_thread be cleared after getting applied to used_memory? In other words, I think used_memory_thread should be used to track/cache the deltas that haven't been applied to the global counter.
I am OK with any proposal if the core team makes the final decision.
@lipzhu you mentioned that je_malloc_usable_size is very expensive. I am wondering where this call would stand?
Each call of zmalloc also calls zmalloc_size to get the allocated size, and I observed je_malloc_usable_size taking about 6% of CPU cycles.
Thanks for putting some focus on this performance bottleneck!
I commented in the issue #467.
I think we can use jemalloc's stats and fall back to counting in zmalloc only if another allocator is used.
Each call of zmalloc also calls zmalloc_size to get the allocated size, and I observed je_malloc_usable_size taking about 6% of CPU cycles.
Sorry, I meant to ask if you had a chance to evaluate the overhead of je_mallctl("stats.allocated").
I think we can use jemalloc's stats and fall back to counting in zmalloc only if another allocator is used.
I think we should get a perf reading of this change first.
Thanks for putting some focus on this performance bottleneck!
+1. Appreciate it, @lipzhu!
@zuiderkwast @PingXie Updated the patch as suggested in https://github.com/valkey-io/valkey/issues/467.
It seems we have a problem on macOS:
zmalloc.c:55:10: fatal error: 'threads.h' file not found
#include <threads.h>
^~~~~~~~~~~
See https://stackoverflow.com/questions/16244153/clang-c11-threads-h-not-found
Maybe we need some conditional compilation like
#if __STDC_NO_THREADS__
#define thread_local __thread
#else
#include <threads.h>
#endif
I have some concerns:
- Now the threshold is 1MB; if this threshold is not reached, used_memory would not be updated. In some special cases, for example lazyfree, bio frees only 900KB of memory and then no new jobs come, so used_memory will be inaccurate forever. I think we need a mechanism to compensate in order to achieve eventual consistency, not only through size thresholds but also time thresholds, such as updating the delta every 1 second.
- We have many threads: main thread, bio, io-threads, module threads. Too many threads may lead to larger errors; as in the situation in point 1, the error is the number of threads times 1MB. If the problem in point 1 can be resolved, then this would not be an issue either.
- also time thresholds, such as updating the delta every 1 second
@soloestoy Isn't checking time in the allocator function too slow? It is a very hot code path.
Some alternative ideas:
- Use 100KB or 10KB instead of 1M to make the error smaller? Most of the allocations are small like less than 1K anyway.
- Add some explicit call to flush this delta to the atomic variable, such as after each job in bio.c and IO threads.
Isn't checking time in the allocator function too slow? It is a very hot code path.
Yes, it would be best if every thread could register a timer to trigger updating the delta every second.
Use 100KB or 10KB instead of 1M to make the error smaller? Most of the allocations are small like less than 1K anyway.
maybe 10KB is an acceptable error.
Add some explicit call to flush this delta to the atomic variable, such as after each job in bio.c and IO threads.
Native threads can work well, but module threads are a problem.
it would be best if every thread can register a timer to trigger updating delta every second.
@soloestoy How to do this per thread in a simple way? Signal handler?
maybe 10KB is an acceptable error.
Let's do this then? I think it can provide much of the benefit already.
Native threads can work well, but module threads are a problem.
Simple solution: we can do an explicit flush to the atomic counter in ValkeyModule_Alloc and ValkeyModule_Free.
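A hedged sketch of what that could look like, building on the thread-local delta sketched earlier (the helper name zmalloc_flush_thread_delta and the wrapper bodies are assumptions for illustration, not the real ValkeyModule_Alloc implementation):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Counters from the earlier sketch: shared atomic plus per-thread pending delta. */
extern _Atomic long long used_memory;
extern __thread long long used_memory_thread;

void *zmalloc(size_t size);
void zfree(void *ptr);

/* Unconditionally apply the calling thread's pending delta to the global counter. */
static void zmalloc_flush_thread_delta(void) {
    if (used_memory_thread != 0) {
        atomic_fetch_add_explicit(&used_memory, used_memory_thread, memory_order_relaxed);
        used_memory_thread = 0;
    }
}

/* Module-facing wrappers flush immediately, so allocations made on module
 * threads never linger in a thread-local delta. */
void *ValkeyModule_Alloc(size_t bytes) {
    void *p = zmalloc(bytes);
    zmalloc_flush_thread_delta();
    return p;
}

void ValkeyModule_Free(void *ptr) {
    zfree(ptr);
    zmalloc_flush_thread_delta();
}
```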
@soloestoy The PR is based on the assumption that we can accept that used_memory is not perfectly accurate; the worst-case gap is, as you mentioned, threads_num * 1MB. You can refer to the discussion between @PingXie and @zuiderkwast in https://github.com/valkey-io/valkey/issues/467 for more details.
If we are not aligned on keeping used_memory exactly accurate, I want to recall another proposal (actually the initial version of this PR): use thread-local storage to maintain each thread's own used_memory and aggregate it into used_memory through zmalloc_used_memory(). It is a bit like the opposite of an rwlock: write in parallel, read exclusively. I created https://github.com/valkey-io/valkey/pull/674 to demonstrate the proposal for the core members to review and compare.
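For comparison, a minimal sketch of that alternative shape (the slot array, padding, and MAX_THREADS_NUM are illustrative assumptions, loosely following the description above rather than the actual code in #674): each thread writes only its own slot, and the rare reader sums all slots.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS_NUM 16   /* hypothetical hard-coded limit, one slot per registered thread */

/* Pad each slot to a cache line so writer threads do not false-share. */
typedef struct {
    _Atomic long long used;
    char pad[64 - sizeof(long long)];
} used_memory_slot;

static used_memory_slot used_memory_slots[MAX_THREADS_NUM];
static __thread int thread_index = 0;   /* assigned when the thread registers */

/* Writers: each thread touches only its own slot, so writes never contend. */
static inline void update_used_memory(long long delta) {
    atomic_fetch_add_explicit(&used_memory_slots[thread_index].used, delta,
                              memory_order_relaxed);
}

/* Reader: sums all slots. Reads are rare (INFO, maxmemory checks) compared to
 * the allocation fast path, hence "write in parallel, read exclusively". */
size_t zmalloc_used_memory(void) {
    long long sum = 0;
    for (int i = 0; i < MAX_THREADS_NUM; i++)
        sum += atomic_load_explicit(&used_memory_slots[i].used, memory_order_relaxed);
    return sum > 0 ? (size_t)sum : 0;
}
```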