Memory Access Amortization
This PR is the last of three pull requests intended to achieve the goal of processing one million requests per second.
1st PR: https://github.com/valkey-io/valkey/pull/758
2nd PR: https://github.com/valkey-io/valkey/pull/763
With these 3 PRs, we can now handle up to 1.2 million requests per second, up from 200K without IO-threads and ~380K with the current IO threads implementation.
This PR utilizes the IO threads to execute commands in batches, allowing us to prefetch the dictionary data in advance.
After making the IO threads asynchronous and offloading more work to them in the first two PRs, the lookupKey function became the main bottleneck, consuming about 50% of the main-thread time (tested with the SET command). This is because the Valkey dictionary is a straightforward but inefficient chained hash implementation. While traversing the hash linked lists, every access to a dictEntry structure, a key pointer, or a value object is, with high probability, an expensive external memory access.
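To make the cost concrete, here is a minimal sketch of a chained-hash lookup (simplified names, not Valkey's actual dict code): every step is a dependent pointer dereference, and with a large, randomly accessed keyspace each one is likely to miss the CPU caches and stall on main memory.

```c
/* A minimal sketch (simplified names, not Valkey's actual dict code) of a
 * chained-hash lookup. Every dereference depends on the previous one, and with
 * a large, randomly accessed keyspace each is likely to miss the CPU caches. */
typedef struct entry {
    void *key;          /* compared on every chain hop: one more memory access */
    void *value;        /* touched by the command right after the lookup       */
    struct entry *next; /* one more dependent access per collision             */
} entry;

static entry *chained_find(entry **buckets, unsigned long mask, unsigned long hash,
                           const void *key,
                           int (*keycmp)(const void *a, const void *b)) {
    entry *he = buckets[hash & mask];   /* access #1: the bucket slot          */
    while (he) {
        if (keycmp(he->key, key) == 0)  /* access #2: the key object           */
            return he;
        he = he->next;                  /* access #3..n: chain traversal       */
    }
    return NULL;
}
```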
Memory Access Amortization
(Designed and implemented by Dan Touitou)
Memory Access Amortization (MAA) is a technique designed to optimize the performance of dynamic data structures by reducing the impact of memory access latency. It is applicable when multiple operations need to be executed concurrently. The principle behind it is that for certain dynamic data structures, executing operations in a batch is more efficient than executing each one separately.
Rather than executing operations sequentially, this approach interleaves the execution of all operations. This is done in such a way that whenever a memory access is required during an operation, the program prefetches the necessary memory and transitions to another operation. This ensures that when one operation is blocked awaiting memory access, other memory accesses are executed in parallel, thereby reducing the average access latency.
We applied this method in the development of dictPrefetch, which takes as parameters a vector of keys and dictionaries. It ensures that all memory addresses required to execute dictionary operations for these keys are loaded into the L1-L3 caches when executing commands. Essentially, dictPrefetch is an interleaved execution of dictFind for all the keys.
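As a rough illustration of the interleaving described above (not the actual dictPrefetch implementation), the sketch below reuses the simplified `entry` struct from the earlier snippet and GCC/Clang's `__builtin_prefetch`. Each key advances through a small state machine: we issue a prefetch for the next address its lookup will need and immediately move on to another key, so several outstanding cache misses are serviced in parallel. The real code also walks collision chains and prefetches value objects, which this sketch omits.

```c
/* A rough sketch of the interleaved lookup, not the real dictPrefetch. It
 * reuses the simplified `entry` struct from the sketch above. The caller fills
 * in each key's hash and sets step to PF_BUCKET before calling. */
#include <stddef.h>

typedef enum { PF_BUCKET, PF_ENTRY, PF_KEY, PF_DONE } pf_step;

typedef struct {
    unsigned long hash; /* precomputed hash of the key                       */
    entry *current;     /* first entry of the bucket, once the slot is warm  */
    pf_step step;
} pf_state;

static void batched_prefetch(entry **buckets, unsigned long mask,
                             pf_state *st, size_t nkeys) {
    size_t remaining = nkeys;
    while (remaining) {
        /* Round-robin over the keys: issue one prefetch per key, then move on,
         * so several cache misses are in flight at the same time. */
        for (size_t i = 0; i < nkeys; i++) {
            switch (st[i].step) {
            case PF_BUCKET: /* warm the bucket slot itself */
                __builtin_prefetch(&buckets[st[i].hash & mask]);
                st[i].step = PF_ENTRY;
                break;
            case PF_ENTRY: /* read the (now cached) slot and warm the entry */
                st[i].current = buckets[st[i].hash & mask];
                if (st[i].current) __builtin_prefetch(st[i].current);
                st[i].step = PF_KEY;
                break;
            case PF_KEY: /* warm the key object so the compare won't stall */
                if (st[i].current) __builtin_prefetch(st[i].current->key);
                st[i].step = PF_DONE;
                remaining--;
                break;
            case PF_DONE:
                break;
            }
        }
    }
}
```

After this returns, the regular per-key lookups (the dictFind-style traversal shown earlier) run against addresses that are already in the L1-L3 caches, which is the effect dictPrefetch aims for.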
Implementation details
When the main thread iterates over clients-pending-io-read, for clients with ready-to-execute commands (i.e., clients for which the IO thread has finished parsing the command), a batch of up to 16 commands is created. First, each command's argv, which was allocated by the IO thread, is prefetched into the main thread's L1 cache. Next, all the dict entries and values required by the commands are prefetched from the dictionary. Only then are the commands executed.
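For illustration only, a simplified outline of that flow could look like the following. The function itself and helpers such as commandFullyParsed, prefetchCommandArgs, prefetchCommandsKeys and executeParsedCommand are hypothetical placeholders, not the exact symbols added by this PR; only the adlist iteration helpers (listRewind/listNext/listNodeValue) are real Valkey APIs.

```c
#include "server.h" /* client, list, listIter, adlist helpers (assumed) */

#define MAX_COMMAND_BATCH 16

/* Hypothetical helpers, sketched here only as prototypes. */
int commandFullyParsed(client *c);
void prefetchCommandArgs(client *c);
void prefetchCommandsKeys(client **clients, int n);
void executeParsedCommand(client *c);

static void runBatch(client **batch, int n) {
    prefetchCommandsKeys(batch, n);     /* interleaved prefetch of dict entries/values */
    for (int j = 0; j < n; j++)         /* execute only once the data is cached        */
        executeParsedCommand(batch[j]);
}

void processPendingReadsSketch(list *clients_pending_io_read) {
    client *batch[MAX_COMMAND_BATCH];
    int batched = 0;
    listIter li;
    listNode *ln;

    listRewind(clients_pending_io_read, &li);
    while ((ln = listNext(&li)) != NULL) {
        client *c = listNodeValue(ln);
        /* Skip clients whose command the IO thread has not finished parsing. */
        if (!commandFullyParsed(c)) continue;

        /* Warm the argv allocated by the IO thread into the main thread's L1. */
        prefetchCommandArgs(c);
        batch[batched++] = c;

        if (batched == MAX_COMMAND_BATCH) {
            runBatch(batch, batched);
            batched = 0;
        }
    }
    if (batched) runBatch(batch, batched); /* flush the final partial batch */
}
```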
Codecov Report
Attention: Patch coverage is 4.46927% with 171 lines in your changes missing coverage. Please review.
Project coverage is 70.27%. Comparing base (1d18842) to head (ebbbed1). Report is 52 commits behind head on unstable.
| Files | Patch % | Lines |
|---|---|---|
| src/memory_prefetch.c | 2.45% | 159 Missing :warning: |
| src/networking.c | 15.38% | 11 Missing :warning: |
| src/io_threads.c | 0.00% | 1 Missing :warning: |
Additional details and impacted files
```diff
@@             Coverage Diff              @@
##           unstable     #861      +/-   ##
============================================
- Coverage     70.35%   70.27%    -0.09%
============================================
  Files           112      113        +1
  Lines         61467    61711      +244
============================================
+ Hits          43245    43365      +120
- Misses        18222    18346      +124
```
| Files | Coverage Δ | |
|---|---|---|
| src/config.c | 78.69% <ø> (ø) | |
| src/dict.c | 97.54% <100.00%> (+0.25%) | :arrow_up: |
| src/kvstore.c | 96.17% <100.00%> (ø) | |
| src/server.c | 88.57% <ø> (+<0.01%) | :arrow_up: |
| src/server.h | 100.00% <ø> (ø) | |
| src/io_threads.c | 7.87% <0.00%> (-0.04%) | :arrow_down: |
| src/networking.c | 88.42% <15.38%> (-0.30%) | :arrow_down: |
| src/memory_prefetch.c | 2.45% <2.45%> (ø) | |
@lipzhu Would be great if you could take a look as well.
Profiling the cache hits/misses with and without prefetch (stats below), the performance boost is impressive. @uriyage Great work.
BTW, the performance improvement is based on the situation where memory latency is the bottleneck of the main thread. Have you evaluated whether, when memory latency is not the bottleneck, the additional prefetch instructions could cause a regression?
```
# w/ prefetch
Performance counter stats for process id '1841714':

   128,348,636,929      mem_load_retired.l1_hit     (66.64%)
     1,917,290,678      mem_load_retired.l1_miss    (66.65%)
       736,226,936      mem_load_retired.l2_hit     (66.67%)
     1,180,484,127      mem_load_retired.l2_miss    (66.71%)
       419,957,504      mem_load_retired.l3_hit     (66.69%)
         8,607,452      mem_load_retired.l3_miss    (66.65%)

      10.001272139 seconds time elapsed

# w/o prefetch
Performance counter stats for process id '1842658':

   118,738,848,689      mem_load_retired.l1_hit     (66.64%)
     1,641,809,909      mem_load_retired.l1_miss    (66.64%)
       592,192,759      mem_load_retired.l2_hit     (66.67%)
     1,049,283,088      mem_load_retired.l2_miss    (66.71%)
       373,710,273      mem_load_retired.l3_hit     (66.69%)
        31,032,840      mem_load_retired.l3_miss    (66.65%)

      10.001325784 seconds time elapsed
```
> BTW, the performance improvement is based on the situation where memory latency is the bottleneck of the main thread. Have you evaluated whether, when memory latency is not the bottleneck, the additional prefetch instructions could cause a regression?
@lipzhu I tested the same code with one key in the database, meaning we accessed the same key repeatedly. In this case, memory latency is not the bottleneck, and as expected prefetching didn't help. However, no regression was observed.
> Making it configurable would also allow a value of zero to be "no prefetching" which also gives users an operational knob if it's not working for any reason.
+1. To keep the code structure cleaner, maybe we can extract the prefetch-related code into a new file and apply the prefetch optimization like a rule.
LGTM outside of some existing comments and the clang format thing. Did we decide to make the batch size configurable and/or self-tuning? Making it configurable would also allow a value of zero to be "no prefetching" which also gives users an operational knob if it's not working for any reason.
Also, please open the doc PR as we discussed, then we can add it to the Valkey 8 board so it doesn't get dropped and this PR is unblocked to get merged.
@madolson @lipzhu I agree, I will add a configuration for it. However, it will require some changes since the batch is currently static. I also agree that we can refactor the code into a new file. I believe I'll be able to publish the code by tomorrow. @lipzhu I'm not sure I understand what you mean by 'apply like a rule'.
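As a sketch of what such a knob could look like in valkey.conf, with 0 meaning "no prefetching"; the option name below is an assumption for illustration, not something taken from this PR:

```
# Hypothetical config name, shown for illustration only.
# Maximum number of commands to batch for memory prefetching; 0 disables it.
prefetch-batch-max-size 16
```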
> @lipzhu I'm not sure I understand what you mean by 'apply like a rule'.
Sorry for the misunderstanding. I mean the rule-based optimizer in a DBMS, inspired by Spark SQL's spark.sql.adaptive.optimizer.excludedRules, which gives users an option to apply or exclude the prefetch rule based on their real environment.
Ref:
https://spark.apache.org/docs/latest/sql-performance-tuning.html