Add support for compiling with mimalloc
Related Issue: #346
Use mimalloc as an optional allocator when building, selected with `make MALLOC=mimalloc` or `make USE_MIMALLOC=yes`.
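For example, a minimal sketch of the two invocations; the `make distclean` step is the usual way to clear a previously cached allocator choice before switching:

```sh
# Clear any cached allocator selection from a previous build, then build
# with mimalloc using either of the equivalent switches described above.
make distclean
make MALLOC=mimalloc
# or
make USE_MIMALLOC=yes
```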
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 68.44%. Comparing base (443d80f) to head (e391645).
Additional details and impacted files
Coverage diff (base `unstable` vs. head #363):

| | unstable (base) | #363 (head) | +/- |
|---|---|---|---|
| Coverage | 68.44% | 68.44% | -0.01% |
| Files | 109 | 109 | |
| Lines | 61671 | 61672 | +1 |
| Hits | 42212 | 42209 | -3 |
| Misses | 19459 | 19463 | +4 |

| Files | Coverage Δ | |
|---|---|---|
| src/server.c | 88.13% <100.00%> (+<0.01%) | :arrow_up: |
| src/zmalloc.c | 83.46% <ø> (ø) | |
Thanks for the patch @WM0323!
Can you consider the alternative that doesn't require vendoring mimalloc? In general, we would like to de-vendor the dependencies going forward. You can find the devendoring discussion at #15.
Thanks for the feedback and for pointing me to the devendoring discussion. I will adjust the build scripts to link against a system-wide installation and manage mimalloc externally.
Add support to optionally use system libraries instead of the vendored mimalloc.
I think we should also run some benchmarks to compare memory usage, fragmentation, and throughput between mimalloc and jemalloc. That will help the community understand when to use mimalloc over jemalloc for Valkey.
I see that this PR allows the user to choose between vendoring and not. Any reason to NOT make the lib the only option?
I agree with making the system lib the only option for devendoring :). I will change it to keep only the system lib option and also run some benchmarks to compare memory usage/fragmentation/throughput with jemalloc.
Performance Test: mimalloc vs. jemalloc
Install the library and header files of mimalloc:
git clone https://github.com/microsoft/mimalloc.git
cd mimalloc
mkdir -p out/release
cd out/release
cmake ../.. -DMI_INSTALL_TOPLEVEL=ON
make
sudo make install
Build Valkey:
make MALLOC=mimalloc
or
make USE_MIMALLOC=yes
Start the cluster:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/
./create-cluster start
./create-cluster create
valkey-benchmark command used:
./valkey-benchmark -t set -n 10000000 -c 20 -p 30003
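The memory figures quoted in the table below were read from `INFO memory` on the benchmarked node (full dumps appear in the logs further down); for example, pulling just the headline fields:

```sh
# Snapshot the allocator name, human-readable usage, and fragmentation
# ratio after a benchmark run (same node as the benchmark, port 30003).
./valkey-cli -p 30003 INFO memory | \
    grep -E 'mem_allocator|used_memory_human|mem_fragmentation_ratio'
```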
Benchmark Results:
| Metric | Jemalloc | mimalloc | Change |
|---|---|---|---|
| Requests per second | 70,063.34 | 78,664.59 | +12.28% |
| Average Latency (ms) | 0.185 | 0.176 | -4.86% |
| Used Memory (Human Readable) | 3.50M | 3.52M | +0.57% |
| Memory Fragmentation Ratio | 2.82 | 2.33 | -17.38% |
| Memory Allocator Version | jemalloc-5.3.0 | mimalloc-1.8.5 | |
Jemalloc logs:
paas@dsde05:~/sher/valkey/src$ ./valkey-benchmark -t set -n 10000000 -c 20 -p 30003
====== SET ======
10000000 requests completed in 142.73 seconds
20 parallel clients
3 bytes payload
keep alive: 1
host configuration "save": 3600 1 300 100 60 10000
host configuration "appendonly": yes
multi-thread: no
Latency by percentile distribution: 0.000% <= 0.079 milliseconds (cumulative count 2) 50.000% <= 0.183 milliseconds (cumulative count 5405076) 75.000% <= 0.207 milliseconds (cumulative count 7644872) 87.500% <= 0.231 milliseconds (cumulative count 8934659) 93.750% <= 0.239 milliseconds (cumulative count 9442491) 96.875% <= 0.247 milliseconds (cumulative count 9734950) 98.438% <= 0.271 milliseconds (cumulative count 9855272) 99.219% <= 0.383 milliseconds (cumulative count 9922465) 99.609% <= 0.479 milliseconds (cumulative count 9963908) 99.805% <= 0.535 milliseconds (cumulative count 9981729) 99.902% <= 0.599 milliseconds (cumulative count 9990600) 99.951% <= 0.687 milliseconds (cumulative count 9995294) 99.976% <= 0.775 milliseconds (cumulative count 9997677) 99.988% <= 0.903 milliseconds (cumulative count 9998792) 99.994% <= 7.671 milliseconds (cumulative count 9999393) 99.997% <= 7.911 milliseconds (cumulative count 9999704) 99.998% <= 8.159 milliseconds (cumulative count 9999849) 99.999% <= 9.895 milliseconds (cumulative count 9999927) 100.000% <= 10.007 milliseconds (cumulative count 9999964) 100.000% <= 10.071 milliseconds (cumulative count 9999982) 100.000% <= 10.135 milliseconds (cumulative count 9999991) 100.000% <= 10.191 milliseconds (cumulative count 9999996) 100.000% <= 10.231 milliseconds (cumulative count 9999998) 100.000% <= 10.255 milliseconds (cumulative count 9999999) 100.000% <= 10.279 milliseconds (cumulative count 10000000) 100.000% <= 10.279 milliseconds (cumulative count 10000000)
Cumulative distribution of latencies: 0.059% <= 0.103 milliseconds (cumulative count 5931) 76.449% <= 0.207 milliseconds (cumulative count 7644872) 98.828% <= 0.303 milliseconds (cumulative count 9882762) 99.324% <= 0.407 milliseconds (cumulative count 9932391) 99.729% <= 0.503 milliseconds (cumulative count 9972912) 99.912% <= 0.607 milliseconds (cumulative count 9991180) 99.958% <= 0.703 milliseconds (cumulative count 9995837) 99.981% <= 0.807 milliseconds (cumulative count 9998127) 99.988% <= 0.903 milliseconds (cumulative count 9998792) 99.990% <= 1.007 milliseconds (cumulative count 9998985) 99.990% <= 1.103 milliseconds (cumulative count 9999047) 99.991% <= 1.207 milliseconds (cumulative count 9999090) 99.991% <= 1.303 milliseconds (cumulative count 9999114) 99.991% <= 1.407 milliseconds (cumulative count 9999133) 99.992% <= 1.503 milliseconds (cumulative count 9999151) 99.992% <= 1.607 milliseconds (cumulative count 9999172) 99.992% <= 1.703 milliseconds (cumulative count 9999186) 99.992% <= 1.807 milliseconds (cumulative count 9999203) 99.992% <= 1.903 milliseconds (cumulative count 9999215) 99.992% <= 2.007 milliseconds (cumulative count 9999229) 99.992% <= 2.103 milliseconds (cumulative count 9999242) 99.993% <= 3.103 milliseconds (cumulative count 9999320) 99.998% <= 8.103 milliseconds (cumulative count 9999837) 99.999% <= 9.103 milliseconds (cumulative count 9999880) 100.000% <= 10.103 milliseconds (cumulative count 9999989) 100.000% <= 11.103 milliseconds (cumulative count 10000000)
Summary:
throughput summary: 70063.34 requests per second
latency summary (msec): avg 0.185, min 0.072, p50 0.183, p95 0.247, p99 0.335, max 10.279
paas@dsde05:~/sher/valkey/src$ ./valkey-cli -p 30003 127.0.0.1:30003> info memory #Memory used_memory:3669720 used_memory_human:3.50M used_memory_rss:10362880 used_memory_rss_human:9.88M used_memory_peak:4431264 used_memory_peak_human:4.23M used_memory_peak_perc:82.81% used_memory_overhead:3595952 used_memory_startup:2232208 used_memory_dataset:73768 used_memory_dataset_perc:5.13% allocator_allocated:4016352 allocator_active:4325376 allocator_resident:15462400 allocator_muzzy:0 total_system_memory:405353803776 total_system_memory_human:377.52G used_memory_lua:31744 used_memory_vm_eval:31744 used_memory_lua_human:31.00K used_memory_scripts_eval:0 number_of_cached_scripts:0 number_of_functions:0 number_of_libraries:0 used_memory_vm_functions:33792 used_memory_vm_total:65536 used_memory_vm_total_human:64.00K used_memory_functions:184 used_memory_scripts:184 used_memory_scripts_human:184B maxmemory:0 maxmemory_human:0B maxmemory_policy:noeviction allocator_frag_ratio:1.08 allocator_frag_bytes:309024 allocator_rss_ratio:3.57 allocator_rss_bytes:11137024 rss_overhead_ratio:0.67 rss_overhead_bytes:-5099520 mem_fragmentation_ratio:2.82 mem_fragmentation_bytes:6693360 mem_not_counted_for_evict:15008 mem_replication_backlog:1048592 mem_total_replication_buffers:1066208 mem_clients_slaves:17632 mem_clients_normal:22400 mem_cluster_links:10720 mem_aof_buffer:1536 mem_allocator: jemalloc-5.3.0 mem_overhead_db_hashtable_rehashing:0 active_defrag_running:0 lazyfree_pending_objects:0 lazyfreed_objects:0 127.0.0.1:30003>
mimalloc logs:
paas@dsde05:~/sher/valkey/src$ ./valkey-benchmark -t set -n 10000000 -c 20 -p 30003
====== SET ======
10000000 requests completed in 127.12 seconds
20 parallel clients
3 bytes payload
keep alive: 1
host configuration "save": 3600 1 300 100 60 10000
host configuration "appendonly": yes
multi-thread: no
Latency by percentile distribution: 0.000% <= 0.047 milliseconds (cumulative count 1) 50.000% <= 0.175 milliseconds (cumulative count 5194078) 75.000% <= 0.207 milliseconds (cumulative count 7538394) 87.500% <= 0.231 milliseconds (cumulative count 9072429) 93.750% <= 0.239 milliseconds (cumulative count 9411489) 96.875% <= 0.255 milliseconds (cumulative count 9712046) 98.438% <= 0.271 milliseconds (cumulative count 9856645) 99.219% <= 0.351 milliseconds (cumulative count 9922487) 99.609% <= 0.455 milliseconds (cumulative count 9961749) 99.805% <= 0.519 milliseconds (cumulative count 9981356) 99.902% <= 0.575 milliseconds (cumulative count 9990483) 99.951% <= 0.647 milliseconds (cumulative count 9995220) 99.976% <= 0.735 milliseconds (cumulative count 9997581) 99.988% <= 0.847 milliseconds (cumulative count 9998830) 99.994% <= 1.055 milliseconds (cumulative count 9999398) 99.997% <= 7.727 milliseconds (cumulative count 9999697) 99.998% <= 7.919 milliseconds (cumulative count 9999848) 99.999% <= 9.487 milliseconds (cumulative count 9999925) 100.000% <= 9.599 milliseconds (cumulative count 9999962) 100.000% <= 17.551 milliseconds (cumulative count 9999981) 100.000% <= 17.647 milliseconds (cumulative count 9999991) 100.000% <= 17.711 milliseconds (cumulative count 9999996) 100.000% <= 17.759 milliseconds (cumulative count 9999998) 100.000% <= 17.775 milliseconds (cumulative count 9999999) 100.000% <= 17.791 milliseconds (cumulative count 10000000) 100.000% <= 17.791 milliseconds (cumulative count 10000000)
Cumulative distribution of latencies: 2.994% <= 0.103 milliseconds (cumulative count 299447) 75.384% <= 0.207 milliseconds (cumulative count 7538394) 99.028% <= 0.303 milliseconds (cumulative count 9902814) 99.436% <= 0.407 milliseconds (cumulative count 9943610) 99.774% <= 0.503 milliseconds (cumulative count 9977426) 99.932% <= 0.607 milliseconds (cumulative count 9993172) 99.969% <= 0.703 milliseconds (cumulative count 9996941) 99.985% <= 0.807 milliseconds (cumulative count 9998491) 99.991% <= 0.903 milliseconds (cumulative count 9999072) 99.993% <= 1.007 milliseconds (cumulative count 9999311) 99.994% <= 1.103 milliseconds (cumulative count 9999441) 99.995% <= 1.207 milliseconds (cumulative count 9999478) 99.995% <= 1.303 milliseconds (cumulative count 9999512) 99.995% <= 1.407 milliseconds (cumulative count 9999546) 99.996% <= 1.503 milliseconds (cumulative count 9999577) 99.996% <= 1.607 milliseconds (cumulative count 9999595) 99.996% <= 1.703 milliseconds (cumulative count 9999616) 99.996% <= 1.807 milliseconds (cumulative count 9999631) 99.996% <= 1.903 milliseconds (cumulative count 9999639) 99.996% <= 2.007 milliseconds (cumulative count 9999640) 99.999% <= 8.103 milliseconds (cumulative count 9999880) 100.000% <= 10.103 milliseconds (cumulative count 9999980) 100.000% <= 18.111 milliseconds (cumulative count 10000000)
Summary:
throughput summary: 78664.59 requests per second
latency summary (msec): avg 0.176, min 0.040, p50 0.175, p95 0.247, p99 0.303, max 17.791
paas@dsde05:~/sher/valkey/src$ ./valkey-cli -p 30003 127.0.0.1:30003> info memory #Memory used_memory:3686128 used_memory_human:3.52M used_memory_rss:8605696 used_memory_rss_human:8.21M used_memory_peak:4449360 used_memory_peak_human:4.24M used_memory_peak_perc:82.85% used_memory_overhead:3611528 used_memory_startup:2248664 used_memory_dataset:74600 used_memory_dataset_perc:5.19% allocator_allocated:3685904 allocator_active:8573952 allocator_resident:8573952 allocator_muzzy:0 total_system_memory:405353803776 total_system_memory_human:377.52G used_memory_lua:31744 used_memory_vm_eval:31744 used_memory_lua_human:31.00K used_memory_scripts_eval:0 number_of_cached_scripts:0 number_of_functions:0 number_of_libraries:0 used_memory_vm_functions:33792 used_memory_vm_total:65536 used_memory_vm_total_human:64.00K used_memory_functions:200 used_memory_scripts:200 used_memory_scripts_human:200B maxmemory:0 maxmemory_human:0B maxmemory_policy:noeviction allocator_frag_ratio:1.00 allocator_frag_bytes:0 allocator_rss_ratio:1.00 allocator_rss_bytes:0 rss_overhead_ratio:1.00 rss_overhead_bytes:31744 mem_fragmentation_ratio:2.33 mem_fragmentation_bytes:4919792 mem_not_counted_for_evict:14112 mem_replication_backlog:1048592 mem_total_replication_buffers:1066208 mem_clients_slaves:17632 mem_clients_normal:22400 mem_cluster_links:10720 mem_aof_buffer:640 mem_allocator:mimalloc-1.8.5 mem_overhead_db_hashtable_rehashing:0 active_defrag_running:0 lazyfree_pending_objects:0 lazyfreed_objects:0 127.0.0.1:30003>
- What happens with fragmentation on the mimalloc allocator?
Mimalloc efficiently manages fragmentation through strategies like segment reuse and object migration. However, unlike jemalloc, it does not support manual defragmentation commands.
- Should we introduce a `defrag_supported` field to the `INFO` command? It would be handy for users to figure out whether the server supports it or not. Currently, I think setting `CONFIG SET ACTIVEDEFRAG YES` would have no effect on the server.
Sure, it would clearly indicate whether active defragmentation is supported, preventing confusion. I will add this to the `INFO` command.
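A hypothetical sketch of how such a field might be consumed once added; the `defrag_supported` name and its placement in the memory section are assumptions taken from this thread, not an existing interface:

```sh
# Hypothetical: defrag_supported does not exist yet; the field name and
# section are assumptions based on the discussion above.
./valkey-cli -p 30003 INFO memory | grep defrag_supported
# Expected to report 1 on a defrag-capable (jemalloc) build and 0 on a
# mimalloc or libc build.
```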
Came across this discussion on mimalloc: https://github.com/microsoft/mimalloc/issues/632. Looks like the Dragonfly folks have made some changes by vendoring mimalloc and improving the utilization. @PingXie What are your thoughts?
This is essentially the same argument for vendoring jemalloc (#364) in the first place.
I think the key question is whether we believe this is going to be a common enough case that the value of vendoring outweighs its overhead. I am sure fragmentation can be an issue for certain workloads, but are we looking at 0.01%, 0.1%, 1%, or 10%? Also, how bad would a more generic defrag solution be? Does there exist a generic solution that works for all allocators and gives us, say, 90% of the "effectiveness", or 80%, 70%, etc.?
Vendoring itself is a concern to me, and adding another dependency to the already long list of vendored dependencies is a further concern. Our default stance should be "no vendoring". Exceptions are inevitable, but there needs to be a very compelling argument IMO.
@WM0323, I'm curious to get your thoughts on what may be some of the reasons that the Rust community reached the opposite conclusion: jemalloc is faster and consumes less memory than mimalloc. See discussion.
To clarify, I'm not suggesting that this change shouldn't be accepted. Maybe there is something we can learn here about conducting such tests.
Some thoughts around improving the benchmark signal:
- Consider incorporating SET commands with varying payload sizes.
- The benchmark should exercise `free` more often. Consider issuing a mix of DEL and SET commands (a runnable sketch follows this list).
- Consider looking at the real time spent in the allocators. Overall benchmark numbers may be less consistent, as they are subject to host noise. This is also aligned with the way the Rust community did their evaluation.
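A hedged sketch of such a mixed workload, assuming a single non-cluster node listening on port 30003 (a cluster run would additionally need `--cluster` and hash-tagged keys); the payload sizes, key counts, and the DEL loop are illustrative, not part of the patch:

```sh
# Alternate SET rounds with varying payload sizes and bulk deletes so the
# allocator's free/reuse path is exercised, then snapshot fragmentation.
for size in 16 128 1024 4096; do
    # SETs against a random keyspace with the given value size.
    ./valkey-benchmark -p 30003 -c 20 -n 1000000 -r 100000 -d "$size" -t set
    # Delete the keys written by the benchmark (they match key:*).
    ./valkey-cli -p 30003 --scan --pattern 'key:*' | \
        xargs -r -n 500 ./valkey-cli -p 30003 DEL > /dev/null
    # Record fragmentation after each round.
    ./valkey-cli -p 30003 INFO memory | \
        grep -E 'allocator_frag_ratio|mem_fragmentation_ratio'
done
```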
Thanks for your input @yairgott. This change just lets users opt into mimalloc; it doesn't make it the default, so I think we are fine supporting it.
I agree with the benchmark request; the workload described above would be closer to a real-world scenario and would show how the allocator handles fragmentation.