Add support for compiling with mimalloc
Related Issue: #346
Use mimalloc as an optional allocator when building, selected with `make MALLOC=mimalloc` or `make USE_MIMALLOC=yes`.
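For example, a minimal sketch of the two invocations; the `make distclean` step is the usual way to clear a previously cached allocator choice before switching:

```sh
# Clear any cached allocator selection from a previous build, then build
# with mimalloc using either of the equivalent switches described above.
make distclean
make MALLOC=mimalloc
# or
make USE_MIMALLOC=yes
```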
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 68.44%. Comparing base (443d80f) to head (e391645).
Additional details and impacted files
Coverage diff (base `unstable` vs. head #363):

| | unstable (base) | #363 (head) | +/- |
|---|---|---|---|
| Coverage | 68.44% | 68.44% | -0.01% |
| Files | 109 | 109 | |
| Lines | 61671 | 61672 | +1 |
| Hits | 42212 | 42209 | -3 |
| Misses | 19459 | 19463 | +4 |

| Files | Coverage Δ | |
|---|---|---|
| src/server.c | 88.13% <100.00%> (+<0.01%) | :arrow_up: |
| src/zmalloc.c | 83.46% <ø> (ø) | |
Thanks for the patch @WM0323!
Can you consider the alternative that doesn't require vendoring mimalloc? In general, we would like to de-vendor the dependencies going forward. You can find the devendoring discussion at #15.
Thanks for the feedback and for pointing me to the devendoring discussion. I will adjust the build scripts to link against a system-wide installation and manage mimalloc externally.
Add support to optionally use system libraries instead of the vendored mimalloc.
I think we should also run some benchmarks to compare memory usage, fragmentation, and throughput between mimalloc and jemalloc. That will help the community understand when to use mimalloc over jemalloc for Valkey.
I see that this PR allows the user to choose between vendoring and not. Any reason to NOT make the lib the only option?
I agree with making the system lib the only option for devendoring :). I will change it to keep only the system lib option and also run some benchmarks to compare memory usage/fragmentation/throughput with jemalloc.
Performance Test: mimalloc vs. jemalloc
Install the library and header files of mimalloc:
git clone https://github.com/microsoft/mimalloc.git
cd mimalloc
mkdir -p out/release
cd out/release
cmake ../.. -DMI_INSTALL_TOPLEVEL=ON
make
sudo make install
Build Valkey:
make MALLOC=mimalloc
or
make USE_MIMALLOC=yes
Start the cluster:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/
./create-cluster start
./create-cluster create
valkey-benchmark command used:
./valkey-benchmark -t set -n 10000000 -c 20 -p 30003
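The memory figures quoted in the table below were read from `INFO memory` on the benchmarked node (full dumps appear in the logs further down); for example, pulling just the headline fields:

```sh
# Snapshot the allocator name, human-readable usage, and fragmentation
# ratio after a benchmark run (same node as the benchmark, port 30003).
./valkey-cli -p 30003 INFO memory | \
    grep -E 'mem_allocator|used_memory_human|mem_fragmentation_ratio'
```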
Benchmark Results:
| Metric | Jemalloc | mimalloc | Change |
|---|---|---|---|
| Requests per second | 70,063.34 | 78,664.59 | +12.28% |
| Average Latency (ms) | 0.185 | 0.176 | -4.86% |
| Used Memory (Human Readable) | 3.50M | 3.52M | +0.57% |
| Memory Fragmentation Ratio | 2.82 | 2.33 | -17.38% |
| Memory Allocator Version | jemalloc-5.3.0 | mimalloc-1.8.5 | |
Jemalloc logs:
paas@dsde05:~/sher/valkey/src$ ./valkey-benchmark -t set -n 10000000 -c 20 -p 30003
====== SET ======
10000000 requests completed in 142.73 seconds
20 parallel clients
3 bytes payload
keep alive: 1
host configuration "save": 3600 1 300 100 60 10000
host configuration "appendonly": yes
multi-thread: no
Latency by percentile distribution: 0.000% <= 0.079 milliseconds (cumulative count 2) 50.000% <= 0.183 milliseconds (cumulative count 5405076) 75.000% <= 0.207 milliseconds (cumulative count 7644872) 87.500% <= 0.231 milliseconds (cumulative count 8934659) 93.750% <= 0.239 milliseconds (cumulative count 9442491) 96.875% <= 0.247 milliseconds (cumulative count 9734950) 98.438% <= 0.271 milliseconds (cumulative count 9855272) 99.219% <= 0.383 milliseconds (cumulative count 9922465) 99.609% <= 0.479 milliseconds (cumulative count 9963908) 99.805% <= 0.535 milliseconds (cumulative count 9981729) 99.902% <= 0.599 milliseconds (cumulative count 9990600) 99.951% <= 0.687 milliseconds (cumulative count 9995294) 99.976% <= 0.775 milliseconds (cumulative count 9997677) 99.988% <= 0.903 milliseconds (cumulative count 9998792) 99.994% <= 7.671 milliseconds (cumulative count 9999393) 99.997% <= 7.911 milliseconds (cumulative count 9999704) 99.998% <= 8.159 milliseconds (cumulative count 9999849) 99.999% <= 9.895 milliseconds (cumulative count 9999927) 100.000% <= 10.007 milliseconds (cumulative count 9999964) 100.000% <= 10.071 milliseconds (cumulative count 9999982) 100.000% <= 10.135 milliseconds (cumulative count 9999991) 100.000% <= 10.191 milliseconds (cumulative count 9999996) 100.000% <= 10.231 milliseconds (cumulative count 9999998) 100.000% <= 10.255 milliseconds (cumulative count 9999999) 100.000% <= 10.279 milliseconds (cumulative count 10000000) 100.000% <= 10.279 milliseconds (cumulative count 10000000)
Cumulative distribution of latencies: 0.059% <= 0.103 milliseconds (cumulative count 5931) 76.449% <= 0.207 milliseconds (cumulative count 7644872) 98.828% <= 0.303 milliseconds (cumulative count 9882762) 99.324% <= 0.407 milliseconds (cumulative count 9932391) 99.729% <= 0.503 milliseconds (cumulative count 9972912) 99.912% <= 0.607 milliseconds (cumulative count 9991180) 99.958% <= 0.703 milliseconds (cumulative count 9995837) 99.981% <= 0.807 milliseconds (cumulative count 9998127) 99.988% <= 0.903 milliseconds (cumulative count 9998792) 99.990% <= 1.007 milliseconds (cumulative count 9998985) 99.990% <= 1.103 milliseconds (cumulative count 9999047) 99.991% <= 1.207 milliseconds (cumulative count 9999090) 99.991% <= 1.303 milliseconds (cumulative count 9999114) 99.991% <= 1.407 milliseconds (cumulative count 9999133) 99.992% <= 1.503 milliseconds (cumulative count 9999151) 99.992% <= 1.607 milliseconds (cumulative count 9999172) 99.992% <= 1.703 milliseconds (cumulative count 9999186) 99.992% <= 1.807 milliseconds (cumulative count 9999203) 99.992% <= 1.903 milliseconds (cumulative count 9999215) 99.992% <= 2.007 milliseconds (cumulative count 9999229) 99.992% <= 2.103 milliseconds (cumulative count 9999242) 99.993% <= 3.103 milliseconds (cumulative count 9999320) 99.998% <= 8.103 milliseconds (cumulative count 9999837) 99.999% <= 9.103 milliseconds (cumulative count 9999880) 100.000% <= 10.103 milliseconds (cumulative count 9999989) 100.000% <= 11.103 milliseconds (cumulative count 10000000)
Summary:
throughput summary: 70063.34 requests per second
latency summary (msec): avg 0.185, min 0.072, p50 0.183, p95 0.247, p99 0.335, max 10.279
paas@dsde05:~/sher/valkey/src$ ./valkey-cli -p 30003 127.0.0.1:30003> info memory #Memory used_memory:3669720 used_memory_human:3.50M used_memory_rss:10362880 used_memory_rss_human:9.88M used_memory_peak:4431264 used_memory_peak_human:4.23M used_memory_peak_perc:82.81% used_memory_overhead:3595952 used_memory_startup:2232208 used_memory_dataset:73768 used_memory_dataset_perc:5.13% allocator_allocated:4016352 allocator_active:4325376 allocator_resident:15462400 allocator_muzzy:0 total_system_memory:405353803776 total_system_memory_human:377.52G used_memory_lua:31744 used_memory_vm_eval:31744 used_memory_lua_human:31.00K used_memory_scripts_eval:0 number_of_cached_scripts:0 number_of_functions:0 number_of_libraries:0 used_memory_vm_functions:33792 used_memory_vm_total:65536 used_memory_vm_total_human:64.00K used_memory_functions:184 used_memory_scripts:184 used_memory_scripts_human:184B maxmemory:0 maxmemory_human:0B maxmemory_policy:noeviction allocator_frag_ratio:1.08 allocator_frag_bytes:309024 allocator_rss_ratio:3.57 allocator_rss_bytes:11137024 rss_overhead_ratio:0.67 rss_overhead_bytes:-5099520 mem_fragmentation_ratio:2.82 mem_fragmentation_bytes:6693360 mem_not_counted_for_evict:15008 mem_replication_backlog:1048592 mem_total_replication_buffers:1066208 mem_clients_slaves:17632 mem_clients_normal:22400 mem_cluster_links:10720 mem_aof_buffer:1536 mem_allocator: jemalloc-5.3.0 mem_overhead_db_hashtable_rehashing:0 active_defrag_running:0 lazyfree_pending_objects:0 lazyfreed_objects:0 127.0.0.1:30003>
mimalloc logs:
paas@dsde05:~/sher/valkey/src$ ./valkey-benchmark -t set -n 10000000 -c 20 -p 30003
====== SET ======
10000000 requests completed in 127.12 seconds
20 parallel clients
3 bytes payload
keep alive: 1
host configuration "save": 3600 1 300 100 60 10000
host configuration "appendonly": yes
multi-thread: no
Latency by percentile distribution: 0.000% <= 0.047 milliseconds (cumulative count 1) 50.000% <= 0.175 milliseconds (cumulative count 5194078) 75.000% <= 0.207 milliseconds (cumulative count 7538394) 87.500% <= 0.231 milliseconds (cumulative count 9072429) 93.750% <= 0.239 milliseconds (cumulative count 9411489) 96.875% <= 0.255 milliseconds (cumulative count 9712046) 98.438% <= 0.271 milliseconds (cumulative count 9856645) 99.219% <= 0.351 milliseconds (cumulative count 9922487) 99.609% <= 0.455 milliseconds (cumulative count 9961749) 99.805% <= 0.519 milliseconds (cumulative count 9981356) 99.902% <= 0.575 milliseconds (cumulative count 9990483) 99.951% <= 0.647 milliseconds (cumulative count 9995220) 99.976% <= 0.735 milliseconds (cumulative count 9997581) 99.988% <= 0.847 milliseconds (cumulative count 9998830) 99.994% <= 1.055 milliseconds (cumulative count 9999398) 99.997% <= 7.727 milliseconds (cumulative count 9999697) 99.998% <= 7.919 milliseconds (cumulative count 9999848) 99.999% <= 9.487 milliseconds (cumulative count 9999925) 100.000% <= 9.599 milliseconds (cumulative count 9999962) 100.000% <= 17.551 milliseconds (cumulative count 9999981) 100.000% <= 17.647 milliseconds (cumulative count 9999991) 100.000% <= 17.711 milliseconds (cumulative count 9999996) 100.000% <= 17.759 milliseconds (cumulative count 9999998) 100.000% <= 17.775 milliseconds (cumulative count 9999999) 100.000% <= 17.791 milliseconds (cumulative count 10000000) 100.000% <= 17.791 milliseconds (cumulative count 10000000)
Cumulative distribution of latencies: 2.994% <= 0.103 milliseconds (cumulative count 299447) 75.384% <= 0.207 milliseconds (cumulative count 7538394) 99.028% <= 0.303 milliseconds (cumulative count 9902814) 99.436% <= 0.407 milliseconds (cumulative count 9943610) 99.774% <= 0.503 milliseconds (cumulative count 9977426) 99.932% <= 0.607 milliseconds (cumulative count 9993172) 99.969% <= 0.703 milliseconds (cumulative count 9996941) 99.985% <= 0.807 milliseconds (cumulative count 9998491) 99.991% <= 0.903 milliseconds (cumulative count 9999072) 99.993% <= 1.007 milliseconds (cumulative count 9999311) 99.994% <= 1.103 milliseconds (cumulative count 9999441) 99.995% <= 1.207 milliseconds (cumulative count 9999478) 99.995% <= 1.303 milliseconds (cumulative count 9999512) 99.995% <= 1.407 milliseconds (cumulative count 9999546) 99.996% <= 1.503 milliseconds (cumulative count 9999577) 99.996% <= 1.607 milliseconds (cumulative count 9999595) 99.996% <= 1.703 milliseconds (cumulative count 9999616) 99.996% <= 1.807 milliseconds (cumulative count 9999631) 99.996% <= 1.903 milliseconds (cumulative count 9999639) 99.996% <= 2.007 milliseconds (cumulative count 9999640) 99.999% <= 8.103 milliseconds (cumulative count 9999880) 100.000% <= 10.103 milliseconds (cumulative count 9999980) 100.000% <= 18.111 milliseconds (cumulative count 10000000)
Summary:
throughput summary: 78664.59 requests per second
latency summary (msec): avg 0.176, min 0.040, p50 0.175, p95 0.247, p99 0.303, max 17.791
paas@dsde05:~/sher/valkey/src$ ./valkey-cli -p 30003 127.0.0.1:30003> info memory #Memory used_memory:3686128 used_memory_human:3.52M used_memory_rss:8605696 used_memory_rss_human:8.21M used_memory_peak:4449360 used_memory_peak_human:4.24M used_memory_peak_perc:82.85% used_memory_overhead:3611528 used_memory_startup:2248664 used_memory_dataset:74600 used_memory_dataset_perc:5.19% allocator_allocated:3685904 allocator_active:8573952 allocator_resident:8573952 allocator_muzzy:0 total_system_memory:405353803776 total_system_memory_human:377.52G used_memory_lua:31744 used_memory_vm_eval:31744 used_memory_lua_human:31.00K used_memory_scripts_eval:0 number_of_cached_scripts:0 number_of_functions:0 number_of_libraries:0 used_memory_vm_functions:33792 used_memory_vm_total:65536 used_memory_vm_total_human:64.00K used_memory_functions:200 used_memory_scripts:200 used_memory_scripts_human:200B maxmemory:0 maxmemory_human:0B maxmemory_policy:noeviction allocator_frag_ratio:1.00 allocator_frag_bytes:0 allocator_rss_ratio:1.00 allocator_rss_bytes:0 rss_overhead_ratio:1.00 rss_overhead_bytes:31744 mem_fragmentation_ratio:2.33 mem_fragmentation_bytes:4919792 mem_not_counted_for_evict:14112 mem_replication_backlog:1048592 mem_total_replication_buffers:1066208 mem_clients_slaves:17632 mem_clients_normal:22400 mem_cluster_links:10720 mem_aof_buffer:640 mem_allocator:mimalloc-1.8.5 mem_overhead_db_hashtable_rehashing:0 active_defrag_running:0 lazyfree_pending_objects:0 lazyfreed_objects:0 127.0.0.1:30003>
- What happens with fragmentation on the mimalloc allocator?
Mimalloc efficiently manages fragmentation through strategies like segment reuse and object migration. However, unlike jemalloc, it does not support manual defragmentation commands.
- Should we introduce a `defrag_supported` field to the `INFO` command? It would be handy for users to figure out whether the server supports it or not. Currently, I think setting `CONFIG SET ACTIVEDEFRAG YES` would have no effect on the server.
Sure, it would clearly indicate whether active defragmentation is supported, preventing confusion. I will add this to the `INFO` command.
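A hypothetical sketch of how such a field might be consumed once added; the `defrag_supported` name and its placement in the memory section are assumptions taken from this thread, not an existing interface:

```sh
# Hypothetical: defrag_supported does not exist yet; the field name and
# section are assumptions based on the discussion above.
./valkey-cli -p 30003 INFO memory | grep defrag_supported
# Expected to report 1 on a defrag-capable (jemalloc) build and 0 on a
# mimalloc or libc build.
```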
Came across this discussion on mimalloc: https://github.com/microsoft/mimalloc/issues/632. Looks like the Dragonfly folks have made some changes by vendoring mimalloc and improving the utilization. @PingXie What are your thoughts?
This is essentially the same argument for vendoring jemalloc (#364) in the first place.
I think the key question is whether we believe this is going to be a common enough case that the value of vendoring outweighs its overhead. I am sure fragmentation can be an issue for certain workloads, but are we looking at 0.01%, 0.1%, 1%, or 10%? Also, how bad would a more generic defrag solution be? Does there exist a generic solution that works for all allocators and gives us, say, 90% of the "effectiveness", or 80%, 70%, etc.?
Vendoring itself is a concern to me, and adding another dependency to the already long list of vendored dependencies is a further concern. Our default stance should be "no vendoring". Exceptions are inevitable, but there needs to be a very compelling argument IMO.
@WM0323, I'm curious to get your thoughts on what may be some of the reasons that the Rust community reached the opposite conclusion: jemalloc is faster and consumes less memory than mimalloc. See discussion.
To clarify, I'm not suggesting that this change shouldn't be accepted. Maybe there is something we can learn here about conducting such tests.
Some thoughts around improving the benchmark signal:
- Consider incorporating SET commands with varying payload sizes.
- The benchmark should exercise `free` more often. Consider issuing a mix of DEL and SET commands (a runnable sketch follows this list).
- Consider looking at the real time spent in the allocators. Overall benchmark numbers may be less consistent, as they are subject to host noise. This is also aligned with the way the Rust community did their evaluation.
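A hedged sketch of such a mixed workload, assuming a single non-cluster node listening on port 30003 (a cluster run would additionally need `--cluster` and hash-tagged keys); the payload sizes, key counts, and the DEL loop are illustrative, not part of the patch:

```sh
# Alternate SET rounds with varying payload sizes and bulk deletes so the
# allocator's free/reuse path is exercised, then snapshot fragmentation.
for size in 16 128 1024 4096; do
    # SETs against a random keyspace with the given value size.
    ./valkey-benchmark -p 30003 -c 20 -n 1000000 -r 100000 -d "$size" -t set
    # Delete the keys written by the benchmark (they match key:*).
    ./valkey-cli -p 30003 --scan --pattern 'key:*' | \
        xargs -r -n 500 ./valkey-cli -p 30003 DEL > /dev/null
    # Record fragmentation after each round.
    ./valkey-cli -p 30003 INFO memory | \
        grep -E 'allocator_frag_ratio|mem_fragmentation_ratio'
done
```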
Thanks for your input @yairgott. This change just lets users opt into mimalloc; it doesn't make it the default, so I think we are fine supporting it.
I agree with the benchmark request; the workload described above would be closer to a real-world scenario and would show how the allocator handles fragmentation.