
Free strings during BGSAVE

Open poiuj opened this issue 1 year ago • 4 comments

Motivation

Copy-on-write (COW) amplification refers to the issue where writing to a small object causes the entire page to be cloned, resulting in inefficient memory usage. This issue arises during the BGSAVE process and can be particularly problematic on instances with limited memory. If the BGSAVE child process could release memory it no longer needs, it would reduce memory consumption. To that end, the BGSAVE process calls the madvise function to signal the operating system that it can reclaim a buffer. However, this approach does not work for buffers smaller than a page (usually 4 KiB): even after multiple such calls have together freed up a full page's worth of data, the operating system will not reclaim it. To solve this issue, we can call zfree directly. This lets the allocator (jemalloc) handle the bookkeeping and release pages when the buffers on them are no longer needed, which reduces copy-on-write events.

Benchmarks

To understand how the use of zfree affects BGSAVE and memory consumption, I ran 45 benchmarks comparing my clone with the vanilla version. Each benchmark has the following steps:

  1. Start a new Valkey process
  2. Fill the DB with data sequentially
  3. Run a warmup to randomize the memory layout
  4. Introduce fragmentation by deleting part of the keys
  5. In parallel:
    1. Trigger BGSAVE
    2. Start 80/20 get/set load

I varied the following parameters to understand their influence:

  1. Number of keys: 3M, 6M, and 12M.
  2. Data size. While the keys themselves have a fixed length of ~30 bytes, the value size is 120, 250, 500, 1000, or 2000 bytes.
  3. Fragmentation. I delete 5%, 10%, or 15% of the original key range.

I'm attaching a graph of the BGSAVE process's memory consumption. Rather than all benchmarks, I show the runs I consider most representative.

bgsave-free-key-numbers

For 2000-byte values, peak memory usage is ~53% of vanilla's; the peak happens at ~57% BGSAVE progress. For 500-byte values, the peak is ~80% of vanilla's and happens at ~80% progress. For 120-byte values, the difference is under 5%, and the patched version can even use more memory.

bgsave-free-key-sizes

For 12M keys, the peak is ~85% of vanilla's and happens at the ~70% mark. For 6M keys, the peak is ~87% of vanilla's, at the ~77% mark. For 3M keys, the peak is ~87% of vanilla's, at the ~80% mark.

Changes

The PR contains 2 changes:

  1. Static buffer for RDB compression. RDB compression leads to COW events even without any write load if we use zfree. This happens because the compression function allocates a new buffer for each object; combined with freeing objects via zfree, this leads to reuse of memory shared with the main process. To deal with this problem, we use a static buffer for compression. If the object is too big for the static buffer, we fall back to the ad hoc allocation behavior.

  2. Freeing string objects instead of dismissing them. A call to zfree is more expensive than a direct call to madvise, but with #453 strings use the fast path, zfree_with_size. As a possible next step, we can optimize zfree for other data types as well.

poiuj avatar Aug 13 '24 17:08 poiuj

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 70.74%. Comparing base (a939cb8) to head (b4446ec). Report is 6 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #905      +/-   ##
============================================
+ Coverage     70.59%   70.74%   +0.14%     
============================================
  Files           117      118       +1     
  Lines         63324    63386      +62     
============================================
+ Hits          44704    44841     +137     
+ Misses        18620    18545      -75     
Files with missing lines | Coverage Δ
src/object.c | 79.15% <100.00%> (ø)
src/rdb.c | 76.57% <100.00%> (+0.32%) :arrow_up:

... and 19 files with indirect coverage changes

codecov[bot] avatar Aug 13 '24 23:08 codecov[bot]

I guess a general concern I have is that all of these benchmarks are run on systems that are being stressed. What if the write rate is very low, would we be increasing the CoW usage in the vast majority of cases?

madolson avatar Oct 07 '24 22:10 madolson

I guess a general concern I have is that all of these benchmarks are run on systems that are being stressed. What if the write rate is very low, would we be increasing the CoW usage in the vast majority of cases?

@madolson if the system is not stressed, is that a very big issue if the CoW is increased?

lets try and define the extra tests we need in order to gain more confidence? like :

  1. use different value sizes (very small, medium, large)
  2. Test the impact on CoW on idle system
  3. Test performance impact (?)

also IMO @poiuj provided some supporting data, but I think we probably need to improve the results visualization

ranshid avatar Oct 09 '24 10:10 ranshid

COW numbers for an idle system with 3M keys, depending on value size:

  - 3 byte values – 0 MiB
  - 45 byte values – 4 MiB (~1.1% of the memory usage reported by valkey)
  - 100 byte values – 5 MiB (~0.9%)
  - 250 byte values – 27 MiB (~2.9%)
  - 500 byte values – 50 MiB (~3%)
  - 1000 byte values – 100 MiB (~3.2%)

The overhead seems to be roughly in the 1-3% range.

poiuj avatar Oct 10 '24 16:10 poiuj

@valkey-io/core-team LGTM, but does anyone have other concerns?

ranshid avatar Nov 27 '24 16:11 ranshid

@poiuj can you please check how git managed to misspell your name in the signoff? The last committer name is spelled: Vadym Khoptyents. Please amend the commit in order to satisfy the DCO.

ranshid avatar Nov 27 '24 16:11 ranshid

LGTM, but the top comment does not seem to mention the static buffer in lzf.

@poiuj let's bring that up in the top comment

ranshid avatar Nov 27 '24 16:11 ranshid

The compression change is mentioned in the top comment:

Static buffer for RDB compression. RDB compression leads to COW events even without any write load if we use zfree. This happens because the compression function allocates a new buffer for each object; combined with freeing objects via zfree, this leads to reuse of memory shared with the main process. To deal with this problem, we use a static buffer for compression. If the object is too big for the static buffer, we fall back to the ad hoc allocation behavior.

@enjoy-binbin @ranshid can you suggest how to clarify the message here?

poiuj avatar Nov 27 '24 17:11 poiuj

The compression change is mentioned in the top comment:

Static buffer for RDB compression. RDB compression leads to COW events even without any write load if we use zfree. This happens because the compression function allocates a new buffer for each object; combined with freeing objects via zfree, this leads to reuse of memory shared with the main process. To deal with this problem, we use a static buffer for compression. If the object is too big for the static buffer, we fall back to the ad hoc allocation behavior.

@enjoy-binbin @ranshid can you suggest how to clarify the message here?

Maybe just mention that you use a constant 8K buffer to hold the compressed data.

ranshid avatar Nov 27 '24 17:11 ranshid

@poiuj one thought I just realized: we are targeting this for the case where jemalloc is the allocator, right? If we use libc malloc, will this introduce a degradation? Not sure how important it is to avoid degradation, since IMO any operational installation of valkey will probably use jemalloc, but maybe we can limit this optimization to the jemalloc case?

ranshid avatar Nov 28 '24 06:11 ranshid

If we use libc malloc, will this introduce a degradation? Not sure how important it is to avoid degradation, since IMO any operational installation of valkey will probably use jemalloc, but maybe we can limit this optimization to the jemalloc case?

Let's test it with libc malloc before we decide to use conditional code. Maybe it's good for other allocators too? Btw, libc malloc is different on Linux vs FreeBSD vs macOS. (FreeBSD's libc malloc is actually jemalloc.)

zuiderkwast avatar Nov 28 '24 09:11 zuiderkwast

If we use libc malloc, will this introduce a degradation? Not sure how important it is to avoid degradation, since IMO any operational installation of valkey will probably use jemalloc, but maybe we can limit this optimization to the jemalloc case?

Let's test it with libc malloc before we decide to use conditional code. Maybe it's good for other allocators too? Btw, libc malloc is different on Linux vs FreeBSD vs macOS. (FreeBSD's libc malloc is actually jemalloc.)

Yes, I know about the different platforms, and I agree we had better test for it. But I also do not feel very strongly against offering this only for jemalloc in order to reduce the blast radius. We often make changes that are jemalloc-biased, since it is practically the production-level allocator in use.

ranshid avatar Nov 28 '24 11:11 ranshid

I've run the benchmark with libc and there is no effect. It isn't worth complicating the code by turning this off for other allocators.

3m-fixed-libc

poiuj avatar Nov 28 '24 13:11 poiuj

I've run the benchmark with libc and there is no effect. It isn't worth complicating the code by turning this off for other allocators.

3m-fixed-libc

While I agree about avoiding code complications, as @zuiderkwast correctly mentioned, the impact might differ across platforms. I just think that if we are not going to test it on all common platforms, we should at least state that in the release notes. @zuiderkwast WDYT?

ranshid avatar Nov 28 '24 14:11 ranshid

note - taking off the run-extra-tests flag; we already have extended test coverage

ranshid avatar Nov 28 '24 14:11 ranshid

The failing test passes for me locally. I talked to @ranshid about it; it's related to another issue.

poiuj avatar Nov 28 '24 14:11 poiuj

I've run the benchmark with libc and there is no effect. It isn't worth complicating the code by turning this off for other allocators.

@poiuj That's great.

While I agree about avoiding code complications, as @zuiderkwast correctly mentioned, the impact might differ across platforms. I just think that if we are not going to test it on all common platforms, we should at least state that in the release notes. @zuiderkwast WDYT?

@ranshid I'm fine with just mentioning this change in release notes. We don't test all possible impacts on all possible platforms.

zuiderkwast avatar Nov 28 '24 15:11 zuiderkwast

@poiuj Can you run a full daily CI in your repo? We just want to ensure the tests pass on every platform (this is the only concern from my side).

hwware avatar Nov 28 '24 17:11 hwware

@poiuj Can you run a full daily CI in your repo? We just want to ensure the tests pass on every platform (this is the only concern from my side).

@hwware We already had the daily runs on this PR: https://github.com/valkey-io/valkey/actions/runs/12067940597/job/33651977697 The only failed test was related to a known issue with the recent defrag changes and is not new to this PR. I removed the tag in order to save unneeded executions. I spoke with @zuiderkwast offline, and we both agree this is fine to merge now, given that we add a release note stating that the change optimizes jemalloc builds and that, although not tested on all platforms, we did not notice any degradation with other allocators.

ranshid avatar Dec 01 '24 13:12 ranshid