valkey defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code

Summary of the change

This is a base PR for refactoring defrag. It moves the defrag logic to rely on jemalloc native api instead of relying on custom code changes made by valkey in the jemalloc (je_defrag_hint) library. This enables valkey to use latest vanila jemalloc without the need to maintain code changes cross jemalloc versions.

This change requires some modifications because the new api is providing only the information, not a yes\no defrag. The logic needs to be implemented at valkey code. Additionally, the api does not provide, within single call, all the information needed to make a decision, this information is available through additional api call. To reduce the calls to jemalloc, in this PR the required information is collected during the computeDefragCycles and not for every single ptr, this way we are avoiding the additional api call. Followup work will utilize the new options that are now open and will further improve the defrag decision and process.

Added files:

allocator_defrag.c / allocator_defrag.h - This files implement the allocator specific knowledge for making defrag decision. The knowledge about slabs and allocation logic and so on, all goes into this file. This improves the separation between jemalloc specific code and other possible implementation.

Moved functions:

zmalloc_no_tcache , zfree_no_tcache - these are very jemalloc specific logic assumptions, and are very specific to how we defrag with jemalloc. This is also with the vision that from performance perspective we should consider using tcache, we only need to make sure we don't recycle entries without going through the arena [for example: we can use private tcache, one for free and one for alloc]. frag_smallbins_bytes - the logic and implementation moved to the new file

Existing API:

[once a second + when completed full cycle] computeDefragCycles
- zmalloc_get_allocator_info : gets from jemalloc allocated, active, resident, retained, muzzy, frag_smallbins_bytes
- frag_smallbins_bytes : for each bin; gets from jemalloc bin_info, curr_regs, cur_slabs
[during defrag, for each pointer]
- je_defrag_hint is getting a memory pointer and returns {0,1} . Internally it uses this information points:
  - #nonfull_slabs
  - #total_slabs
  - #free regs in the ptr slab

Jemalloc API (via ctl interface)

[BATCH]experimental_utilization_batch_query_ctl : gets an array of pointers, returns for each pointer 3 values,

number of free regions in the extent
number of regions in the extent
size of the extent in terms of bytes

[EXTENDED]experimental_utilization_query_ctl :

memory address of the extent a potential reallocation would go into
number of free regions in the extent
number of regions in the extent
size of the extent in terms of bytes
[stats-enabled]total number of free regions in the bin the extent belongs to
[stats-enabled]total number of regions in the bin the extent belongs to

`experimental_utilization_batch_query_ctl` vs valkey `je_defrag_hint`?

[good]

We can query pointers in a batch, reduce the overall overhead
The per ptr decision algorithm is not within jemalloc api, jemalloc only provides information, valkey can tune\configure\optimize easily

[bad]

In the batch API we only know the utilization of the slab (of that memory ptr), we don’t get the data about #nonfull_slabs and total allocated regs.

New functions:

defrag_jemalloc_init: Reducing the cost of call to je_ctl: use the MIB interface to get a faster calls. See this quote from the jemalloc documentation:

The mallctlnametomib() function provides a way to avoid repeated name lookups for applications that repeatedly query the same portion of the namespace,by translating a name to a “Management Information Base” (MIB) that can be passed repeatedly to mallctlbymib().
jemalloc_sz2binind_lgq* : this api is to support reverse map between bin size and it’s info without lookup. This mapping depends on the number of size classes we have that are derived from lg_quantum
defrag_jemalloc_get_frag_smallbins : This function replaces frag_smallbins_bytes the logic moved to the new file allocator_defrag defrag_jemalloc_should_defrag_multi → handle_results - unpacks the results
should_defrag : implements the same logic as the existing implementation inside je_defrag_hint
defrag_jemalloc_should_defrag_multi : implements the hint for an array of pointers, utilizing the new batch api. currently only 1 pointer is passed.

Logical differences:

In order to get the information about #nonfull_slabs and #regs, we use the query cycle to collect the information per size class. In order to find the index of bin information given bin size, in o(1), we use jemalloc_sz2binind_lgq* .

Testing

This is the first draft. I did some initial testing that basically fragmentation by reducing max memory and than waiting for defrag to reach desired level. The test only serves as sanity that defrag is succeeding eventually, no data provided here regarding efficiency and performance.

Test:

disable activedefrag
run valkey benchmark on overlapping address ranges with different block sizes
wait untill used_memory reaches 10GB
set maxmemory to 5GB and maxmemory-policy to allkeys-lru
stop load
wait for mem_fragmentation_ratio to reach 2
enable activedefrag - start test timer
wait until reach mem_fragmentation_ratio = 1.1

Results*:

(With this PR)Test results: 56 sec (Without this PR)Test results: 67 sec

*both runs perform same "work" number of buffers moved to reach fragmentation target

Next benchmarking is to compare to:

DONE // existing je_get_defrag_hint
compare with naive defrag all: int defrag_hint() {return 1;}

Jun 25 '24 07:06 zvi-code

Codecov Report

Attention: Patch coverage is 93.56725% with 11 lines in your changes missing coverage. Please review.

Project coverage is 70.78%. Comparing base (3c32ee1) to head (955da04). Report is 1 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/allocator_defrag.c	92.99%	11 Missing :warning:

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #692      +/-   ##
============================================
+ Coverage     70.65%   70.78%   +0.12%     
============================================
  Files           114      115       +1     
  Lines         63158    63284     +126     
============================================
+ Hits          44624    44794     +170     
+ Misses        18534    18490      -44

Files with missing lines	Coverage Δ
src/defrag.c	`86.64% <100.00%> (+1.98%)`	:arrow_up:
src/server.c	`87.75% <100.00%> (+0.05%)`	:arrow_up:
src/zmalloc.c	`82.60% <100.00%> (-2.07%)`	:arrow_down:
src/allocator_defrag.c	`92.99% <92.99%> (ø)`

... and 13 files with indirect coverage changes

Jun 26 '24 02:06 codecov[bot]

@valkey-io/core-team this is the change we discussed in the last summit about aligning JEMALLOC vanila. Can you take a look and state if that makes sense before we start diving deeper to the review?

Jun 26 '24 05:06 ranshid

Thanks, @zvi-code! This is fantastic news!

@valkey-io/core-team, can we consider including this change in version 8.2? Devendoring jemalloc would enable us to leverage the optimized jemalloc library tailored for the target platform, along with all the benefits of devendoring. However, I believe incorporating this into version 8.0 would be quite challenging given our current timeline and the existing PR backlog. I want to make sure we are all on the same page regarding expectations.

Jul 02 '24 05:07 PingXie

@PingXie , Thanks. My only concern with waiting for 8.2 is that I have several followup improvements to defrag mechanism that will be delayed as a result. The changes are in several aspects [some examples from the top of my head]:

Reduce the cost of defrag per byte of defragmented memory. Do so by: 1) improving the iteration of the defrag to be memory access friendly 2) utilize the batch support to reduce the cost of call to jemalloc 3) fine tune the use of tcache in defrag alloc/free 4) run defrag in parallel with worker threads
Improve the when and how defrag is running: for example: 1) prioritize free over defrag 2) if many free where done to a bin re-evaluate the defrag quota 3) add qos into defrag to smooth the impact on latency+tail latency 4) no point to defrag before we completed purge/flush 5) make adjustment to defrag based on progress achieved in full key space scan completion + consider alloc/dealloc activity during the scan cycle period [for example if we made no progress, we can predict with high probability that we will not make progress also in next run] ++
Improve the what to defrag decision algorithm: 1) differentiate between even bin distribution and balanced [these require different prioritization cause the solve different problem] 2) consider change velocity 3) consider total locked capacity in bin ++

memory fragmentation is still a big problem in many use cases and I want valkey to be top performer in this aspect because it greatly affects the usability of memory resource. Ideally I want it to work so well that we will all agree to enabled it by default (if supported)

Jul 02 '24 11:07 zvi-code

Thanks for the additional context, @zvi-code. I am aligned, directionally.

For transparency, we are going to cut our first RC in a few weeks in anticipation of a fall GA. So my concern is purely on the risk management side and I am not sure if we could converge on this PR soon. That said, I have marked this PR for "major-decision-pending" (in the sense of "when" as opposed to "whether"). Will check out the PR next.

Jul 02 '24 16:07 PingXie

@zvi-code When we release Valkey 8.0.0-rc1, we create the 8.0 branch, and we continue to merge new features into unstable. There is no freeze of the development. You will have time to finish the follow ups.

Valkey 8 rc1 is very soon. We have to work hard to review and merge the features already planned for Valkey 8. Probably we'll even have to exclude some of them. We don't want to delay releases indefinitely like they used to do in the R*dis times. Valkey 8.2 will be within the next 6 months. (That's the plan.)

Jul 02 '24 16:07 zuiderkwast

@zvi-code When we release Valkey 8.0.0-rc1, we create the 8.0 branch, and we continue to merge new features into unstable. There is no freeze of the development. You will have time to finish the follow ups.

Valkey 8 rc1 is very soon. We have to work hard to review and merge the features already planned for Valkey 8. Probably we'll even have to exclude some of them. We don't want to delay releases indefinitely like they used to do in the R*dis times. Valkey 8.2 will be within the next 6 months. (That's the plan.)

@zuiderkwast & @PingXie appriciate your reply, make sense!

Jul 02 '24 16:07 zvi-code

We are now just cherry-picking changes from unstable into the 8.0 branch, so we can resume investigating this now if it makes sense to go into the next minor release.

Sep 06 '24 16:09 madolson

@madolson wrote a more elaborated overview in the top comment please let me know if it helps reading through this change

Sep 17 '24 11:09 zvi-code

It's a lot of code that I don't understand. Is it basically the same logic that we used to have in our patched jemalloc that is converted to use mallctl instead of jemalloc's internals?

Yes with adjustment to differences between the API's

Oct 09 '24 04:10 zvi-code

This doesn't seem like a major decision, so removing the tag. If we were to de-vendor it, we would want to see that in a separate PR.

Oct 14 '24 14:10 madolson

Will post a PR with fixes to existing comments later this week

Oct 20 '24 08:10 zvi-code

forced push by mistake

Nov 05 '24 12:11 zvi-code

@madolson , @zuiderkwast , @ranshid, I created a new PR with current code and a separate commit for CR fixes. Seems I used wrong update method (rebased) and I could not align the branch without encountering issues.

Nov 05 '24 15:11 zvi-code

defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code

Summary of the change

Added files:

Moved functions:

Existing API:

Jemalloc API (via ctl interface)

experimental_utilization_batch_query_ctl vs valkey je_defrag_hint?

New functions:

Logical differences:

Testing

Test:

Results*:

Codecov Report

`experimental_utilization_batch_query_ctl` vs valkey `je_defrag_hint`?