defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code
Summary of the change
This is a base PR for refactoring defrag. It moves the defrag logic to rely on jemalloc native api instead of relying on custom code changes made by valkey in the jemalloc (je_defrag_hint) library. This enables valkey to use latest vanila jemalloc without the need to maintain code changes cross jemalloc versions.
This change requires some modifications because the new api is providing only the information, not a yes\no defrag. The logic needs to be implemented at valkey code. Additionally, the api does not provide, within single call, all the information needed to make a decision, this information is available through additional api call. To reduce the calls to jemalloc, in this PR the required information is collected during the computeDefragCycles and not for every single ptr, this way we are avoiding the additional api call.
Followup work will utilize the new options that are now open and will further improve the defrag decision and process.
Added files:
allocator_defrag.c / allocator_defrag.h - This files implement the allocator specific knowledge for making defrag decision. The knowledge about slabs and allocation logic and so on, all goes into this file. This improves the separation between jemalloc specific code and other possible implementation.
Moved functions:
zmalloc_no_tcache , zfree_no_tcache - these are very jemalloc specific logic assumptions, and are very specific to how we defrag with jemalloc. This is also with the vision that from performance perspective we should consider using tcache, we only need to make sure we don't recycle entries without going through the arena [for example: we can use private tcache, one for free and one for alloc].
frag_smallbins_bytes - the logic and implementation moved to the new file
Existing API:
- [once a second + when completed full cycle]
computeDefragCycleszmalloc_get_allocator_info: gets from jemalloc allocated, active, resident, retained, muzzy,frag_smallbins_bytesfrag_smallbins_bytes: for each bin; gets from jemalloc bin_info,curr_regs,cur_slabs
- [during defrag, for each pointer]
je_defrag_hintis getting a memory pointer and returns {0,1} . Internally it uses this information points:- #
nonfull_slabs - #
total_slabs - #free regs in the ptr slab
- #
Jemalloc API (via ctl interface)
[BATCH]experimental_utilization_batch_query_ctl : gets an array of pointers, returns for each pointer 3 values,
- number of free regions in the extent
- number of regions in the extent
- size of the extent in terms of bytes
[EXTENDED]experimental_utilization_query_ctl :
- memory address of the extent a potential reallocation would go into
- number of free regions in the extent
- number of regions in the extent
- size of the extent in terms of bytes
- [stats-enabled]total number of free regions in the bin the extent belongs to
- [stats-enabled]total number of regions in the bin the extent belongs to
experimental_utilization_batch_query_ctl vs valkey je_defrag_hint?
[good]
- We can query pointers in a batch, reduce the overall overhead
- The per ptr decision algorithm is not within jemalloc api, jemalloc only provides information, valkey can tune\configure\optimize easily
[bad]
- In the batch API we only know the utilization of the slab (of that memory ptr), we don’t get the data about #
nonfull_slabsand total allocated regs.
New functions:
-
defrag_jemalloc_init: Reducing the cost of call to je_ctl: use the MIB interface to get a faster calls. See this quote from the jemalloc documentation:The mallctlnametomib() function provides a way to avoid repeated name lookups for applications that repeatedly query the same portion of the namespace,by translating a name to a “Management Information Base” (MIB) that can be passed repeatedly to mallctlbymib().
-
jemalloc_sz2binind_lgq*: this api is to support reverse map between bin size and it’s info without lookup. This mapping depends on the number of size classes we have that are derived fromlg_quantum -
defrag_jemalloc_get_frag_smallbins: This function replacesfrag_smallbins_bytesthe logic moved to the new file allocator_defragdefrag_jemalloc_should_defrag_multi→handle_results- unpacks the results -
should_defrag: implements the same logic as the existing implementation inside je_defrag_hint -
defrag_jemalloc_should_defrag_multi: implements the hint for an array of pointers, utilizing the new batch api. currently only 1 pointer is passed.
Logical differences:
In order to get the information about #nonfull_slabs and #regs, we use the query cycle to collect the information per size class. In order to find the index of bin information given bin size, in o(1), we use jemalloc_sz2binind_lgq* .
Testing
This is the first draft. I did some initial testing that basically fragmentation by reducing max memory and than waiting for defrag to reach desired level. The test only serves as sanity that defrag is succeeding eventually, no data provided here regarding efficiency and performance.
Test:
- disable
activedefrag - run valkey benchmark on overlapping address ranges with different block sizes
- wait untill
used_memoryreaches 10GB - set
maxmemoryto 5GB andmaxmemory-policytoallkeys-lru - stop load
- wait for
mem_fragmentation_ratioto reach 2 - enable
activedefrag- start test timer - wait until reach
mem_fragmentation_ratio= 1.1
Results*:
(With this PR)Test results: 56 sec
(Without this PR)Test results: 67 sec
*both runs perform same "work" number of buffers moved to reach fragmentation target
Next benchmarking is to compare to:
- DONE // existing
je_get_defrag_hint - compare with naive defrag all:
int defrag_hint() {return 1;}
Codecov Report
Attention: Patch coverage is 93.56725% with 11 lines in your changes missing coverage. Please review.
Project coverage is 70.78%. Comparing base (
3c32ee1) to head (955da04). Report is 1 commits behind head on unstable.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/allocator_defrag.c | 92.99% | 11 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## unstable #692 +/- ##
============================================
+ Coverage 70.65% 70.78% +0.12%
============================================
Files 114 115 +1
Lines 63158 63284 +126
============================================
+ Hits 44624 44794 +170
+ Misses 18534 18490 -44
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/defrag.c | 86.64% <100.00%> (+1.98%) |
:arrow_up: |
| src/server.c | 87.75% <100.00%> (+0.05%) |
:arrow_up: |
| src/zmalloc.c | 82.60% <100.00%> (-2.07%) |
:arrow_down: |
| src/allocator_defrag.c | 92.99% <92.99%> (ø) |
@valkey-io/core-team this is the change we discussed in the last summit about aligning JEMALLOC vanila. Can you take a look and state if that makes sense before we start diving deeper to the review?
Thanks, @zvi-code! This is fantastic news!
@valkey-io/core-team, can we consider including this change in version 8.2? Devendoring jemalloc would enable us to leverage the optimized jemalloc library tailored for the target platform, along with all the benefits of devendoring. However, I believe incorporating this into version 8.0 would be quite challenging given our current timeline and the existing PR backlog. I want to make sure we are all on the same page regarding expectations.
@PingXie , Thanks. My only concern with waiting for 8.2 is that I have several followup improvements to defrag mechanism that will be delayed as a result. The changes are in several aspects [some examples from the top of my head]:
- Reduce the
cost of defragper byte ofdefragmented memory. Do so by: 1) improving the iteration of the defrag to be memory access friendly 2) utilize the batch support to reduce the cost of call to jemalloc 3) fine tune the use of tcache in defrag alloc/free 4) run defrag in parallel with worker threads - Improve the when and how defrag is running: for example: 1) prioritize free over defrag 2) if many free where done to a bin re-evaluate the defrag
quota3) add qos into defrag to smooth the impact on latency+tail latency 4) no point to defrag before we completed purge/flush 5) make adjustment to defrag based on progress achieved in full key space scan completion + consider alloc/dealloc activity during the scan cycle period [for example if we made no progress, we can predict with high probability that we will not make progress also in next run] ++ - Improve the what to defrag decision algorithm: 1) differentiate between even bin distribution and balanced [these require different prioritization cause the solve different problem] 2) consider change velocity 3) consider total locked capacity in bin ++
memory fragmentation is still a big problem in many use cases and I want valkey to be top performer in this aspect because it greatly affects the usability of memory resource. Ideally I want it to work so well that we will all agree to enabled it by default (if supported)
Thanks for the additional context, @zvi-code. I am aligned, directionally.
For transparency, we are going to cut our first RC in a few weeks in anticipation of a fall GA. So my concern is purely on the risk management side and I am not sure if we could converge on this PR soon. That said, I have marked this PR for "major-decision-pending" (in the sense of "when" as opposed to "whether"). Will check out the PR next.
@zvi-code When we release Valkey 8.0.0-rc1, we create the 8.0 branch, and we continue to merge new features into unstable. There is no freeze of the development. You will have time to finish the follow ups.
Valkey 8 rc1 is very soon. We have to work hard to review and merge the features already planned for Valkey 8. Probably we'll even have to exclude some of them. We don't want to delay releases indefinitely like they used to do in the R*dis times. Valkey 8.2 will be within the next 6 months. (That's the plan.)
@zvi-code When we release Valkey 8.0.0-rc1, we create the 8.0 branch, and we continue to merge new features into unstable. There is no freeze of the development. You will have time to finish the follow ups.
Valkey 8 rc1 is very soon. We have to work hard to review and merge the features already planned for Valkey 8. Probably we'll even have to exclude some of them. We don't want to delay releases indefinitely like they used to do in the R*dis times. Valkey 8.2 will be within the next 6 months. (That's the plan.)
@zuiderkwast & @PingXie appriciate your reply, make sense!
We are now just cherry-picking changes from unstable into the 8.0 branch, so we can resume investigating this now if it makes sense to go into the next minor release.
@madolson wrote a more elaborated overview in the top comment please let me know if it helps reading through this change
It's a lot of code that I don't understand. Is it basically the same logic that we used to have in our patched jemalloc that is converted to use mallctl instead of jemalloc's internals?
Yes with adjustment to differences between the API's
This doesn't seem like a major decision, so removing the tag. If we were to de-vendor it, we would want to see that in a separate PR.
Will post a PR with fixes to existing comments later this week
forced push by mistake
@madolson , @zuiderkwast , @ranshid, I created a new PR with current code and a separate commit for CR fixes. Seems I used wrong update method (rebased) and I could not align the branch without encountering issues.