Introduce active defragmentation
Following feedback from our customers, we have found that Dragonfly suffers from external fragmentation in some cases. This problem is common to any slab allocator, including tcmalloc, jemalloc, and mimalloc. Consider the following scenario:
- An application allocates lots of "blocks" (in mimalloc jargon) of size 384. The allocator then mmaps (allocates from the OS) lots of slabs that handle blocks of size 384, and those slabs take up gigabytes of memory. This is a healthy situation because `used_memory ~= rss_memory`.
- Then the application frees the vast majority of blocks across all the slabs of size 384. Most slabs are "almost" free, but the allocator cannot return them to the OS because some blocks are still busy.
- As a result, one can observe external fragmentation of the "unused" memory hidden in those slabs (or pages, in mimalloc jargon). See the graph below. This situation is problematic, and this is what we need to fix.
I do not see any other solution besides active defragmentation, i.e. going over all the allocated items in the heap and grouping them together, thus compacting them into fewer slabs. Something similar is done by active-defrag in Redis using the `je_get_defrag_hint` utility function.
We need to be smart about initial trigger conditions for the sweep and how we estimate the effectiveness of this defragmentation procedure in order not to introduce infinite loops.
The solution is to leverage helio's `OnIdle` task management framework.
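As a rough illustration of the idea (the names and priority levels here are assumptions for the sketch, not helio's real `OnIdle` API), the defrag idle task would report a high priority to the scheduler while a sweep is pending or in progress, and the lowest priority otherwise:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical priority levels for an OnIdle-style task: the scheduler runs
// tasks that report higher values more eagerly. These constants are
// illustrative assumptions, not helio's real API.
constexpr uint32_t kIdleTaskMinLevel = 0;   // "wait as long as possible"
constexpr uint32_t kIdleTaskMaxLevel = 10;  // "keep running until done"

struct DefragState {
  bool defrag_needed = false;     // set by the trigger heuristics
  bool pass_in_progress = false;  // a sweep cursor has not finished yet
};

// Returns the priority the defrag idle task should report to the scheduler.
uint32_t OnIdleDefrag(DefragState& st) {
  if (st.pass_in_progress || st.defrag_needed) {
    st.pass_in_progress = true;  // stay at top priority until the sweep ends
    st.defrag_needed = false;
    return kIdleTaskMaxLevel;
  }
  return kIdleTaskMinLevel;  // nothing to do; yield the CPU
}

// Called when the sweep cursor reaches the end of the heap.
void FinishPass(DefragState& st) { st.pass_in_progress = false; }
```

The point of the shape above is that the task itself stays registered permanently and only its reported priority changes, so no extra scheduling machinery is needed.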
In addition, we should add a heuristic that can categorize which pages (areas, slabs) should be defragmented.
This is how it was initially done in redis https://github.com/redis/redis/pull/5065/commits/e8099cabd19c4e3a46c94c39e69e13191d43f5eb though it evolved later.
I suggest the following heuristic as a baseline for mimalloc:
```cpp
// Taken from mimalloc private code.
static inline mi_slice_t* mi_slice_first(const mi_slice_t* slice) {
  mi_slice_t* start = (mi_slice_t*)((uint8_t*)slice - slice->slice_offset);
  return start;
}

// Taken from mimalloc private code.
static inline mi_segment_t* _mi_ptr_segment(const void* p) {
  return (mi_segment_t*)((uintptr_t)p & ~MI_SEGMENT_MASK);
}

#define mi_likely(x) __builtin_expect(!!(x), true)

// Returns true if the page is not active and its used blocks <= capacity * ratio.
bool mi_heap_page_is_underutilized(mi_heap_t* heap, void* p, float ratio) {
  mi_segment_t* const segment = _mi_ptr_segment(p);

  // From _mi_segment_page_of:
  ptrdiff_t diff = (uint8_t*)p - (uint8_t*)segment;
  size_t idx = (size_t)diff >> MI_SEGMENT_SLICE_SHIFT;
  mi_slice_t* slice0 = (mi_slice_t*)&segment->slices[idx];
  mi_slice_t* slice = mi_slice_first(slice0);  // adjust to the block that holds the page data
  mi_page_t* page = (mi_page_t*)slice;
  // End of _mi_segment_page_of.

  // From mi_page_heap:
  mi_heap_t* page_heap = (mi_heap_t*)(mi_atomic_load_relaxed(&(page)->xheap));

  // The heap id matches and it is not a full page.
  if (mi_likely(page_heap == heap && page->flags.x.in_full == 0)) {
    // mi_page_queue_t* pq = mi_heap_page_queue_of(heap, page);
    // First in the list, meaning it is the head of the page queue and thus
    // currently being used for malloc.
    if (page->prev == NULL)
      return false;

    // The page belongs to this heap and is not first in the page queue;
    // check its utilization.
    return page->used <= unsigned(page->capacity * ratio);
  }
  return false;
}
```
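To give a feel for how this heuristic would be consumed by the sweep, here is a toy, self-contained sketch: walk the shard's live values and, for every value whose page is flagged as underutilized, reallocate it so the allocator can repack it into a denser page. `PageIsUnderutilized` is a stand-in for `mi_heap_page_is_underutilized` (here it flags everything), and the `Table` is a grossly simplified stand-in for Dragonfly's actual data structures:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>
#include <unordered_map>

// Stand-in for mi_heap_page_is_underutilized(): the toy version flags every
// value for relocation; the real code would inspect the mimalloc page that
// owns the pointer.
static bool PageIsUnderutilized(const void* /*p*/) { return true; }

// Simplified stand-in for a shard-local table of heap-allocated values.
using Table = std::unordered_map<std::string, char*>;

// One defragmentation step: re-allocate values sitting on underutilized
// pages. Returns how many values were moved.
size_t DefragStep(Table& table) {
  size_t moved = 0;
  for (auto& [key, val] : table) {
    if (!PageIsUnderutilized(val)) continue;
    size_t len = strlen(val) + 1;
    char* fresh = static_cast<char*>(malloc(len));  // should land on a denser page
    memcpy(fresh, val, len);
    free(val);
    val = fresh;
    ++moved;
  }
  return moved;
}
```

The essential property is that a value is moved by allocate-copy-free rather than in place, since the allocator decides where new allocations land.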
a. We can trigger defragmentation based on thread-local conditions like the RSS/used ratio in a specific thread. We can easily track "used per thread" via `EngineShard::UsedMemory`. RSS usage can be approximated via the `committed` stats in `mi_heap_t.tls->stats`. The caveat is that we call `mi_stats_merge` from all the shard threads, and I suspect it resets thread-local heap stats during the merge. We would need to fix per-thread accounting if we want to use thread-local triggers.
b. Another option is to use process-global conditions like `GetMallocCurrentCommitted` and `used_mem_current`, but I prefer to explore option (a) first.
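Either way, the trigger itself boils down to comparing committed memory (an RSS proxy) against used memory. A minimal, self-contained sketch of such a check (the thresholds are made-up placeholders, not tuned values):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative thresholds; the real values would need tuning.
constexpr double kWasteRatioThreshold = 1.3;       // committed may exceed used by 30%
constexpr size_t kMinCommittedBytes = 256u << 20;  // ignore small heaps (256 MiB)

// Returns true if the gap between committed (~RSS) and used memory is large
// enough to justify a defragmentation sweep.
bool ShouldDefrag(size_t committed_bytes, size_t used_bytes) {
  if (committed_bytes < kMinCommittedBytes) return false;  // heap too small to care
  if (used_bytes == 0) return false;
  return static_cast<double>(committed_bytes) / used_bytes > kWasteRatioThreshold;
}
```

The same predicate works whether the inputs come from per-thread stats (option a) or from the process-global counters (option b).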
- We will add an idle task that checks whether we need to run this operation. If it is not required, the task returns the lowest value (meaning it can wait as long as possible); if the operation is needed, it returns the highest value, so that the task keeps executing until the memory defragmentation is done.
- There is one issue with the above options: option (a) will not work (see the related mi_malloc issue), so we will resort to the second option: using the global value from `GetMallocCurrentCommitted` together with `used_mem_current` to decide whether to run this or not.
This will be implemented as follows:
- Check whether memory usage is high enough.
- Check whether the difference between committed and used memory is high enough.
- Check whether memory usage is actually changing (to ensure that we are not running endlessly).
- If all the above pass, run on the shard and try to defragment memory by re-allocating values.
- If the cursor signals that we are not done, schedule the task to run at high priority; otherwise lower the task priority so that it does not use the CPU when not required.
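The steps above can be sketched as follows. All names, thresholds, and the trivial `DefragStep` stub are illustrative placeholders, not the real Dragonfly code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative placeholders for the real inputs and thresholds.
struct MemSnapshot {
  size_t committed = 0;  // e.g. from GetMallocCurrentCommitted()
  size_t used = 0;       // e.g. used_mem_current
};

constexpr size_t kMinUsedBytes = 128u << 20;  // "memory usage is high enough"
constexpr double kGapRatio = 1.3;             // "committed vs used diff is high enough"

constexpr uint32_t kLowPriority = 0;
constexpr uint32_t kHighPriority = 10;

struct DefragTask {
  MemSnapshot last;        // previous snapshot, to detect that memory is changing
  bool sweep_done = true;  // "cursor" state: true when no sweep is in progress

  // One idle-task invocation; returns the priority to report to the scheduler.
  uint32_t Run(const MemSnapshot& now) {
    if (!sweep_done) {
      sweep_done = DefragStep();  // cursor not done yet: keep sweeping
      return kHighPriority;
    }
    bool high_usage = now.used >= kMinUsedBytes;
    bool big_gap = now.used > 0 &&
                   static_cast<double>(now.committed) / now.used > kGapRatio;
    bool changed = now.used != last.used;  // avoid running endlessly on a static heap
    last = now;
    if (high_usage && big_gap && changed) {
      sweep_done = false;  // start a sweep on the next invocations
      return kHighPriority;
    }
    return kLowPriority;
  }

  // Stand-in for the actual re-allocation sweep; returns true when the cursor
  // reaches the end of the shard. Here it "finishes" immediately.
  bool DefragStep() { return true; }
};
```

Note that the "memory is changing" check compares against the snapshot from the previous invocation, which is what prevents the task from re-triggering forever on a heap whose usage has stabilized.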
The first version will not include the actual memory re-allocation step, to keep the algorithm simple. The main reason is to verify that we can use the idle task and to identify the cases where this operation is needed. Once we establish that the new task works in the system, the low-level details, including the code above, will be added.
The next step is to add the function `mi_heap_page_is_underutilized` as a patch to the third-party builds in helio. This will enable the implementation of the actual defragmentation. It should also fix the issue of building mimalloc in release mode while building Dragonfly in debug mode, and it will allow adding unit tests that verify the function correctly checks whether a pointer is located at a location that can be moved.