dev3: mimalloc touches uncommitted memory in very rare cases
Hi Daan! Thanks for quickly addressing the issue with touching uncommitted memory, it greatly helped!
However, I'm still observing very rare crash in mimalloc with the same symptoms: mimalloc tried to write to address that is MEM_RESERVE, but not MEM_COMMIT...
It happens with such callstack:
_mi_memset(void *,int,unsigned __int64)
internal.h:1102
_mi_memzero(void *,unsigned __int64)
internal.h:1150
mi_arenas_page_alloc_fresh(unsigned __int64,unsigned __int64,unsigned __int64,mi_arena_s *,int,bool,mi_tld_s *)
arena.c:682
mi_arenas_page_singleton_alloc(mi_heap_s *,unsigned __int64,unsigned __int64)
arena.c:757
mi_page_fresh_alloc(mi_heap_s *,mi_page_queue_s *,unsigned __int64,unsigned __int64)
page.c:305
mi_huge_page_alloc(mi_heap_s *,unsigned __int64,unsigned __int64,mi_page_queue_s *)
page.c:897
mi_find_page(mi_heap_s *,unsigned __int64,unsigned __int64)
page.c:925
_mi_malloc_generic(mi_heap_s *,unsigned __int64,bool,unsigned __int64)
page.c:968
_mi_heap_realloc_zero(mi_heap_s *,void *,unsigned __int64,bool)
And I suspect something is off with this code:
// claimed free slices: initialize the page partly
if (!memid.initially_zero && memid.initially_committed) {
mi_track_mem_undefined(page, slice_count * MI_ARENA_SLICE_SIZE);
_mi_memzero_aligned(page, sizeof(*page));
}
Maybe initially_committed was set to true mistakenly?...
Could you take a look, please?
It seems like in _mi_os_alloc_aligned we initialize mi_memid_t to _mi_memid_none() which leaves initially_committed in uninitialized state? ...
It seems like in
_mi_os_alloc_alignedwe initializemi_memid_tto_mi_memid_none()which leavesinitially_committedin uninitialized state? ...
I don't think this is it as we reinitialize on success with mi_memid_create_os. I am not sure but since the trace shows a huge singleton block, it might be that only the commit is failing? I added an extra check for the return value of mi_arena_commit.
Still, it is strange since I think the commit parameter is true at mi_arenas_page_alloc_fresh (as we come from singleton) and therefore it should be committed -- if you can break on the failing memzero it would be great if you can see if the memid points to OS memory or arena memory. TBC
Unfortunately, I'm unable to repro it :( It's a crash I've got from the wild. The fix looks promising, tho! I will give it a try a get back to you... Thanks!
Just pushed some more improvements to handling of commit failure in dev3 -- hope it will fix the issue but not sure ..
Hi @daanx , the crash is still present :( I've inspected code manually, and it seems like here (in arena.c), we forget to unset the "committed" bit in case commit failes:
if (!mi_arena_commit(arena, p, mi_size_of_slices(slice_count), &commit_zero, mi_size_of_slices(slice_count - already_committed_count))) {
// if the commit fails, we roll back
_mi_arenas_free( p, mi_size_of_slices(slice_count), *memid); // this will uncommit as well
// need to call mi_bitmap_clearN(arena->slices_committed, slice_index, slice_count); here? <-------- HERE
return NULL;
}
Hi @Noxybot -- ah yes! .. but we already found this issue last Tuesday and fixed it in the dev3-cdb branch -- I will port it to the dev3 branch soon! (apologies -- was just too busy over the past days). I hope that will fix the crash you are seeing!
Hi @Noxybot -- ah yes! .. but we already found this issue last Tuesday and fixed it in the
dev3-cdbbranch -- I will port it to thedev3branch soon! (apologies -- was just too busy over the past days). I hope that will fix the crash you are seeing!
Thanks for confirming. I’m looking forward to have this fix in dev3!
I pushed v3.1.5 that should fix this now.
Hi @Daanx, the latest dev3 is much better (~8 times less crashes), however there are still some:
- nullptr access in
mi_page_map_set_range:
mi_page_t** sub = mi_page_map_ensure_submap_at(idx);
// set the offsets for the page
while (sub_idx < MI_PAGE_MAP_SUB_COUNT) {
sub[sub_idx] = page; //<---- sub might NULL here (see mi_page_map_ensure_submap_at)
slice_count--; if (slice_count == 0) return;
sub_idx++;
}
mi_page_map_ensure_submap_at can return NULL:
const size_t submap_size = MI_PAGE_MAP_SUB_SIZE;
sub = (mi_page_t**)_mi_os_zalloc(submap_size, &memid);
if (sub == NULL) {
_mi_error_message(EFAULT, "internal error: unable to extend the page map\n");
return NULL;
}
- There seem to be some rare crashes in
_mi_page_malloc_zero/mi_page_thread_collect_to_local, but it's hard to tell what's exactly wrong there. (Both crashes seems to be close to where we callmi_block_next).
Sorry, I haven't shared callstack, which is extremely unhelpful :) Here is the callstack for the first issue:
1
mi_page_map_set_range(mi_page_s *,unsigned __int64,unsigned __int64,unsigned __int64)
page-map.c:321
2
mi_arenas_page_alloc_fresh(unsigned __int64,unsigned __int64,unsigned __int64,mi_arena_s *,int,bool,mi_tld_s *)
arena.c:716
3
mi_arenas_page_singleton_alloc(mi_heap_s *,unsigned __int64,unsigned __int64)
arena.c:767
4
mi_page_fresh_alloc(mi_heap_s *,mi_page_queue_s *,unsigned __int64,unsigned __int64)
page.c:306
5
mi_huge_page_alloc(mi_heap_s *,unsigned __int64,unsigned __int64,mi_page_queue_s *)
page.c:897
6
mi_find_page(mi_heap_s *,unsigned __int64,unsigned __int64)
page.c:925
7
_mi_malloc_generic(mi_heap_s *,unsigned __int64,bool,unsigned __int64)
page.c:968
8
_mi_heap_realloc_zero(mi_heap_s *,void *,unsigned __int64,bool)
And here is the callstack for second issue:
1
mi_page_thread_collect_to_local(mi_page_s *,mi_block_s *)
page.c:151
2
mi_page_thread_free_collect(mi_page_s *)
page.c:184
3
_mi_page_free_collect(mi_page_s *,bool)
page.c:191
4
_mi_heap_page_reclaim(mi_heap_s *,mi_page_s *)
page.c:277
5
mi_free_try_collect_mt(mi_page_s *,mi_block_s *)
I fixed the first issue in the latest dev3. The second one is mysterious... I hope fixing the first issue fixes that one as well since I have no initial idea what the cause may be there.
Hi Daan, the latest dev3 is much better (the best we've seen so far), however, there still seems to be a tiny crash the reason for which is unclear...
[0x0] App.exe!mi_page_thread_collect_to_local+0x1f() (Inline Function) (Inline Function)
[0x1] App.exe!mi_page_thread_free_collect+0x42() (Inline Function) (Inline Function)
[0x2] App.exe!_mi_page_free_collect+0x4c(mi_page_s * page = 0x2982ede0000, bool force = false) 0x297ebc5f1c0 0x7ff728e0d50b
[0x3] App.exe!_mi_heap_page_reclaim+0x4b(mi_heap_s * heap = 0x29781ea9380, mi_page_s * page = 0x2982ede0000) 0x297ebc5f1f0 0x7ff728e03ae6
[0x4] App.exe!mi_free_try_collect_mt+0x336(mi_page_s * page = 0x2982ede0000, mt_free = <unavailable>) 0x297ebc5f220 0x7ff728dfe012
It seems like at some point mi_block_next returns invalid next block.