mimalloc icon indicating copy to clipboard operation
mimalloc copied to clipboard

dev3: mimalloc touches uncommitted memory in very rare cases

Open Noxybot opened this issue 6 months ago • 12 comments

Hi Daan! Thanks for quickly addressing the issue with touching uncommitted memory, it greatly helped! However, I'm still observing very rare crash in mimalloc with the same symptoms: mimalloc tried to write to address that is MEM_RESERVE, but not MEM_COMMIT...

It happens with such callstack:

_mi_memset(void *,int,unsigned __int64)
‎internal.h:1102
_mi_memzero(void *,unsigned __int64)
‎internal.h:1150
mi_arenas_page_alloc_fresh(unsigned __int64,unsigned __int64,unsigned __int64,mi_arena_s *,int,bool,mi_tld_s *)
‎arena.c:682
mi_arenas_page_singleton_alloc(mi_heap_s *,unsigned __int64,unsigned __int64)
‎arena.c:757
mi_page_fresh_alloc(mi_heap_s *,mi_page_queue_s *,unsigned __int64,unsigned __int64)
‎page.c:305
mi_huge_page_alloc(mi_heap_s *,unsigned __int64,unsigned __int64,mi_page_queue_s *)
‎page.c:897
mi_find_page(mi_heap_s *,unsigned __int64,unsigned __int64)
‎page.c:925
_mi_malloc_generic(mi_heap_s *,unsigned __int64,bool,unsigned __int64)
‎page.c:968
_mi_heap_realloc_zero(mi_heap_s *,void *,unsigned __int64,bool)

And I suspect something is off with this code:

  // claimed free slices: initialize the page partly
  if (!memid.initially_zero && memid.initially_committed) {
    mi_track_mem_undefined(page, slice_count * MI_ARENA_SLICE_SIZE);
    _mi_memzero_aligned(page, sizeof(*page));
  }

Maybe initially_committed was set to true mistakenly?... Could you take a look, please?

Noxybot avatar Jun 06 '25 20:06 Noxybot

It seems like in _mi_os_alloc_aligned we initialize mi_memid_t to _mi_memid_none() which leaves initially_committed in uninitialized state? ...

Noxybot avatar Jun 06 '25 20:06 Noxybot

It seems like in _mi_os_alloc_aligned we initialize mi_memid_t to _mi_memid_none() which leaves initially_committed in uninitialized state? ...

I don't think this is it as we reinitialize on success with mi_memid_create_os. I am not sure but since the trace shows a huge singleton block, it might be that only the commit is failing? I added an extra check for the return value of mi_arena_commit.

Still, it is strange since I think the commit parameter is true at mi_arenas_page_alloc_fresh (as we come from singleton) and therefore it should be committed -- if you can break on the failing memzero it would be great if you can see if the memid points to OS memory or arena memory. TBC

daanx avatar Jun 06 '25 23:06 daanx

Unfortunately, I'm unable to repro it :( It's a crash I've got from the wild. The fix looks promising, tho! I will give it a try a get back to you... Thanks!

Noxybot avatar Jun 07 '25 01:06 Noxybot

Just pushed some more improvements to handling of commit failure in dev3 -- hope it will fix the issue but not sure ..

daanx avatar Jun 07 '25 04:06 daanx

Hi @daanx , the crash is still present :( I've inspected code manually, and it seems like here (in arena.c), we forget to unset the "committed" bit in case commit failes:

      if (!mi_arena_commit(arena, p, mi_size_of_slices(slice_count), &commit_zero, mi_size_of_slices(slice_count - already_committed_count))) {
        // if the commit fails, we roll back
        _mi_arenas_free( p, mi_size_of_slices(slice_count), *memid);  // this will uncommit as well
      // need to call mi_bitmap_clearN(arena->slices_committed, slice_index, slice_count); here? <-------- HERE
        return NULL;
      }

Noxybot avatar Jun 13 '25 20:06 Noxybot

Hi @Noxybot -- ah yes! .. but we already found this issue last Tuesday and fixed it in the dev3-cdb branch -- I will port it to the dev3 branch soon! (apologies -- was just too busy over the past days). I hope that will fix the crash you are seeing!

daanx avatar Jun 13 '25 20:06 daanx

Hi @Noxybot -- ah yes! .. but we already found this issue last Tuesday and fixed it in the dev3-cdb branch -- I will port it to the dev3 branch soon! (apologies -- was just too busy over the past days). I hope that will fix the crash you are seeing!

Thanks for confirming. I’m looking forward to have this fix in dev3!

Noxybot avatar Jun 13 '25 23:06 Noxybot

I pushed v3.1.5 that should fix this now.

daanx avatar Jun 14 '25 05:06 daanx

Hi @Daanx, the latest dev3 is much better (~8 times less crashes), however there are still some:

  1. nullptr access in mi_page_map_set_range:
 mi_page_t** sub = mi_page_map_ensure_submap_at(idx);
    // set the offsets for the page
    while (sub_idx < MI_PAGE_MAP_SUB_COUNT) {
      sub[sub_idx] = page; //<---- sub might NULL here (see mi_page_map_ensure_submap_at)
      slice_count--; if (slice_count == 0) return;
      sub_idx++;
    }

mi_page_map_ensure_submap_at can return NULL:

    const size_t submap_size = MI_PAGE_MAP_SUB_SIZE;
    sub = (mi_page_t**)_mi_os_zalloc(submap_size, &memid);
    if (sub == NULL) {
      _mi_error_message(EFAULT, "internal error: unable to extend the page map\n");
      return NULL;
    }
  1. There seem to be some rare crashes in _mi_page_malloc_zero / mi_page_thread_collect_to_local, but it's hard to tell what's exactly wrong there. (Both crashes seems to be close to where we call mi_block_next).

Noxybot avatar Jun 16 '25 19:06 Noxybot

Sorry, I haven't shared callstack, which is extremely unhelpful :) Here is the callstack for the first issue:

1
mi_page_map_set_range(mi_page_s *,unsigned __int64,unsigned __int64,unsigned __int64)
‎page-map.c:321
2
mi_arenas_page_alloc_fresh(unsigned __int64,unsigned __int64,unsigned __int64,mi_arena_s *,int,bool,mi_tld_s *)
‎arena.c:716
3
mi_arenas_page_singleton_alloc(mi_heap_s *,unsigned __int64,unsigned __int64)
‎arena.c:767
4
mi_page_fresh_alloc(mi_heap_s *,mi_page_queue_s *,unsigned __int64,unsigned __int64)
‎page.c:306
5
mi_huge_page_alloc(mi_heap_s *,unsigned __int64,unsigned __int64,mi_page_queue_s *)
‎page.c:897
6
mi_find_page(mi_heap_s *,unsigned __int64,unsigned __int64)
‎page.c:925
7
_mi_malloc_generic(mi_heap_s *,unsigned __int64,bool,unsigned __int64)
‎page.c:968
8
_mi_heap_realloc_zero(mi_heap_s *,void *,unsigned __int64,bool)

And here is the callstack for second issue:

1
mi_page_thread_collect_to_local(mi_page_s *,mi_block_s *)
‎page.c:151
2
mi_page_thread_free_collect(mi_page_s *)
‎page.c:184
3
_mi_page_free_collect(mi_page_s *,bool)
‎page.c:191
4
_mi_heap_page_reclaim(mi_heap_s *,mi_page_s *)
‎page.c:277
5
mi_free_try_collect_mt(mi_page_s *,mi_block_s *)

Noxybot avatar Jun 16 '25 22:06 Noxybot

I fixed the first issue in the latest dev3. The second one is mysterious... I hope fixing the first issue fixes that one as well since I have no initial idea what the cause may be there.

daanx avatar Jun 18 '25 03:06 daanx

Hi Daan, the latest dev3 is much better (the best we've seen so far), however, there still seems to be a tiny crash the reason for which is unclear...

[0x0]   App.exe!mi_page_thread_collect_to_local+0x1f()   (Inline Function)   (Inline Function)   
[0x1]   App.exe!mi_page_thread_free_collect+0x42()   (Inline Function)   (Inline Function)   
[0x2]   App.exe!_mi_page_free_collect+0x4c(mi_page_s * page = 0x2982ede0000, bool force = false)   0x297ebc5f1c0   0x7ff728e0d50b   
[0x3]   App.exe!_mi_heap_page_reclaim+0x4b(mi_heap_s * heap = 0x29781ea9380, mi_page_s * page = 0x2982ede0000)   0x297ebc5f1f0   0x7ff728e03ae6   
[0x4]   App.exe!mi_free_try_collect_mt+0x336(mi_page_s * page = 0x2982ede0000, mt_free = <unavailable>)   0x297ebc5f220   0x7ff728dfe012   

It seems like at some point mi_block_next returns invalid next block.

Image

Noxybot avatar Jun 30 '25 18:06 Noxybot