[Doc] Box64 Memory Allocator (for review and feedback)
This document describes the memory management design in Box64 (custommem.c).
The goal is to verify the correctness and gather feedback for improvements.
Below is the current draft:
High-Level Overview
The memory allocator operates similarly to a segregated free list by maintaining an array of blocks. In this context, a "block" refers to one of three types of fixed-size memory mappings:
- 512 KB blocks for general use
- 128 KB blocks for fixed 128-byte slices
- 2 MB blocks for dynarec (dynamic recompilation)
Within each block, smaller pieces — called subblocks — are carved out. Subblocks are the actual allocation units returned to the caller.
There are two free-subblock tracking implementations:
- BTYPE_LIST: Managed using an implicit free list, with blockmark_t headers placed before each payload. The end of a block is indicated by a special subblock where next.x32 == 0.
- BTYPE_MAP: Subblocks are tracked using a bitmap at the end of the block.
Each block is tracked by a blocklist_s structure — metadata for each mapped region:
typedef struct blocklist_s {
void* block; // Base address of the mapped region
size_t maxfree; // Largest contiguous free space (in bytes) available
size_t size; // Total size of the region (including bitmap if present)
void* first; // Pointer to bitmap (`BTYPE_MAP`) or to the first free subblock (`BTYPE_LIST`)
uint32_t lowest; // Hint index for next allocation scan (bitmap only)
uint8_t type; // Either `BTYPE_LIST` or `BTYPE_MAP`
} blocklist_t;
In BTYPE_LIST, subblocks are linked implicitly via the offset field (offs) in blockmark_s:
typedef struct blockmark_s {
mark_t prev; // Marker for the previous subblock
mark_t next; // Marker for the next subblock
uint8_t mark[]; // Flexible array member; payload starts here
} blockmark_t;
Where mark_t represents the distance between two subblock headers:
typedef union mark_s {
struct {
unsigned int offs : 31; // Byte offset to the next or previous header
unsigned int fill : 1; // Allocation flag (0 = free, 1 = allocated)
};
uint32_t x32;
} mark_t;
A set of macros is provided to help navigate between subblocks inside a block.
These macros rely on the offs fields inside the mark_t structures:
#define NEXT_BLOCK(b) ((blockmark_t*)((uintptr_t)(b) + (b)->next.offs))
// Get the pointer to the next subblock.
#define PREV_BLOCK(b) ((blockmark_t*)((uintptr_t)(b) - (b)->prev.offs))
// Get the pointer to the previous subblock.
#define LAST_BLOCK(b, s) ((blockmark_t*)((uintptr_t)(b) + (s) - sizeof(blockmark_t)))
// Get the pointer to the last subblock (the terminator) of a block with total size 's'.
#define SIZE_BLOCK(b) (((ssize_t)(b).next.offs) - sizeof(blockmark_t))
// Get the usable payload size of a subblock (excluding the size of its own header).
That seems pretty accurate, yes. One note: first, in the case of BTYPE_LIST, is the first "free" block, not just the first block...
I might introduce a union inside the blocklist_t structure, to get a cleaner view...
Thanks for the note about first — already fixed it.
Actually, the reason I wrote this document is because the naming and structure in custommem.c felt a bit chaotic at first.
I thought writing it down clearly could help contributors (including myself) better understand.
Would you prefer that I add this document as a comment inside custommem.c, or keep it as a separate file?
The doc can be separated, that's not a problem. There are still many things that could be documented. Like BTYPE_MAP: it was introduced to help with the small allocations, which are used a lot internally by the rbtree and the early steps of the hash table... The hope was to be able to do a lock-free allocator (still not done, not sure it's doable because of the "lowest" optimisation). There is some special-case handling in BTYPE_MAP for when the rbtrees are not inited yet... Or when there is a re-entrant call because of an rbtree allocation...
Also, a custom allocator is needed because the glibc allocator can be interrupted by a signal, which would lead to a deadlock...
Thanks a lot for the detailed explanation! If you have time, feel free to list any other tricky parts or important areas that you think should also be documented — I’d be happy to include them.
Allocator
The custom memory system in box64 provides three distinct allocators, all sharing the same underlying data structures.
- map128_customMalloc: A simple bitmap-based allocator that allocates fixed 128-byte slices only.
- internal_customMalloc: A generalized allocator for memory requests larger than 128 bytes.
- AllocDynarecMap: A specialized allocator used by the dynamic recompiler.
For map128_customMalloc and internal_customMalloc, blocks are tracked using a global array named p_blocks:
static int n_blocks = 0; // how many blocks have been allocated so far
static int c_blocks = 0; // capacity of the p_blocks[] array
static blocklist_t* p_blocks = NULL; // the dynamically resized array of blocklist_t
For AllocDynarecMap, blocks are tracked via a global linked list called mmaplist:
static mmaplist_t *mmaplist = NULL;
Each blocklist_t is wrapped inside a mapchunk_t structure:
typedef struct mapchunk_s {
blocklist_t chunk;
} mapchunk_t;
These mapchunk_t entries are stored in a fixed-size array within each mmaplist_t node:
typedef struct mmaplist_s {
mapchunk_t chunks[NCHUNK]; // 64 blocklist_t slots
struct mmaplist_s* next; // link to the next group of 64
} mmaplist_t;
In short, mmaplist_t is simply a linked list, where each node contains an array of chunks — each chunk wrapping a blocklist_t.
[ mmaplist0 ] → [ mmaplist1 ] → [ mmaplist2 ] → …
│ │ │
chunks[0] chunks[0] chunks[0] each chunk[i] is a blocklist_t
chunks[1] chunks[1] chunks[1]
... ... ...
chunks[63] chunks[63] chunks[63]
Placement
Allocators in custommem.c adopt an Address-Ordered First-Fit strategy, in which blocks are kept in strictly ascending order of their base addresses.
When a range of $K$ bytes is requested, the allocator performs a linear scan of the p_blocks array (or mmaplist_t in the case of AllocDynarecMap) and selects the first block that:
- matches the appropriate type (BTYPE_MAP or BTYPE_LIST),
- has sufficient free space (maxfree ≥ $K$),
- and satisfies platform-specific address constraints (e.g., within 32-bit bounds if needed).
If a suitable block is not found after a complete traversal, a new memory region is requested from the kernel using an mmap-like function, depending on the allocation context. Once the memory is mapped, a new blocklist_t entry is initialized to track the allocated region.
┌────────┬────────┬───────────────────────────────┬────────┬────────┐
│ m.prev │ m.next │ PAYLOAD │ n.prev │ n.next │
│ 0 │ offs │ (allocsize - 2·sizeof(mark)) │ offs │ 0 │
└────────┴────────┴───────────────────────────────┴────────┴────────┘
↑ ↑
p (block start) p + allocsize (block end)
Blocks allocated by map128_customMalloc and internal_customMalloc are tracked in an auxiliary red-black tree called blockstree. Each node stores the block’s start and end addresses, and its data field holds the index of the block in the p_blocks array.
Blocks allocated by AllocDynarecMap are tracked in a separate red-black tree, rbt_dynmem, where each node’s data field stores a pointer to the corresponding chunk entry.
Both trees provide an efficient mechanism for resolving a memory address to its corresponding blocklist_t.
@ptitSeb , Here are the second and third parts of the document.
Also, is the main difference between AllocDynarecMap() and internal_customMalloc() that AllocDynarecMap() is designed for single-threaded use, since it doesn't use any locks or atomic operations?
This looks correct to me.
Note that I have just introduced "DynaCache", which serializes/deserializes the Dynarec map to disk. There have been a few changes internally to make this work, but I don't think it changed any of the details of the memory allocator... (this also explains the difference between AllocDynarecMap and internal_customMalloc)
OK, I will update the document since some of the structure has changed.
For the lock-free memory allocator, the initial idea is to exploit the structure of p_blocks, which already stores several independent free lists: we don't need to protect the whole of p_blocks, so memory requests can be satisfied from different free blocks within p_blocks simultaneously (still in progress).
But a key question remains: does box64 fork itself when the executable it is emulating issues a fork system call? Things become more complicated once more than one block in p_blocks is involved.
Allocator
The custom memory allocation system in Box64 comprises four distinct allocators, all utilizing shared underlying data structures. These allocators function similarly to segregated free lists by maintaining arrays of memory blocks. A block refers to one of several fixed-size memory pools. Each block is subdivided into smaller units called subblocks, which represent the actual memory chunks returned to callers upon allocation.
| Allocator | Subblock Tracking | Block Size | Description |
|---|---|---|---|
| internal_customMalloc | BTYPE_LIST | 512 KB | Implicit free list; general allocations. |
| map128_customMalloc | BTYPE_MAP | 128 KB | Bitmap; optimized for fixed 128-byte slices. |
| map64_customMalloc | BTYPE_MAP64 | 128 KB | Bitmap; optimized for fixed 64-byte slices. |
| AllocDynarecMap | BTYPE_LIST | 2 MB / 128 KB (first page) | Implicit free list; used for dynarec allocations. |
Internal Structure
Blocks are managed through the blocklist_s structure, which stores metadata for each mapped memory region:
typedef struct blocklist_s {
void* block; // Base address of the mapped region
size_t maxfree; // Largest contiguous free space (in bytes) available
size_t size; // Total size of the region (including bitmap if present)
void* first; // Pointer to bitmap (`BTYPE_MAP`) or to the first free subblock (`BTYPE_LIST`)
uint32_t lowest; // Hint index for next allocation scan (bitmap only)
uint8_t type; // `BTYPE_LIST`, `BTYPE_MAP64` or `BTYPE_MAP`
uint8_t is32bits; // Indicates whether the block operates in 32-bit mode
} blocklist_t;
In BTYPE_LIST, subblocks are linked implicitly via the offset field (offs) in blockmark_s:
typedef struct blockmark_s {
mark_t prev; // Marker for the previous subblock
mark_t next; // Marker for the next subblock
uint8_t mark[]; // Flexible array member; payload starts here
} blockmark_t;
┌────────┬────────┬───────────────────────────────┬────────┬────────┐
│ m.prev │ m.next │ PAYLOAD │ n.prev │ n.next │
│ 0 │ offs │ (allocsize - 2·sizeof(mark)) │ offs │ 0 │
└────────┴────────┴───────────────────────────────┴────────┴────────┘
↑ ↑
p (block start) p + allocsize (block end)
Where mark_t represents the distance between two subblock headers:
typedef union mark_s {
struct {
unsigned int offs : 31; // Byte offset to the next or previous header
unsigned int fill : 1; // Allocation flag (0 = free, 1 = allocated)
};
uint32_t x32;
} mark_t;
Block Tracking
For map64_customMalloc, map128_customMalloc and internal_customMalloc, blocks are tracked using a global array named p_blocks:
static int n_blocks = 0; // how many blocks have been allocated so far
static int c_blocks = 0; // capacity of the p_blocks[] array
static blocklist_t* p_blocks = NULL; // the dynamically resized array of blocklist_t
Each block in p_blocks may use a different subblock tracking type, depending on which allocator created it.
In addition, each block is also tracked in an auxiliary red-black tree called blockstree. Each node stores the block’s start and end addresses, and its data field holds the index of the block in the p_blocks array.
For AllocDynarecMap, blocks are tracked via a per-file structure called mmaplist_t:
mmaplist_t* ml = GetMmaplistByAddr(addr);
Internally, each mmaplist_t instance maintains an expandable array of blocklist_t entries:
typedef struct mmaplist_s {
blocklist_t** chunks;
int cap;     // Capacity of the chunks array
int size;    // Number of active entries currently stored in chunks
int has_new; // Set to 1 when new entries are appended
int dirty;   // Set to 1 if any region has been unmapped or requires cleanup
} mmaplist_t;
Blocks allocated by AllocDynarecMap are tracked in a separate red-black tree, rbt_dynmem, where each node’s data field stores a pointer to the corresponding block.
Placement
Allocators in custommem.c adopt an Address-Ordered First-Fit strategy, in which blocks are kept in strictly ascending order of their base addresses.
When a range of $K$ bytes is requested, the allocator performs a linear scan of the p_blocks array (or mmaplist_t in the case of AllocDynarecMap) and selects the first block that:
- matches the appropriate type,
- has sufficient free space (maxfree ≥ $K$),
- and satisfies platform-specific address constraints (e.g., within 32-bit bounds if needed).
If a suitable block is not found after a complete traversal, a new memory region is requested from the kernel using an mmap-like function, depending on the allocation context. Once the memory is mapped, a new blocklist_t entry is initialized to track the allocated region.
Additionally, I conducted an experiment measuring the memory usage of the 3D game "Petseting" running on a Raspberry Pi 5, and found that the RSS remains high throughout execution because long-lived memory blocks persist until teardown.
This usage pattern helps explain why the current first-fit placement policy performs well in this scenario. In contrast, a best-fit strategy might perform poorly here, since it relies on reclaiming freed memory, which rarely happens in this case.