[Doc] Box64 Memory Allocator (for review and feedback)
This document describes the memory management design in Box64 (custommem.c).
The goal is to verify the correctness and gather feedback for improvements.
Below is the current draft:
High-Level Overview
The memory allocator operates similarly to a segregated free list by maintaining an array of blocks. In this context, a "block" refers to one of three types of fixed-size memory mappings:
- 512 KB blocks for general use
- 128 KB blocks for fixed 128-byte slices
- 2 MB blocks for dynarec (dynamic recompilation)
Within each block, smaller pieces — called subblocks — are carved out. Subblocks are the actual allocation units returned to the caller.
There are two free-subblock tracking implementations:
- BTYPE_LIST: Managed using an implicit free list, with blockmark_t headers placed before each payload. The end of a block is indicated by a special subblock where next.x32 == 0.
- BTYPE_MAP: Subblocks are tracked using a bitmap at the end of the block.
Each block is tracked by a blocklist_s structure — metadata for each mapped region:
typedef struct blocklist_s {
void* block; // Base address of the mapped region
size_t maxfree; // Largest contiguous free space (in bytes) available
size_t size; // Total size of the region (including bitmap if present)
void* first; // Pointer to bitmap (`BTYPE_MAP`) or to the first free subblock (`BTYPE_LIST`)
uint32_t lowest; // Hint index for next allocation scan (bitmap only)
uint8_t type; // Either `BTYPE_LIST` or `BTYPE_MAP`
} blocklist_t;
In BTYPE_LIST, subblocks are linked implicitly via the offset field (offs) in blockmark_s:
typedef struct blockmark_s {
mark_t prev; // Marker for the previous subblock
mark_t next; // Marker for the next subblock
uint8_t mark[]; // Flexible array member; payload starts here
} blockmark_t;
Where mark_t represents the distance between two subblock headers:
typedef union mark_s {
struct {
unsigned int offs : 31; // Byte offset to the next or previous header
unsigned int fill : 1; // Allocation flag (0 = free, 1 = allocated)
};
uint32_t x32;
} mark_t;
A set of macros is provided to help navigate between subblocks inside a block.
These macros rely on the offs fields inside the mark_t structures:
#define NEXT_BLOCK(b) ((blockmark_t*)((uintptr_t)(b) + (b)->next.offs))
// Get the pointer to the next subblock.
#define PREV_BLOCK(b) ((blockmark_t*)((uintptr_t)(b) - (b)->prev.offs))
// Get the pointer to the previous subblock.
#define LAST_BLOCK(b, s) ((blockmark_t*)((uintptr_t)(b) + (s) - sizeof(blockmark_t)))
// Get the pointer to the last subblock (the terminator) of a block with total size 's'.
#define SIZE_BLOCK(b) (((ssize_t)(b).next.offs) - sizeof(blockmark_t))
// Get the usable payload size of a subblock (excluding the size of its own header).
That seems pretty accurate, yes. One note: first, in the case of BTYPE_LIST, is the first "free" block, not just the first block...
I might introduce a union inside the blocklist_t structure, to get a cleaner view...
Thanks for the note about first — already fixed it.
Actually, the reason I wrote this document is because the naming and structure in custommem.c felt a bit chaotic at first.
I thought writing it down clearly could help contributors (including myself) better understand.
Would you prefer that I add this document as a comment inside custommem.c, or keep it as a separate file?
The doc can be separated, that's not a problem. There are still many things that could be documented. Like BTYPE_MAP: it was introduced to help with the small allocations, which are used a lot internally by the rbtree and the early steps of the hash table... The hope was to be able to do a lock-free allocator (still not done, not sure it's doable because of the "lowest" optimisation). There is some special-case handling in BTYPE_MAP for when the rbtrees are not inited yet... Or when there is a re-entrant call because of an rbtree allocation...
Also, a custom allocator is needed because the glibc allocator can be interrupted by a signal, which would lead to a deadlock...
Thanks a lot for the detailed explanation! If you have time, feel free to list any other tricky parts or important areas that you think should also be documented — I’d be happy to include them.
Allocator
The custom memory system in box64 provides three distinct allocators, all sharing the same underlying data structures.
- map128_customMalloc: A simple bitmap-based allocator that allocates fixed 128-byte slices only.
- internal_customMalloc: A generalized allocator for memory requests larger than 128 bytes.
- AllocDynarecMap: A specialized allocator used by the dynamic recompiler.
For map128_customMalloc and internal_customMalloc, blocks are tracked using a global array named p_blocks:
static int n_blocks = 0; // how many blocks have been allocated so far
static int c_blocks = 0; // capacity of the p_blocks[] array
static blocklist_t* p_blocks = NULL; // the dynamically resized array of blocklist_t
For AllocDynarecMap, blocks are tracked via a global linked list called mmaplist:
static mmaplist_t *mmaplist = NULL;
Each blocklist_t is wrapped inside a mapchunk_t structure:
typedef struct mapchunk_s {
blocklist_t chunk;
} mapchunk_t;
These mapchunk_t entries are stored in a fixed-size array within each mmaplist_t node:
typedef struct mmaplist_s {
mapchunk_t chunks[NCHUNK]; // 64 blocklist_t slots
struct mmaplist_s* next; // link to the next group of 64
} mmaplist_t;
In short, mmaplist_t is simply a linked list, where each node contains an array of chunks — each chunk wrapping a blocklist_t.
[ mmaplist0 ] → [ mmaplist1 ] → [ mmaplist2 ] → …
│ │ │
chunks[0] chunks[0] chunks[0] each chunk[i] is a blocklist_t
chunks[1] chunks[1] chunks[1]
... ... ...
chunks[63] chunks[63] chunks[63]
Placement
Allocators in custommem.c adopt an Address-Ordered First-Fit strategy, in which blocks are kept in strictly ascending order of their base addresses.
When a range of $K$ bytes is requested, the allocator performs a linear scan of the p_blocks array (or mmaplist_t in the case of AllocDynarecMap) and selects the first block that:
- matches the appropriate type (BTYPE_MAP or BTYPE_LIST),
- has sufficient free space (maxfree ≥ $K$),
- and satisfies platform-specific address constraints (e.g., within 32-bit bounds if needed).
If a suitable block is not found after a complete traversal, a new memory region is requested from the kernel using an mmap-like function, depending on the allocation context. Once the memory is mapped, a new blocklist_t entry is initialized to track the allocated region.
┌────────┬────────┬───────────────────────────────┬────────┬────────┐
│ m.prev │ m.next │ PAYLOAD │ n.prev │ n.next │
│ 0 │ offs │ (allocsize - 2·sizeof(mark)) │ offs │ 0 │
└────────┴────────┴───────────────────────────────┴────────┴────────┘
↑ ↑
p (block start) p + allocsize (block end)
Blocks allocated by map128_customMalloc and internal_customMalloc are tracked in an auxiliary red-black tree called blockstree. Each node stores the block’s start and end addresses, and its data field holds the index of the block in the p_blocks array.
Blocks allocated by AllocDynarecMap are tracked in a separate red-black tree, rbt_dynmem, where each node’s data field stores a pointer to the corresponding chunk entry.
Both trees provide an efficient mechanism for resolving a memory address to its corresponding blocklist_t.
@ptitSeb , Here are the second and third parts of the document.
Also, is the main difference between AllocDynarecMap() and internal_customMalloc() that AllocDynarecMap() is designed for single-threaded use, since it doesn't use any locks or atomic operations?
This looks correct to me.
Note that I have just introduced "DynaCache", which serializes/deserializes the Dynarec map to disk. There have been a few changes internally to make this work, but I don't think it changed any of the details of the memory allocator... (this also explains the difference between AllocDynarecMap and internal_customMalloc)
OK, I will update the document since some of the structure has changed.
For the lock-free memory allocator, the initial idea is to exploit the structure of p_blocks, which already stores several independent free lists: we don't need to protect the whole of p_blocks, so memory requests can be satisfied from different free blocks within p_blocks simultaneously (still in progress).
But a key question remains: does box64 fork itself when the executable it is emulating issues a fork system call? Things become more complicated once more than one block in p_blocks is involved.
Allocator
The custom memory allocation system in Box64 comprises four distinct allocators, all utilizing shared underlying data structures. These allocators function similarly to segregated free lists by maintaining arrays of memory blocks. A block refers to one of several fixed-size memory pools. Each block is subdivided into smaller units called subblocks, which represent the actual memory chunks returned to callers upon allocation.
| Allocator | Subblock Tracking | Block Size | Description |
|---|---|---|---|
| internal_customMalloc | BTYPE_LIST | 512 KB | Implicit free list; general allocations. |
| map128_customMalloc | BTYPE_MAP | 128 KB | Bitmap; optimized for fixed 128-byte slices. |
| map64_customMalloc | BTYPE_MAP64 | 128 KB | Bitmap; optimized for fixed 64-byte slices. |
| AllocDynarecMap | BTYPE_LIST | 2 MB / 128 KB (first page) | Implicit free list; used for dynarec allocations. |
Internal Structure
Blocks are managed through the blocklist_s structure, which stores metadata for each mapped memory region:
typedef struct blocklist_s {
void* block; // Base address of the mapped region
size_t maxfree; // Largest contiguous free space (in bytes) available
size_t size; // Total size of the region (including bitmap if present)
void* first; // Pointer to bitmap (`BTYPE_MAP`) or to the first free subblock (`BTYPE_LIST`)
uint32_t lowest; // Hint index for next allocation scan (bitmap only)
uint8_t type; // `BTYPE_LIST`, `BTYPE_MAP64` or `BTYPE_MAP`
uint8_t is32bits; // Indicates whether the block operates in 32-bit mode
} blocklist_t;
In BTYPE_LIST, subblocks are linked implicitly via the offset field (offs) in blockmark_s:
typedef struct blockmark_s {
mark_t prev; // Marker for the previous subblock
mark_t next; // Marker for the next subblock
uint8_t mark[]; // Flexible array member; payload starts here
} blockmark_t;
┌────────┬────────┬───────────────────────────────┬────────┬────────┐
│ m.prev │ m.next │ PAYLOAD │ n.prev │ n.next │
│ 0 │ offs │ (allocsize - 2·sizeof(mark)) │ offs │ 0 │
└────────┴────────┴───────────────────────────────┴────────┴────────┘
↑ ↑
p (block start) p + allocsize (block end)
Where mark_t represents the distance between two subblock headers:
typedef union mark_s {
struct {
unsigned int offs : 31; // Byte offset to the next or previous header
unsigned int fill : 1; // Allocation flag (0 = free, 1 = allocated)
};
uint32_t x32;
} mark_t;
Block Tracking
For map64_customMalloc, map128_customMalloc and internal_customMalloc, blocks are tracked using a global array named p_blocks:
static int n_blocks = 0; // how many blocks have been allocated so far
static int c_blocks = 0; // capacity of the p_blocks[] array
static blocklist_t* p_blocks = NULL; // the dynamically resized array of blocklist_t
Each block in p_blocks may use a different subblock tracking type, depending on which allocator created it.
In addition, each block is also tracked in an auxiliary red-black tree called blockstree. Each node stores the block’s start and end addresses, and its data field holds the index of the block in the p_blocks array.
For AllocDynarecMap, blocks are tracked via a per-file structure called mmaplist_t:
mmaplist_t* ml = GetMmaplistByAddr(addr);
Internally, each mmaplist_t instance maintains an expandable array of blocklist_t entries:
typedef struct mmaplist_s {
blocklist_t** chunks;
int cap;     // Capacity of the chunks array
int size;    // Number of active entries currently stored in chunks
int has_new; // Set to 1 when new entries are appended
int dirty;   // Set to 1 if any region has been unmapped or requires cleanup
} mmaplist_t;
Blocks allocated by AllocDynarecMap are tracked in a separate red-black tree, rbt_dynmem, where each node’s data field stores a pointer to the corresponding block.
Placement
Allocators in custommem.c adopt an Address-Ordered First-Fit strategy, in which blocks are kept in strictly ascending order of their base addresses.
When a range of $K$ bytes is requested, the allocator performs a linear scan of the p_blocks array (or mmaplist_t in the case of AllocDynarecMap) and selects the first block that:
- matches the appropriate type,
- has sufficient free space (maxfree ≥ $K$),
- and satisfies platform-specific address constraints (e.g., within 32-bit bounds if needed).
If a suitable block is not found after a complete traversal, a new memory region is requested from the kernel using an mmap-like function, depending on the allocation context. Once the memory is mapped, a new blocklist_t entry is initialized to track the allocated region.
Additionally, I conducted an experiment measuring the memory usage of the 3D game "Petseting" running on a Raspberry Pi 5, and found that the RSS remains high throughout execution because long-lived memory blocks persist until teardown.
This usage pattern helps explain why the current first-fit placement policy performs well in this scenario. In contrast, a best-fit strategy might perform poorly here, since it relies on reclaiming freed memory, which rarely happens in this case.