Reduce peak RSS via cold dynablock reclamation
Telemetry on the workload (same config as in issue #2511, but with the new version of box64) shows that ~80.6% of blocks (out of 12,478 total) and ~77.2% of bytes are freed only at final exit.
This means peak RSS stays high throughout execution because long-lived dynablocks persist until teardown.
So I tried to count "use times" for each dynablock by tallying getDB() lookups, but found that only ~23 dynablock addresses account for ~131k lookups, while most blocks barely register.
My guess is that Box64 uses block chaining / direct linking (LinkNext()?), so getDB() isn't the only (or even the main) entry path into a dynablock.
| Rank | Address | Lookups | Share of total | Cumulative share |
|---|---|---|---|---|
| 1 | 0xffff68a02b62 | 69543 | 53.1% | 53.1% |
| 2 | 0xffff68a1399b | 15807 | 12.1% | 65.1% |
| 3 | 0xffff68a137d6 | 15807 | 12.1% | 77.2% |
| 4 | 0xffff68a105f5 | 12196 | 9.3% | 86.5% |
| 5 | 0xffff79e6ba4b | 8759 | 6.7% | 93.2% |
| 6 | 0xffff7cc61b85 | 4625 | 3.5% | 96.7% |
| 7 | 0xffff7e265c9c | 3588 | 2.7% | 99.4% |
| 8 | 0xffff80888425 | 261 | 0.2% | 99.6% |
| 9 | 0xffff9f2a4f65 | 261 | 0.2% | 99.8% |
| 10 | 0xffff808a9040 | 53 | 0.0% | 99.9% |
| 11 | 0xffff80a126ad | 27 | 0.0% | 99.9% |
| 12 | 0xffff80a42126 | 27 | 0.0% | 99.9% |
| 13 | 0xffff68a06492 | 20 | 0.0% | 99.9% |
| 14 | 0xffff7e2a94c3 | 20 | 0.0% | 100.0% |
| 15 | 0xffff63e47dab | 19 | 0.0% | 100.0% |
| 16 | 0xffff808abb9b | 15 | 0.0% | 100.0% |
| 17 | 0xffff68a028ea | 8 | 0.0% | 100.0% |
| 18 | 0xffff7cc4856b | 6 | 0.0% | 100.0% |
| 19 | 0xffff68a05853 | 2 | 0.0% | 100.0% |
| 20 | 0xffff68a04a8c | 2 | 0.0% | 100.0% |
| 21 | 0xffff808aa9e3 | 2 | 0.0% | 100.0% |
| 22 | 0xffff9f23c68e | 2 | 0.0% | 100.0% |
| 23 | 0xffff7e2a93c0 | 1 | 0.0% | 100.0% |
Question:
What’s the recommended way to measure per-dynablock entry counts given direct linking?
With reliable entry counts, we could reduce peak RSS by freeing the least-used blocks (LRU).
Thanks!
Block lifetime is a complicated matter. I did some experiments to free unused blocks, but they weren't very conclusive.
By design, blocks can be chained internally via the jumptable mechanism.
The right way to measure block usage would be to add a prolog to each block that does an atomic increment on a private per-dynablock counter. Detecting the exit of a block, or detecting when a block is not used anymore, is even more complicated, and things like the CALLRET optimisation make it more complicated still.
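Conceptually something like this, expressed at the C level (made-up names, not actual Box64 code; in practice the increment would be a couple of instructions emitted in the block prolog, not a function call):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extra field on the block descriptor (not real Box64 code). */
typedef struct profiled_block_s {
    void*            x64_addr;    /* guest address the block translates */
    size_t           native_size; /* size of the emitted native code    */
    _Atomic uint64_t use_count;   /* bumped by the emitted prolog       */
} profiled_block_t;

/* What the emitted prolog would do, expressed in C: a single relaxed
   atomic add on the block's own counter. */
static inline void block_prolog_hit(profiled_block_t* db)
{
    atomic_fetch_add_explicit(&db->use_count, 1, memory_order_relaxed);
}
```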
Thanks for the explanation!
If you have a moment, could you share a bit more about your experiment to reclaim unused blocks and how you measured it—especially:
- Counting entries under chaining/direct links
- Why the results were inconclusive?
One possible solution is to maintain an additional red-black tree that tracks every dynablock and query it to determine which block contains the current PC (a point lookup on every entry).
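For what it's worth, here is a rough sketch of the query that tree would answer, using a sorted array as a stand-in for the red-black tree (all names are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct range_entry_s {
    uintptr_t start;   /* native start address of the block */
    size_t    size;    /* size of the emitted code           */
    void*     block;   /* owning dynablock                   */
} range_entry_t;

/* Find the block whose [start, start+size) range contains pc.
   entries[] is kept sorted by start; a red-black tree would answer the
   same "greatest start <= pc" query in O(log n) while also supporting
   cheap insert/remove as blocks come and go. */
static void* block_containing(const range_entry_t* entries, size_t n, uintptr_t pc)
{
    size_t lo = 0, hi = n;            /* search for greatest start <= pc */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (entries[mid].start <= pc) lo = mid + 1;
        else                          hi = mid;
    }
    if (lo == 0) return NULL;
    const range_entry_t* e = &entries[lo - 1];
    return (pc < e->start + e->size) ? e->block : NULL;
}
```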
Thanks
Hi @ptitSeb,
I've been experimenting with tracking dynarec block usage by adding an atomic `usage_count` field to `dynablock_t` and instrumenting each block to increment it at runtime. I also implemented a global linked list (using the existing `mutex_dyndump` lock) that tracks all living blocks, allowing statistics collection about block lifecycle.
The testing shows that most dynamic blocks are executed very infrequently:
Simple program (qsort):
- Total: 72 blocks (15.4 KB)
- Executed ≤1 time: 22 blocks (30.6%) → 3.4 KB (22.1%)

Complex game (Petsitting):
- Total: 399,513 blocks (49.24 MB)
- Executed ≤1 time: 323,780 blocks (81.0%) → 32.93 MB (66.9%)

Full log: box64_block_stats.txt
This suggests that roughly 22–67% of dynarec memory is potentially wasted on cold blocks, especially in larger applications.
Question:
Would implementing a code cache with LRU/LFU eviction be feasible?
Potential benefits:
- Reduce memory usage by 30–70% (approximately 3–33 MB per process)
- Beneficial for memory-constrained devices
- No expected speed improvement (lookups are already O(1)), but better memory efficiency
Considerations:
This change would require significant modifications. For example, the current first-fit allocator (as shown in #2588) might suffer fragmentation under frequent allocation/deallocation cycles, so it may need to be replaced with a best-fit allocator.
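For concreteness, the difference between the two policies is only the selection rule; a minimal best-fit scan over a free list might look like this (illustrative types, not the allocator from #2588):

```c
#include <stddef.h>

typedef struct free_chunk_s {
    size_t               size;
    struct free_chunk_s* next;
} free_chunk_t;

/* First-fit returns the first chunk large enough; best-fit scans the
   whole free list and keeps the smallest chunk that still fits, trading
   a longer search for less fragmentation under frequent alloc/free churn. */
static free_chunk_t** best_fit(free_chunk_t** head, size_t want)
{
    free_chunk_t** best = NULL;
    for (free_chunk_t** link = head; *link; link = &(*link)->next)
        if ((*link)->size >= want &&
            (!best || (*link)->size < (*best)->size))
            best = link;
    return best;   /* NULL if nothing fits; caller unlinks or splits */
}
```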
I'm happy to contribute a complete implementation if this aligns with Box64's goals. I've already built the statistics-tracking infrastructure that could serve as the foundation (and a best-fit allocator as well).
Experiment details:
- Add an atomic `usage_count` field to the `dynablock_t` structure (a simplified sketch follows this list)
- Add instrumentation code at the beginning of each compiled block that atomically increments this counter
- Use a background statistics thread that periodically samples and reports block usage patterns
- Output logging to track block lifecycle and memory consumption over time
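A simplified sketch of the tracking side under the assumptions above (illustrative names; a private mutex stands in here for the existing `mutex_dyndump` lock):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified view of the added tracking state (illustrative names). */
typedef struct tracked_block_s {
    _Atomic uint64_t        usage_count;  /* bumped by the injected prolog  */
    size_t                  native_size;  /* bytes of emitted code          */
    struct tracked_block_s* next;         /* global list of living blocks   */
} tracked_block_t;

static tracked_block_t* g_live_blocks = NULL;
static pthread_mutex_t  g_live_lock   = PTHREAD_MUTEX_INITIALIZER;

/* Called once when a block is created: register it for statistics. */
static void track_block(tracked_block_t* b)
{
    pthread_mutex_lock(&g_live_lock);
    b->next = g_live_blocks;
    g_live_blocks = b;
    pthread_mutex_unlock(&g_live_lock);
}

/* What the sampler thread does periodically: count cold blocks/bytes. */
static void sample_stats(uint64_t threshold, size_t* cold_blocks, size_t* cold_bytes)
{
    *cold_blocks = 0;
    *cold_bytes  = 0;
    pthread_mutex_lock(&g_live_lock);
    for (tracked_block_t* b = g_live_blocks; b; b = b->next) {
        if (atomic_load_explicit(&b->usage_count, memory_order_relaxed) <= threshold) {
            ++*cold_blocks;
            *cold_bytes += b->native_size;
        }
    }
    pthread_mutex_unlock(&g_live_lock);
}
```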
How to Reproduce:
Note: This feature is still experimental and currently only runs on Raspberry Pi 5.
- Clone my branch: https://github.com/devarajabc/box64/tree/DB_PROF_WITH_MD
- Build & run Box64 as usual — it will save the log to box64_block_stats.txt.
The issue here is that to be able to free old/unused blocks, you need to make sure the blocks are not in use anymore. The counter you added is only good to see the "hotness" of a block, not whether a block is in use. That would require another counter, atomically incremented when entering the block and atomically decremented when leaving it. You also need to take into account DYNAREC_CALLRET=1, where the CALL opcode does not really exit the block. Also, some specific wrapped functions can never return (like longjmp or exit), while many will return (and so the block should be considered alive).
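For illustration, such an in-use guard could look roughly like this (hypothetical names, not actual Box64 code; the CALLRET and never-returning-wrapper cases above are exactly what make placing the decrement hard):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct guarded_block_s {
    _Atomic uint64_t hits;       /* "hotness": how often the block runs  */
    _Atomic int32_t  in_flight;  /* how many threads are inside it now   */
} guarded_block_t;

/* Emitted at block entry (conceptually). */
static inline void block_enter(guarded_block_t* b)
{
    atomic_fetch_add_explicit(&b->hits, 1, memory_order_relaxed);
    atomic_fetch_add_explicit(&b->in_flight, 1, memory_order_acquire);
}

/* Emitted at every real exit of the block.  With DYNAREC_CALLRET=1 a
   CALL does not leave the block, and some wrapped functions (longjmp,
   exit) never come back, so placing this decrement is the hard part. */
static inline void block_leave(guarded_block_t* b)
{
    atomic_fetch_sub_explicit(&b->in_flight, 1, memory_order_release);
}

/* A block may only be reclaimed once it is unlinked AND idle. */
static inline bool block_is_idle(guarded_block_t* b)
{
    return atomic_load_explicit(&b->in_flight, memory_order_acquire) == 0;
}
```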
So, all this can of course be handled, but it adds more complexity. There are scenarios where this would still be interesting (like, for example, running Steam). At minimum it should be configurable, ideally with some runtime setting like BOX64_DYNAREC_RECYCLE. Without the setting, dynablock generation would not be touched; with the setting, all the instrumentation gets injected at block creation (and DynaCache had better be disabled).
Thank you for providing so many details that I wasn’t aware of. I’ll do more experiments and update here.
Yes, just as you suggested, the code cache system should be an optional feature, since this mechanism may not benefit every scenario.
Note that I will probably work on this subject very soon. A first implementation, with limitations, but enough to already see a reduced memory footprint when activated.
Great to hear that you’re already working on an initial implementation. I can help on complementary areas — for example, evaluating different cache-replacement policies, validating the heuristics with profiling data, or running targeted benchmarks to measure the impact. Let me know which direction would be most useful. Thanks!
That sounds great.
I will probably just do a simple "remove all the free blocks executed only once" strategy before allocating a new block as a first implementation. Which is quite a naive approach, and certainly not the most efficient one.
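Roughly this kind of sweep, conceptually (illustrative types only, nothing from the actual code):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct once_block_s {
    _Atomic uint64_t     hits;    /* execution count from the prolog */
    int                  in_use;  /* still reachable or running?     */
    size_t               size;
    struct once_block_s* next;
} once_block_t;

/* Before allocating a new block: sweep the live list and release every
   block that is idle and was executed at most once. */
static size_t sweep_run_once_blocks(once_block_t** head)
{
    size_t reclaimed = 0;
    once_block_t** link = head;
    while (*link) {
        once_block_t* b = *link;
        if (!b->in_use &&
            atomic_load_explicit(&b->hits, memory_order_relaxed) <= 1) {
            *link = b->next;      /* unlink, then release the block */
            reclaimed += b->size;
            free(b);
        } else {
            link = &b->next;
        }
    }
    return reclaimed;
}
```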
Sounds good! I can help by testing this approach and providing profiling data (or anything else I can do).
BTW, I’m currently experimenting with an LFU-based approach in my local branch (the counting part is already implemented, but not the eviction logic yet).
What’s the best way to collaborate on this? Should I keep updating findings in this issue, or would you prefer discussing it elsewhere (e.g., Discord or a separate thread)?
I guess this ticket is fine unless you want more realtime discussion.
Thanks! I’d actually like to have more real-time discussions if possible. Do you prefer using Discord, Slack, or any other channel for that?
pushed a first implementation there: https://github.com/ptitSeb/box64/commit/81b080eb5d5dd2a23e5b952d11efea24bfe4c2fa