Reduce peak RSS via cold dynablock reclamation

Open devarajabc opened this issue 4 months ago • 12 comments

Telemetry on a workload (same config as in issue #2511, but with the new version of box64) shows that ~80.6% of blocks (out of 12,478 total) and ~77.2% of bytes are freed only at final exit.

This means peak RSS stays high throughout execution because long-lived dynablocks persist until teardown.

So I tried to count “use times” for each dynamic block by tallying getDB() lookups, but found that only ~23 dynablock addresses account for ~131k lookups, while most blocks barely register.

My guess is that Box64 uses block chaining / direct linking (LinkNext()?), so getDB() isn’t the only (or even the main) entry path into a dynablock.

Rank  Address         Lookups  Share of total  Cum. of total
1     0xffff68a02b62  69543    53.1%           53.1%
2     0xffff68a1399b  15807    12.1%           65.1%
3     0xffff68a137d6  15807    12.1%           77.2%
4     0xffff68a105f5  12196    9.3%            86.5%
5     0xffff79e6ba4b  8759     6.7%            93.2%
6     0xffff7cc61b85  4625     3.5%            96.7%
7     0xffff7e265c9c  3588     2.7%            99.4%
8     0xffff80888425  261      0.2%            99.6%
9     0xffff9f2a4f65  261      0.2%            99.8%
10    0xffff808a9040  53       0.0%            99.9%
11    0xffff80a126ad  27       0.0%            99.9%
12    0xffff80a42126  27       0.0%            99.9%
13    0xffff68a06492  20       0.0%            99.9%
14    0xffff7e2a94c3  20       0.0%            100.0%
15    0xffff63e47dab  19       0.0%            100.0%
16    0xffff808abb9b  15       0.0%            100.0%
17    0xffff68a028ea  8        0.0%            100.0%
18    0xffff7cc4856b  6        0.0%            100.0%
19    0xffff68a05853  2        0.0%            100.0%
20    0xffff68a04a8c  2        0.0%            100.0%
21    0xffff808aa9e3  2        0.0%            100.0%
22    0xffff9f23c68e  2        0.0%            100.0%
23    0xffff7e2a93c0  1        0.0%            100.0%

Question:

What’s the recommended way to measure per-dynablock entry counts given direct linking?

With reliable entry counts, we could reduce peak RSS by freeing the least-used blocks (for example with LRU eviction).

Thanks!

devarajabc avatar Aug 31 '25 04:08 devarajabc

Block lifetime is a complicated matter. I did some experiments to free unused blocks, but the results weren't very conclusive.

By design, blocks can be chained internally, through the jumptable mechanism.

The right way to measure block usage would be to add a prolog to each block and use an atomic increment on a private dynablock counter. Detecting the exit of a block, or detecting when a block is no longer in use, is even more complicated, and things like the Callret optimisation complicate it further.

ptitSeb avatar Aug 31 '25 06:08 ptitSeb

Thanks for the explanation!

If you have a moment, could you share a bit more about your experiment to reclaim unused blocks and how you measured it—especially:

  • Counting entries under chaining/direct links
  • Why the results were inconclusive

One possible solution is to maintain an additional red-black tree that tracks every dynablock, and to query it to determine which block contains the current PC (a point lookup every time).
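A minimal sketch of that point lookup, assuming blocks occupy non-overlapping address ranges. To keep it short, a sorted array with binary search stands in for the red-black tree; the query is the same ("find the range with the largest start not above the PC, then check the size"). `demo_range_t` and `demo_find_block` are hypothetical names, not Box64 APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address range covered by one block. */
typedef struct demo_range {
    uintptr_t start;   /* first address covered */
    size_t size;       /* number of bytes covered */
} demo_range_t;

/* r must be sorted by start and non-overlapping.
 * Returns the index of the range containing pc, or -1 if there is none. */
static ptrdiff_t demo_find_block(const demo_range_t *r, size_t n, uintptr_t pc) {
    assert(r != NULL || n == 0);
    size_t lo = 0, hi = n;
    while (lo < hi) {                 /* count the ranges with start <= pc */
        size_t mid = lo + (hi - lo) / 2;
        if (r[mid].start <= pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return -1;                    /* pc is below every range */
    const demo_range_t *c = &r[lo - 1];
    return (pc - c->start < c->size) ? (ptrdiff_t)(lo - 1) : -1;
}
```

Each lookup costs O(log n), so doing it on every block transition would likely be much more expensive than a counter increment inside the block.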

Thanks

devarajabc avatar Aug 31 '25 22:08 devarajabc

Hi @ptitSeb,

I've been experimenting with tracking dynarec block usage by adding an atomic usage_count field to dynablock_t and instrumenting each block to increment it at runtime. I also implemented a global linked list (using the existing mutex_dyndump lock) that tracks all living blocks, allowing statistics collection about block lifecycle.

The testing shows that most dynamic blocks are executed very infrequently:

Simple program (qsort):

Total: 72 blocks (15.4 KB)
Executed ≤1 time: 22 blocks (30.6%) → 3.4 KB (22.1%)

Complex game (Petsitting):

Total: 399,513 blocks (49.24 MB)
Executed ≤1 time: 323,780 blocks (81.0%) → 32.93 MB (66.9%)

Full log: box64_block_stats.txt

This suggests that roughly 22–67% of dynarec memory is potentially wasted on cold blocks, especially in larger applications.
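For reference, a summary like the one above can be derived from per-block counters as sketched below. This is an illustration only, with hypothetical names (`demo_stat_t`, `demo_summarize`, `demo_percent`); it is not the code from the branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One record per living block: how often it ran and how big it is. */
typedef struct demo_stat {
    uint64_t hits;
    size_t size;
} demo_stat_t;

typedef struct demo_summary {
    size_t total_blocks, cold_blocks;
    size_t total_bytes, cold_bytes;
} demo_summary_t;

/* A block is "cold" when it was executed at most `threshold` times. */
static demo_summary_t demo_summarize(const demo_stat_t *s, size_t n,
                                     uint64_t threshold) {
    assert(s != NULL || n == 0);
    demo_summary_t out = {0, 0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        out.total_blocks++;
        out.total_bytes += s[i].size;
        if (s[i].hits <= threshold) {
            out.cold_blocks++;
            out.cold_bytes += s[i].size;
        }
    }
    return out;
}

static double demo_percent(size_t part, size_t whole) {
    return whole ? 100.0 * (double)part / (double)whole : 0.0;
}
```

With a threshold of 1, `demo_percent(cold_blocks, total_blocks)` and `demo_percent(cold_bytes, total_bytes)` correspond to the two percentages reported per workload.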

Question:

Would implementing a code cache with LRU/LFU eviction be feasible?

Potential benefits:

  • Reduce dynarec memory usage by up to roughly 22–67% in the two tests above (from about 3.4 KB for qsort to about 33 MB for Petsitting)
  • Beneficial for memory-constrained devices
  • No expected speed improvement (lookups are already O(1)), but better memory efficiency

Considerations:

This change would require significant modifications. For example, the current first-fit allocator (as shown in #2588) might suffer from fragmentation under frequent allocation/deallocation cycles, so it may need to be replaced with a best-fit allocator.
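To illustrate the difference between the two policies, here is a small sketch that only chooses a free chunk; it does not model Box64's actual allocator, and `fit_first` / `fit_best` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* First fit: take the first free chunk that is large enough.
 * Returns its index, or -1 if no chunk fits. */
static ptrdiff_t fit_first(const size_t *free_sizes, size_t n, size_t want) {
    assert(free_sizes != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (free_sizes[i] >= want)
            return (ptrdiff_t)i;
    return -1;
}

/* Best fit: take the smallest free chunk that is still large enough,
 * which leaves the least leftover space behind. */
static ptrdiff_t fit_best(const size_t *free_sizes, size_t n, size_t want) {
    assert(free_sizes != NULL || n == 0);
    ptrdiff_t best = -1;
    for (size_t i = 0; i < n; i++) {
        if (free_sizes[i] < want)
            continue;
        if (best < 0 || free_sizes[i] < free_sizes[(size_t)best])
            best = (ptrdiff_t)i;
    }
    return best;
}
```

With free chunks of 64, 512, 128 and 96 bytes and a 90-byte request, first fit carves up the 512-byte chunk while best fit uses the 96-byte one. That is the kind of fragmentation difference that frequent reclaim/allocate cycles could amplify.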

I'm happy to contribute a complete implementation if this aligns with Box64's goals. I've already built the statistics tracking infrastructure that could serve as the foundation (and a best-fit allocator as well).


Experiment details:

  • Add an atomic usage_count field to the dynablock_t structure
  • Add instrumentation code at the beginning of each compiled block that atomically increments this counter
  • Use a background statistics thread that periodically samples and reports block usage patterns
  • Log output to track block lifecycle and memory consumption over time

How to Reproduce:

Note: This feature is still experimental and currently only runs on Raspberry Pi 5.

  1. Clone my branch: https://github.com/devarajabc/box64/tree/DB_PROF_WITH_MD
  2. Build & run Box64 as usual — it will save the log to box64_block_stats.txt.

devarajabc avatar Nov 06 '25 15:11 devarajabc

The issue here is that to be able to free old/unused blocks, you need to make sure the blocks are not in use anymore. The counter you added is only good for seeing the "hotness" of a block, not whether a block is in use. That would require another counter, atomically incremented when entering the block and atomically decremented when leaving it. You also need to take into account DYNAREC_CALLRET=1, where a CALL opcode does not really exit the block. Also, some specific wrapped functions may never return (like longjmp or exit), while many will return (and so the block should be considered alive).
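A minimal sketch of the two-counter idea, under stated assumptions: `live_block_t` and the `live_*` functions are hypothetical, the enter/leave calls stand in for code the dynarec would emit, and the CALLRET and non-returning wrapped function cases are deliberately not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct live_block {
    atomic_uint_fast64_t hits;   /* hotness: total number of entries */
    atomic_int in_use;           /* liveness: entries minus exits */
} live_block_t;

static void live_init(live_block_t *b) {
    atomic_init(&b->hits, 0);
    atomic_init(&b->in_use, 0);
}

static void live_enter(live_block_t *b) {
    atomic_fetch_add_explicit(&b->hits, 1, memory_order_relaxed);
    atomic_fetch_add_explicit(&b->in_use, 1, memory_order_acquire);
}

static void live_leave(live_block_t *b) {
    int prev = atomic_fetch_sub_explicit(&b->in_use, 1, memory_order_release);
    assert(prev > 0);            /* every leave must match an earlier enter */
    (void)prev;
}

/* Snapshot check only: another thread could still enter right after it
 * returns true, so a real implementation needs extra synchronization
 * before actually freeing the block. */
static bool live_is_reclaimable(live_block_t *b, uint64_t cold_threshold) {
    return atomic_load_explicit(&b->in_use, memory_order_acquire) == 0 &&
           atomic_load_explicit(&b->hits, memory_order_relaxed) <= cold_threshold;
}
```

The hard part described above is making sure `live_leave` runs on every real exit path, which is exactly where CALLRET and functions like longjmp get in the way.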

So, all this can of course be handled, but it adds more complexity. There are scenarios where this would still be interesting (like, for example, running Steam). At a minimum it should be configurable, ideally with a runtime setting like BOX64_DYNAREC_RECYCLE. Without the setting, dynablock generation would not be touched. With the setting, all the instrumentation gets injected at block creation (and DynaCache had better be disabled).
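As a sketch of such an opt-in switch: the variable name BOX64_DYNAREC_RECYCLE is the one proposed above, but the accepted values and the helpers `recycle_parse` / `recycle_enabled` are assumptions for illustration, and this does not go through Box64's own settings handling.

```c
#include <assert.h>
#include <stdlib.h>

/* Interpret a raw setting string: only a strictly positive integer
 * enables recycling; unset, empty or malformed values leave it off. */
static int recycle_parse(const char *value) {
    if (value == NULL || *value == '\0')
        return 0;
    char *end = NULL;
    long v = strtol(value, &end, 10);
    assert(end != NULL);
    if (*end != '\0')
        return 0;                /* trailing garbage such as "1x" */
    return v > 0;
}

/* Read the proposed setting from the environment (off by default). */
static int recycle_enabled(void) {
    return recycle_parse(getenv("BOX64_DYNAREC_RECYCLE"));
}
```

Defaulting to off matches the point above: without the setting, block generation stays exactly as it is today.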

ptitSeb avatar Nov 06 '25 17:11 ptitSeb

Thank you for providing so many details that I wasn’t aware of. I’ll do more experiments and update here.

Yes, just as you suggested, the code cache system should be an optional feature, since this mechanism may not benefit every scenario.

devarajabc avatar Nov 07 '25 08:11 devarajabc

Note that I will probably work on this subject very soon. It will be a first implementation, with limitations, but enough to already see a reduced memory footprint when activated.

ptitSeb avatar Nov 14 '25 11:11 ptitSeb

Great to hear that you’re already working on an initial implementation. I can help on complementary areas — for example, evaluating different cache-replacement policies, validating the heuristics with profiling data, or running targeted benchmarks to measure the impact. Let me know which direction would be most useful. Thanks!

devarajabc avatar Nov 14 '25 12:11 devarajabc

That sounds great.

I will probably just do a simple "remove all the free blocks executed only once" strategy before allocating a new block as a first implementation. This is quite a naive approach, and certainly not the most efficient one.
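A rough sketch of that strategy, assuming a simple linked list of blocks and plain (non-atomic) counters. It is not taken from any Box64 commit; `sweep_block_t`, `sweep_add`, `sweep_cold`, and `sweep_count` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct sweep_block {
    struct sweep_block *next;
    size_t size;      /* bytes of generated code */
    uint64_t hits;    /* how many times the block was executed */
    int in_use;       /* non-zero while some thread is inside the block */
} sweep_block_t;

/* Register a new block at the head of the list. */
static sweep_block_t *sweep_add(sweep_block_t **head, size_t size,
                                uint64_t hits, int in_use) {
    sweep_block_t *b = malloc(sizeof *b);
    assert(b != NULL);
    b->size = size;
    b->hits = hits;
    b->in_use = in_use;
    b->next = *head;
    *head = b;
    return b;
}

/* Meant to be called before allocating a new block: unlink and free every
 * block that is not in use and ran at most once. Returns bytes reclaimed. */
static size_t sweep_cold(sweep_block_t **head) {
    size_t freed = 0;
    sweep_block_t **link = head;
    while (*link != NULL) {
        sweep_block_t *b = *link;
        if (b->in_use == 0 && b->hits <= 1) {
            *link = b->next;     /* unlink, keep scanning from the same slot */
            freed += b->size;
            free(b);
        } else {
            link = &b->next;
        }
    }
    return freed;
}

static size_t sweep_count(const sweep_block_t *head) {
    size_t n = 0;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

A full sweep on every allocation is O(number of blocks), which is one reason this approach is naive; a real version would likely bound the work or sweep only under memory pressure.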

ptitSeb avatar Nov 14 '25 12:11 ptitSeb

Sounds good! I can help by testing this approach and providing profiling data (or with anything else that would be useful).

BTW, I’m currently experimenting with an LFU-based approach in my local branch (the counting part is already implemented, but not the eviction logic yet).

What’s the best way to collaborate on this? Should I keep updating findings in this issue, or would you prefer discussing it elsewhere (e.g., Discord or a separate thread)?

devarajabc avatar Nov 14 '25 13:11 devarajabc

I guess this ticket is fine unless you want more realtime discussion.

ptitSeb avatar Nov 14 '25 13:11 ptitSeb

Thanks! I’d actually like to have more real-time discussions if possible. Do you prefer Discord, Slack, or some other channel for that?

devarajabc avatar Nov 14 '25 13:11 devarajabc

Pushed a first implementation here: https://github.com/ptitSeb/box64/commit/81b080eb5d5dd2a23e5b952d11efea24bfe4c2fa

ptitSeb avatar Nov 15 '25 09:11 ptitSeb