
Add N64 recompiler block hashes & inline 64-bit ops

Open · ghost opened this issue 1 year ago · 5 comments

Reworking the recompiler's JIT block lookups to be more flexible so that we can bake more runtime info into them. This way runtime checks can be avoided in the generated code, making inlining the implementations more attractive since we don't need to generate calls to e.g. kernel mode checks for every 64-bit op. This is expected to eventually increase the recompiled code's performance.

I expect these changes to have a small negative performance impact but haven't measured it.

Commit messages:

We'd like the recompiler to take the execution context, such as kernel mode, into account when compiling blocks. That makes it necessary to identify blocks not just by address but by all the information used at compile time. This is done by computing a 32-bit key and using it as a block's identifier, instead of the last six physical address bits as was done before.

Since we now have 32-bit instead of 6-bit keys, the block() function hashes the keys before mapping them to one of the 64 pool rows. The hash function was chosen somewhat arbitrarily to be better than a simple multiplicative hash and is likely not the best choice for this exact task.
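To make the key-and-row idea concrete, here is a minimal sketch under assumed names and bit assignments; it is not the actual ares code, and the mixing function merely stands in for whatever hash the branch uses.

```cpp
#include <cstdint>

// Assumed context bits baked into the key (illustrative, not ares' layout).
struct JITContext {
  bool kernelMode;
  bool is64bit;
};

// Combine the physical address with compile-time context into a 32-bit key.
uint32_t blockKey(uint32_t physAddress, const JITContext& ctx) {
  uint32_t key = physAddress >> 2;   // instructions are word-aligned
  if (ctx.kernelMode) key ^= 1u << 30;
  if (ctx.is64bit)    key ^= 1u << 31;
  return key;
}

// Mix the key before reducing it to one of the 64 pool rows.
uint32_t poolRow(uint32_t key) {
  key *= 0x9E3779B9u;  // Fibonacci-hashing constant
  key ^= key >> 16;    // fold the high bits into the low bits we keep
  return key & 63;     // 64 rows
}
```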

  • Pass JITContext down to leaf emit functions.
  • Emit inline implementations of basic 64-bit operations.
  • Use block compile-time information to elide kernel mode checks of the now inlined operations (see the sketch below).
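A minimal sketch of that elision, with the emitter and context names assumed (the real ares interfaces differ): when the block's compile-time context already proves a 64-bit op is legal, the mode check becomes a branch inside the emitter rather than code in the block.

```cpp
struct JITContext {
  bool kernelMode;  // assumed per-block context bits
  bool is64bit;
};

// Stand-ins for the real emitter interface.
struct Emitter {
  void callHelper(void (*fn)()) { /* emit a call into a C++ helper */ }
  void addRegs64(int rd, int rs, int rt) { /* emit a native 64-bit add */ }
};

void reservedInstructionException() { /* raised when 64-bit ops are illegal */ }

void emitDADDU(Emitter& e, const JITContext& ctx, int rd, int rs, int rt) {
  if (!ctx.kernelMode && !ctx.is64bit) {
    // 64-bit ops trap in 32-bit user mode, so the guard must stay.
    e.callHelper(reservedInstructionException);
    return;
  }
  // The context proves the op is legal: emit the add inline, with no
  // per-instruction kernel-mode check in the generated code.
  e.addRegs64(rd, rs, rt);
}
```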

ghost · Sep 08 '24

The GDB::server.hasBreakpoints() check is a bit problematic because right now it is updated only when Context::setMode is called, even though breakpoints might be added or removed at other times as well. I suppose it could be polled for every block instead? I considered it part of the execution context, but previously it wasn't taken into account at all when looking up blocks. To me it seems like breakpoints only worked after all Pool instances were flushed.
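A sketch of the polling idea, with everything except GDB::server.hasBreakpoints() assumed: sampling the flag on every lookup folds it into the key, so adding or removing a breakpoint changes the key and misses the stale block instead of executing it.

```cpp
#include <cstdint>

namespace GDB {  // stub standing in for ares' GDB server
  struct Server { bool hasBreakpoints() { return false; } };
  inline Server server;
}

// Fold the live breakpoint flag into the lookup key on every block lookup,
// rather than relying on Context::setMode to refresh a cached copy.
// The bit position and key layout are assumed for illustration.
uint32_t lookupKey(uint32_t physAddress, uint32_t contextBits) {
  uint32_t key = contextBits;  // kernel mode, 64-bit mode, ...
  if (GDB::server.hasBreakpoints()) key |= 1u << 29;
  return key ^ (physAddress >> 2);
}
```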

ghost · Sep 10 '24

Sorry for the delay. I will try to review this in the next few days.

invertego · Oct 10 '24

Noting here that I've transferred to a new GitHub account: @pekkavaa. I'd like to continue the ares JIT performance work at some point, but maybe with a solid benchmarking setup first, so that even small changes can be measured reliably.

Edit: And thanks for the review, invertego!

ghost · Jan 19 '25

Hello again :wave: I rebased the branch and addressed all the review comments. I also squashed the two commits together since it got annoying to move changes between them. Things still seem to work and it passes n64-systemtest. The test ROM output looks different from what I recall, though: [image]

How much more are you planning to do with this before taking it out of draft status?

I'd like to get an idea of the performance impact and then adjust at least the block pool size accordingly. Any ideas on how to benchmark ares are welcome :)

pekkavaa · May 05 '25

I simplified the implementation. The cached CPU state has only the bits enabled that are actually used by the recompiler. I also got rid of the hash and store the tags in their own array.
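A rough sketch of that layout under assumed names, not the branch's actual code: the cached state is masked down to the bits the recompiler reads, and the tags live in their own flat array so a lookup scans tags without touching the block pointers.

```cpp
#include <cstdint>

struct Block;  // compiled block, opaque here

// Keep only the CPU state bits the recompiler actually reads (bit choice
// illustrative), so unrelated state changes cannot alter the tag.
constexpr uint32_t usedStateMask = 0b11;  // e.g. kernel mode + 64-bit mode

struct Pool {
  uint32_t tags[64];    // tags in their own array: checked first
  Block*   blocks[64];  // block pointers touched only on a tag match
};

// The tag combines address bits with the masked state (packing assumed).
uint32_t makeTag(uint32_t address, uint32_t state) {
  return (address >> 2) ^ (state & usedStateMask) << 30;
}

Block* lookup(Pool& pool, uint32_t row, uint32_t expectedTag) {
  return pool.tags[row] == expectedTag ? pool.blocks[row] : nullptr;
}
```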

Performance is slightly degraded in the SM64 starting area: 139 VPS vs 144 VPS on master. I'd like to get it to even or better before merging.

One idea I had was to reference blocks with a 32-bit offset instead of a full Block*. I got it working on top of master (commit), and while it didn't improve the frame timings, perhaps it would alleviate the now-bloated Pool struct.

pekkavaa · May 08 '25

This needs rebasing after the calling convention PR went in.

I'd still like to merge this even if it seems to regress performance, because it does two things that seem to be on the right path: 1) it creates the JITContext, which is mandatory for good performance, and 2) it inlines many ops.

I also notice that there's always a performance downgrade when you start inlining stuff rather than calling external functions. So either we assume all the JITs in other emulators are wrong, or this trend has to tip at some point and somehow turn into an improvement. It might be that we still bloat blocks too much because of per-instruction overhead; fixing that might eventually recover the performance that seems lost now.

rasky · Sep 14 '25

Rebased against master and confirmed that n64-systemtest still passes.

I looked at the implementation again and see that the computed tag doesn't affect the index of a block in the pool. In practice this means the recompiler's cache will get thrashed if the tag bits change constantly. If I recall correctly, I commented out the unused bits so it doesn't happen anymore, but I recommend coming up with a smarter scheme in the future.

pekkavaa · Sep 15 '25

Ok, that might explain why there's a speed pessimization rather than a speedup. The context key should participate with the address in the cache index, so that you basically LRU among blocks whose keys are (context, address).

BTW: another rebase now should make the CI green.

rasky · Sep 16 '25

I updated the logic, see the commit message. Basically there's one more layer of indirection for Blocks, plus a cache: the old Pool objects store the latest instance. Similar to what was discussed on Discord.
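A rough sketch of that scheme with assumed names (see the commit for the real code): the Pool row caches only the most recently used instance per address, while a side table owns every compiled (context, address) variant, so a context switch becomes a cheap re-lookup instead of a recompile.

```cpp
#include <cstdint>
#include <unordered_map>

struct Block;  // compiled block, opaque here

struct PoolRow {
  uint32_t lastContext = 0;      // context of the cached instance
  Block*   lastBlock = nullptr;  // fast path: hit when the context matches
};

// Side table owning all compiled variants, keyed by context:address.
std::unordered_map<uint64_t, Block*> allBlocks;

Block* fetch(PoolRow& row, uint32_t context, uint32_t address) {
  if (row.lastBlock && row.lastContext == context) return row.lastBlock;
  auto it = allBlocks.find((uint64_t)context << 32 | address);
  if (it == allBlocks.end()) return nullptr;  // caller compiles a new block
  row.lastContext = context;                  // refresh the per-address cache
  row.lastBlock   = it->second;
  return it->second;
}
```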

Performance is now almost on par with master. I'm happy with it.

pekkavaa · Sep 20 '25

Also I took some "traces" of the block behavior. Here are some quick numbers (data in the same repo): https://github.com/pekkavaa/2025-09-20-recompiler-block-tracing/blob/master/docs/1_trace_stats.md

For example, in the SM64 castle grounds there are 33.21k compiled-block lookups per frame but only 4093 unique addresses, and 36% of block cache hits go to the same address (an idle loop?). So the keys are sparse and accesses are unevenly distributed. This validates the old design (sparse array, very fast lookups), so I tried to stay close to it even with the state bits added.

One optimization idea: the PoolRow struct could be shrunk to 64 bits if the Block* pointer is replaced with a u32 offset.
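A sketch of that shrink with the field layout assumed: replacing the 8-byte pointer with a 32-bit offset into the block allocator's arena halves the row size on 64-bit hosts.

```cpp
#include <cstdint>

struct Block;

// Assumed current shape: pointer alignment pads the row to 16 bytes.
struct PoolRow {
  uint32_t tag;
  Block*   block;
};
static_assert(sizeof(PoolRow) == 16);

// Packed shape: 8 bytes per row; the Block* is recovered by adding the
// offset to the arena's base address.
struct PoolRowPacked {
  uint32_t tag;
  uint32_t blockOffset;
};
static_assert(sizeof(PoolRowPacked) == 8);

Block* resolve(uint8_t* arenaBase, const PoolRowPacked& row) {
  return reinterpret_cast<Block*>(arenaBase + row.blockOffset);
}
```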

pekkavaa · Sep 20 '25