Cached Interpreter 2.0
It now supports variable-sized data payloads and memory range freeing. It's a little faster, too.
This PR conflicts with https://github.com/dolphin-emu/dolphin/pull/12714, and I would prefer that this PR be merged first. The virtual member functions added to JitBase for the JITWidget refresh are incompatible with this redesign. I already have updated functions in a separate branch that I would rather apply to the JITWidget refresh PR than to this PR.
I tried to run this on my M1 MacBook Pro but it crashes as soon as I start a game. I tested this with Mario Kart Wii and the Wii Menu. Crash report: https://gist.github.com/Simonx22/ca3863d29a2f569fba1bbb65da6044f4
Edit: oops, I didn't see that you just pushed something. I'll re-try with the latest changes shortly.
The latest commit was to avoid some baggage from sub-classing Common::CodeBlock. I see that the function I now avoid, Common::AllocateExecutableMemory, does have a special case for Apple platforms. Hopefully all is fixed?
Yeah, it works now (and it's ~2 FPS faster than latest dev)
Latest dev: (FPS screenshot)
This PR: (FPS screenshot)
I simplified the Common::CodeBlock template for non-executable memory by using Common::AllocateMemoryPages rather than std::malloc. Now there don't need to be special cases for FreeCodeSpace, WriteProtect, and UnWriteProtect. @JosJuice informed me on the Discord about the reason for the prior breakage, saying "Using executable memory without Common::ScopedJITPageWriteAndNoExecute would definitely break things on Apple Silicon." Knowing that, I believe this change should be fine.
In my non-scientific tests, this PR also gives me a 1.5-2.0 FPS increase on a Retroid Pocket 2S. Dolphin is getting so ridiculously optimized that more and more games are becoming playable on potato ARM hardware. Amazing work, guys!
Can we get this merged? 🙏
Why do you want this merged? The speed gains are minimal and don't affect the CPU JIT, it adds complexity, and the author themself has admitted that they think this approach is ultimately a dead-end that cannot be improved further.
I feel that is a slight misrepresentation of what I have said. I have been doing many experiments for what I wanted to do next with the Cached Interpreter, but at the same time Sam Belliveau's tests taught me that the primary bottleneck of the Cached Interpreter appears to be the number of function pointers emitted. That is to say, the unpredictable branch every few tens of host instructions is cratering the performance. So my takeaway from this fun distraction project (and roughly what I said in the Discord chat) is that the concept of the Cached Interpreter itself is flawed.
I wanted to create optimized callbacks for the most common instructions that evaluated known info at recompile-time, such as r0 as a source GPR for addi and load/store instructions. However, I found this added complexity (heavily templated functions and structs, separation from being a simple wrapper around Interpreter functions) that neither I nor anyone else really wants, so the small performance gains I achieved are probably not worth it. The biggest issue I had to wrestle with was how complex branch instructions were becoming in my attempts to optimize them while supporting things like optional Branch Watch, optional Software JIT Profiling, optional block linking, the LK bit, and conditional branches. I couldn't even finish it.
All this being said, the spot my Cached Interpreter 2.0 is at without all those aforementioned planned changes (this PR as-is) is a notable improvement over the Cached Interpreter currently in Dolphin; I would estimate it is ~25-30% faster. My changes to the Cached Interpreter's emission and dispatch of callbacks also make it remarkably similar to the design of the other JITs. While it is objectively more complex, I also believe this homogenizes the design of the JITs in a positive way. Having fewer design differences between JITs could make the Cached Interpreter easier to comprehend, not to mention the improved capabilities now that it can free memory and have varied payload sizes.
So I don't plan on pursuing any follow-up development of the Cached Interpreter, other than to add Software JIT Profiling support. I have asked @JosJuice to review the PR whenever is most convenient so it can be finalized and merged. As an addendum, it is my understanding that @Sam-Belliveau is okay with my take on an improved Cached Interpreter getting merged over his so long as his experiments were insightful, which they were. The final commit of my PR, combining multiple callbacks into one, is the essence of Samb's experiments, minus some callback combining I felt was extraneous.
I am thinking it may be worth stepping back to passing Interpreter& to the callbacks rather than PowerPC::PPCState&. I went with PowerPC::PPCState& because it was the forward-thinking choice for when optimized callbacks would be implemented, but I'm no longer pursuing that. If nothing else, it should be more optimal for memory usage.
The aforementioned change seems to have caused a very small performance regression, so I won't be going forward with it.
Needs a rebase.
I'm a fan of the new PowerPC::CheckAndHandleBreakpoints function! It reminds me of what I tried to do with https://github.com/dolphin-emu/dolphin/pull/12720 but gave up on.
Nice stuff, those of us working on non-JIT platforms are following these small improvements.