R5900: Improve the EE cache performance
Description of Changes
Entry prefetching
I shrank a TLB entry from 48 bytes to 16 bytes. Theoretically, on a lookup we would prefetch up to three other TLB entries for free (since four 16-byte entries fit in a 64-byte cache line), which is nice because the hottest code looks up entries linearly.
On its own, this actually slowed things down: I made the mistake of assuming memory was a bottleneck here without checking. We weren't memory bound, and the precomputed values that were bloating the structure were actually beneficial.
Combined with the optimizations below, however, it turned out to be a net improvement, so it is kept in this PR.
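As a rough illustration of the size change (the field names below are hypothetical, not PCSX2's actual TLB layout), a 16-byte entry packs exactly four entries per 64-byte cache line, which is what makes the linear-scan prefetching attractive:

```cpp
#include <cstdint>

// Hypothetical compact TLB entry. Dropping the precomputed fields that
// bloated the original 48-byte struct leaves four 4-byte fields, so four
// entries share one 64-byte cache line and a linear scan touches fewer lines.
struct CompactTlbEntry
{
    uint32_t vpn;   // virtual page number
    uint32_t pfn;   // physical frame number
    uint32_t mask;  // page mask
    uint32_t flags; // valid/cached bits

    bool isCached() const { return flags & 1; }
};

static_assert(sizeof(CompactTlbEntry) == 16,
              "four entries per 64-byte cache line");
```

The trade-off described above is exactly the fields removed here: the precomputed values cost cache footprint but save recomputation on every hit, and in this workload the recomputation turned out to be the more expensive side.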
Common Subexpression Elimination
From:

```cpp
for (int i = 0; i < 48; i++)
{
    if (entry_list[i].isCached())
    {
        // do work with entry_list[i]
    }
}
```
Into:

```cpp
for (int i = 0; i < 48; i++)
{
    const tlbentry& entry = entry_list[i];
    if (entry.isCached())
    {
        // do work with entry
    }
}
```
Because of how hot this code is, I wanted to help out the compiler and processor. Instead of repeatedly indexing into the array on every entry access, we create a reference to the entry at the top of the loop. This is a common pattern, so I was hoping to hit a compiler heuristic, or at least access memory in a more cache-friendly way. It turns out it does: I saw a general speed increase, and we were less memory bound by around 0.6%.
Only Check Valid Entries
Instead of looking through every cache entry to see if a specific address should be cached, we can instead build a separate list of "cached entries" and only look through those. This was the most significant optimization. It reduced the number of branches and increased branch prediction accuracy.
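A minimal sketch of that idea, with illustrative identifiers (not PCSX2's actual structures): alongside the full entry array we keep a small index list of entries known to be cached, and lookups scan only that list, so the loop trip count and branch count track the number of live entries rather than the full 48:

```cpp
#include <cstdint>

// Hypothetical entry: an address range plus a cached flag.
struct TlbEntry
{
    uint32_t start = 0, end = 0;
    bool cached = false;

    bool contains(uint32_t addr) const
    {
        return cached && addr >= start && addr < end;
    }
};

struct TlbCache
{
    TlbEntry entries[48];
    int cached_indices[48]; // indices of entries with cached == true
    int cached_count = 0;

    void markCached(int i, uint32_t start, uint32_t end)
    {
        entries[i] = {start, end, true};
        cached_indices[cached_count++] = i;
    }

    // Scan only the cached entries instead of all 48.
    int lookup(uint32_t addr) const
    {
        for (int n = 0; n < cached_count; n++)
        {
            int i = cached_indices[n];
            if (entries[i].contains(addr))
                return i;
        }
        return -1; // not cached
    }
};
```

With only a handful of cached entries, the hot lookup loop executes a few well-predicted iterations instead of 48, which matches the branch-count and branch-prediction improvement described above.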
Overall I've seen a performance increase of around 20%.
Rationale behind Changes
I wanted to get more familiar with VTune profiling, and the EE cache is also very slow.
Suggested Testing Steps
Test games that require the EE cache with this PR (ensure any patches we have for the game are disabled). Run with the EE cache enabled and compare the speed to master.