Perf: <<ByteMark to 80% native>> Roadmap
Bytemark performance tracking: Graph
Based on the systematic profiling/perf work of the past 2 weeks, plus skmp/optihacks-1, skmp/optihacks-2 and skmp/optihacks-3, and some analysis of ByteMark today, I think the following is a good game plan for this goal:
Cleanup changes in skmp/optihacks-3.
- Compat limits: Optional PF disable, Optional InvalidateFlags for ABI crossings.
Lightweight guest branching & dispatch
Targets: Switch tables, indirect functions
Reduce Lookup overhead
The current paged lookup + alias check + validation check is clearly not optimal.
I suggest a 2-layer approach, with the first layer being a cache and the second a tree / paged tree.
For the first layer, I'd use a 24-bit lookup + alias check with a lazily allocated LUT.
Basic Structure
typedef struct { uint64_t guest; uint64_t host; } entry_t;
entry_t *LUT; // possibly mmap + segfault backed for lazy allocation
Indirect Code Lookup
_fast_lookup:
and x0, pc, ((1 << 24) - 1)
add entry_ptr, lookup_base, x0 * 16
ldp x0, x1, [entry_ptr]
cmp x0, pc
b.ne _full_lookup<pc_reg> // we need a detwiddling table here
br x1
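For clarity, here's a C-level sketch of what the fast path above boils down to, with the second layer as the miss fallback. FullLookup is a placeholder name for the layer-2 tree / paged tree, not an existing FEX function.

#include <cstdint>

struct entry_t { uint64_t guest; uint64_t host; };

uint64_t FullLookup(uint64_t GuestPC); // layer 2: tree / paged tree (placeholder)

// Sketch of the lookup logic the emitted fast path implements.
uint64_t Lookup(entry_t *LUT, uint64_t GuestPC) {
  entry_t &Entry = LUT[GuestPC & ((1u << 24) - 1)]; // layer 1: 24-bit index
  if (Entry.guest == GuestPC) {                     // alias check
    return Entry.host;
  }
  uint64_t Host = FullLookup(GuestPC);              // miss: fall back to layer 2
  Entry = {GuestPC, Host};                          // refill the cache entry
  return Host;
}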
Block linking
Blocks that are statically mapped should link to each other. I propose to use indirect branches to implement this, with the branch vectors being allocated near the block.
Basic Structure
struct BlockInfo { /* ... */ uintptr_t *StaticBranchHostPtr; uint64_t StaticBranchGuest; };
// during code emission, emit a _non_mapped_handler: .dq <default_handler addr> slot near the block and initialize StaticBranchHostPtr to point at it
Block ending for blocks that exit with CALL_DIRECT, JUMP_DIRECT
ldr x0, _non_mapped_handler // literal load of the branch slot contents (initially <default_handler>)
blr x0 // BLR is important here: the handler doesn't return, and LR identifies the exit site
Block ending for blocks that exit with RET, CALL_INDIRECT, JUMP_INDIRECT
and x0, pc, ((1 << 24) - 1)
add entry_ptr, lookup_base, x0 * 16
ldp x0, x1, [entry_ptr]
cmp x0, pc
b.ne _full_lookup<pc_reg> // we need a detwiddling table here
br x1
PC-recovery
To reduce overhead, no validation is done on the DIRECT forms, so the default handler needs to cover that case. We can recover the block from the return address (that's why BLR is needed), and then link the block.
Block link metadata
We need to keep lists of which blocks link to which for block invalidation (see the sketch below).
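A rough C++ sketch of how the default handler and link bookkeeping could fit together; FindBlockByExitSite, GetOrCompileBlock and BlockLinks are hypothetical names used for illustration, not existing FEX APIs.

#include <cstdint>
#include <map>
#include <vector>

struct BlockInfo { uintptr_t *StaticBranchHostPtr; uint64_t StaticBranchGuest; };

// Reverse links: guest block -> branch slots that point into it, so invalidating
// that guest block can reset the slots back to the default handler.
std::map<uint64_t, std::vector<uintptr_t*>> BlockLinks;

BlockInfo *FindBlockByExitSite(uintptr_t ReturnAddress); // placeholder
uintptr_t GetOrCompileBlock(uint64_t GuestRIP);          // placeholder

extern "C" uintptr_t DefaultLinkHandler(uintptr_t ReturnAddress) {
  // BLR left the address just past the exit stub in LR, which identifies the
  // exiting block (the PC-recovery step above).
  BlockInfo *Block = FindBlockByExitSite(ReturnAddress);

  // Resolve or compile the statically known guest target.
  uintptr_t HostTarget = GetOrCompileBlock(Block->StaticBranchGuest);

  // Patch the branch slot so the next execution takes the direct path.
  *Block->StaticBranchHostPtr = HostTarget;

  // Remember the link so invalidating the target can undo the patch.
  BlockLinks[Block->StaticBranchGuest].push_back(Block->StaticBranchHostPtr);

  return HostTarget; // dispatcher then jumps here
}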
Static Register Allocation
Allocate 8 or 16 GPRs statically, and do RA for SSA values on the remaining registers. Make sure to support "lifetime sharing", where an SSA value shares the host register of a guest register for as long as that mapping is valid, and generate movs as needed.
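To illustrate the lifetime-sharing rule, a minimal sketch with made-up types, not FEX's actual RA data structures:

#include <cstdint>

// An SSA value that is just a LoadRegister(guest) can live in the guest's
// statically mapped host register instead of a temporary, but only while
// neither side is redefined.
struct SSAValue {
  int GuestSource;   // guest GPR the value was loaded from, or -1
  uint32_t LastUse;  // last IR node that reads this value
};

bool CanShareStaticReg(const SSAValue &V, uint32_t NextGuestWriteNode) {
  // If the guest register is written again before the SSA value's last use,
  // the RA has to emit a mov into a fresh temporary instead of sharing.
  return V.GuestSource >= 0 && NextGuestWriteNode > V.LastUse;
}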
Multiblock
Multiple Entry Points
Right now, only the main entry point is exported to the cache. Big blocks that call other blocks should export secondary entry points at the expected return points, to avoid compiling multiple partial copies of the same function.
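Something along these lines, with hypothetical names (BlockCache, CompiledRegion) just to illustrate the idea:

#include <cstdint>
#include <unordered_map>
#include <vector>

struct EntryPoint { uint64_t GuestAddress; uintptr_t HostAddress; };

struct CompiledRegion {
  EntryPoint Main;                        // main entry of the multiblock region
  std::vector<EntryPoint> ReturnTargets;  // guest addresses where inner calls return
};

struct BlockCache {
  std::unordered_map<uint64_t, uintptr_t> Map;
  void Insert(uint64_t Guest, uintptr_t Host) { Map[Guest] = Host; }
};

// Export the main entry plus one secondary entry per expected return point, so
// returning into the middle of the region hits the cache instead of triggering
// another partial compilation of the same function.
void ExportEntryPoints(BlockCache &Cache, const CompiledRegion &Region) {
  Cache.Insert(Region.Main.GuestAddress, Region.Main.HostAddress);
  for (const auto &Ret : Region.ReturnTargets) {
    Cache.Insert(Ret.GuestAddress, Ret.HostAddress);
  }
}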
PHI nodes (possibly not needed to meet goals)
We need the RA to support PHI nodes
MB-DCLSE (possibly not needed to meet goals)
We need Dead Context Load Store Elim to generate PHI nodes
Address important pathological code gen
Shuffles are one example, and there might be a few more important cases for ByteMark.
@Sonicadvance1 @phire thoughts?
Did a first pass on the dispatch optimizations, see skmp/reduce-dispatch-overhead and skmp/optihacks-4.
Overall, ByteMark perf dropped, though I'm not 100% certain of the numbers yet. The implementation is mostly complete but not optimal, so it generates slightly larger code per OP_EXITFUNCTION.
Further optimizations on that branch:
- Helper function pools every N bytes to be able to use direct jumps
- Dispatch data pools so that structures are more cache friendly
- Patch the code to direct links whenever possible (sketch below).
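On the direct-link point: whether an exit can be patched to a direct branch depends on ARM64's B/BL reach of +/-128 MB, which is also why helper pools every N bytes help. A minimal sketch of that check, with illustrative names:

#include <cstdint>

// Decide whether a branch site can be patched to a direct B/BL, or whether it
// has to go through a nearby helper pool / indirect branch instead.
bool CanUseDirectBranch(uintptr_t BranchSite, uintptr_t Target) {
  constexpr int64_t Reach = 128 * 1024 * 1024; // B/BL immediate range on ARM64
  int64_t Distance = static_cast<int64_t>(Target) - static_cast<int64_t>(BranchSite);
  return Distance >= -Reach && Distance < Reach;
}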
UT2004 and FTL show a ~20% perf win, especially in more complex scenes.
Update: looks like there is a general performance regression, even on native ByteMark. OS issue?
Implemented multiple entry points in skmp/multiple-entry-points (on top of optihacks-4).
Perf results are mixed, with ByteMark slightly slower, FTL gaining a few FPS at complex points, and Metro noticeably slower.
This is likely because the (far) larger code gen makes the L1i issues worse. We'll likely need to hide this behind an option.
The emfloat (FP EMULATION) benchmark hits a pathological case of cmovcc/setcc; fixing it tripled its perf.
Added block sorting in the frontend to avoid out-of-order jumps in the backend.
FEX
NUMERIC SORT : 581.03 : 14.90 : 4.89
STRING SORT : 140.48 : 62.77 : 9.72
BITFIELD : 5.1261e+08 : 87.93 : 18.37
FP EMULATION : 173.49 : 83.25 : 19.21
FOURIER : 15360 : 17.47 : 9.81
ASSIGNMENT : 26.391 : 100.42 : 26.05
IDEA : 2837.5 : 43.40 : 12.89
HUFFMAN : 1348.7 : 37.40 : 11.94
NEURAL NET : 17.319 : 27.82 : 11.70
LU DECOMPOSITION : 554.5 : 28.73 : 20.74
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 52.613
FLOATING-POINT INDEX: 24.078
qemu
NUMERIC SORT : 498.19 : 12.78 : 4.20
STRING SORT : 128.88 : 57.59 : 8.91
BITFIELD : 2.9591e+08 : 50.76 : 10.60
FP EMULATION : 193.78 : 92.99 : 21.46
FOURIER : 3620 : 4.12 : 2.31
ASSIGNMENT : 16.914 : 64.36 : 16.69
IDEA : 2253.9 : 34.47 : 10.24
HUFFMAN : 1074.7 : 29.80 : 9.52
NEURAL NET : 3.8008 : 6.11 : 2.57
LU DECOMPOSITION : 128.52 : 6.66 : 4.81
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 41.976
FLOATING-POINT INDEX: 5.511
native
NUMERIC SORT : 1555.8 : 39.90 : 13.10
STRING SORT : 460.91 : 205.95 : 31.88
BITFIELD : 5.7356e+08 : 98.39 : 20.55
FP EMULATION : 660.29 : 316.84 : 73.11
FOURIER : 89283 : 101.54 : 57.03
ASSIGNMENT : 59.728 : 227.28 : 58.95
IDEA : 10575 : 161.74 : 48.02
HUFFMAN : 3933.6 : 109.08 : 34.83
NEURAL NET : 79.738 : 128.09 : 53.88
LU DECOMPOSITION : 2179.2 : 112.89 : 81.52
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 139.480
FLOATING-POINT INDEX: 113.656
Add --enable-unsafe-pass=<pass name> to allow unsafe optimization passes to be enabled from the command line.
Allow multiple of these (look at how -E is implemented) and pass them to FEXCore for the PassManager to pick up and conditionally enable.
This will be important for per-game optimization passes; a rough sketch of the argument handling is below.
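A minimal sketch of collecting repeated --enable-unsafe-pass arguments into a set the PassManager could query; the types and names here are placeholders, not FEXCore's actual interfaces:

#include <string>
#include <unordered_set>
#include <vector>

struct UnsafePassConfig {
  std::unordered_set<std::string> Enabled;
  bool IsEnabled(const std::string &Name) const { return Enabled.count(Name) != 0; }
};

// Scan the raw argument list and collect every --enable-unsafe-pass=<name>.
UnsafePassConfig ParseUnsafePasses(const std::vector<std::string> &Args) {
  const std::string Prefix = "--enable-unsafe-pass=";
  UnsafePassConfig Config;
  for (const auto &Arg : Args) {
    if (Arg.rfind(Prefix, 0) == 0) {
      Config.Enabled.insert(Arg.substr(Prefix.size()));
    }
  }
  return Config;
}

// The PassManager side would then gate each unsafe pass on Config.IsEnabled("<pass name>").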
Merges from skmp/optihacks-4
Merged
Done in IR improvements (#484)
- [x] 275c390 IR: Fix writer to handle more RA classes
Done in profile improvements (#485)
- [x] be23161 JitSymbols: Append HostAddr to name
Done in #464
- [x] b03c97c DCE: Needs reverse iteration to be effective
Done in #480
- [x] 49d6968 Arm64: Avoid branches to next block of possible
- [x] d21bb5d Useless Branch Elimination: Add OP_JUMP, Improve OP_CONDJUMP
- [x] f42c06a Frontend: Sort blocks for better branching
Done in IR/Select (#488)
- [x] 8d5beb9 Select with size
- [x] 904f358 Select: Add Size for 32/64 bit data selection
Done in Frontend/setcc/cmov (#487, #491)
- [x] ef13285 OpDisp: More efficient handling of setcc and cmov
Done in #481
- [x] f93e69c Dynamic Shifts: use Selects instead of branches for flags
Done in backend/arm64 (#483)
- [x] 1d1054d MUL: remove zext/sexts from 32 bit form
Todo in Select/Inline Consts (#488)
- [x] 7b7aa3f SELECT/arm: inline const cset, bfe & and elim
- [x] 5c9b8d1 Select: Imm for second compare argument
Todo in RCLSE improvements (#482)
- [x] 2fdb20c RCLSE: Also optimize when Access is ACCESS_PARTIAL_READ
- [x] de6d116 Some lse
- [x] 0ddaf47 More LSE fixes
Todo in enhanced CondJump branching (#490)
- [x] 702c4d3 Almost flawless change in JumpCond
- [x] 8f29da4 CondJump: Now with cmp/b.cc support
- [x] ac19576 CondJump: Only optimize for jit
- [x] 6685a28 CondJump/Arm64: Use cbnz/cbz if faster
- [x] 4e12d7b CondJump: Only optimise select if no cond is already used
Todo in #479
- [x] d5b5d5a OptPasses: Add DeadGPRStore
Todo in ABIOpts - invalidate-flags (#486)
- [x] 6261343 IR: InvalidateFlags + ABI heuristics
Todo in ABIOpts - skip-pf (#486)
- [x] 955d6fd OpDisp: Disable PF generation for ALU
Todo in Complex Load/Store Address Generation (#473)
- [x] 0cf9de3 Load/Store mem with two args
- [x] 7d1d1f2 Add arm64 backend, disable for TSO
- [x] 808cf8d Fix unified mem, Add interpreter
- [x] 184610a LoadMem/StoreMem: Add OffsetType, OffsetScale
- [x] dad065c ConstProp: Do inline constants in a separate loop
- [x] 84edced TSO: Only emit if enabled
- [x] 637b2f9 IR/Arm64: Inline Constants for MemoryOps
- [x] d67c2a3 ConstProp: Fix inline consts for load/storemem
Todo in Frontend/CC bypass (#489)
- [x] 13b7d24 Partial cmp forwarding
- [x] b670fa9 More cmp ops
- [x] 37079e0 Improved cmp/jcc forwarding
- [x] 0343f0d Generalize cmpOp to flagsOp, add test variants
- [x] a975bae jcc/cmp forwards: add support for 1,2 bytes
- [x] 3f26e7f opdisp: setcc, cmov fastpaths
Blocked on RA fixes
Todo in IR Constant Pooling (#339)
- [x] 10c9b74 ir-constant-pooling
Pending RFC feedback
Todo in BlockCache / L1C (#496)
- [x] 4b76e48 L1 cache for C++ and x86 jit
- [x] 47bd851 Dispatch on block ends, jmps instead of calls
- [x] 2951758 partial arm64 support
- [x] e3d8589 arm64: working ddisp, l1c
- [x] 9e5ff94 arm64: Remove nop sp movs
Todo in Linking (#497)
- [x] 3dd95dd ExitFunction: Now takes exit address. PC isn't stored to context until it is needed
- [x] 887c24b OP_EXIT: Inline Consts
- [x] 2d4aed6 Basic linking for const ExitFunctions. Doesn't invalidate
TODO
With first SRA impl (skmp/optihacks-5)
--------------------:------------------:-------------:------------
NUMERIC SORT : 793.81 : 20.36 : 6.69
STRING SORT : 168.68 : 75.37 : 11.67
BITFIELD : 4.6885e+08 : 80.42 : 16.80
FP EMULATION : 217.03 : 104.14 : 24.03
FOURIER : 16576 : 18.85 : 10.59
ASSIGNMENT : 31.49 : 119.83 : 31.08
IDEA : 3568.4 : 54.58 : 16.20
HUFFMAN : 1681.4 : 46.62 : 14.89
NEURAL NET : 19.84 : 31.87 : 13.41
LU DECOMPOSITION : 555.38 : 28.77 : 20.78
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 62.953
FLOATING-POINT INDEX: 25.856
SRA + some mov elim + properly cooled laptop
--------------------:------------------:-------------:------------
NUMERIC SORT : 1107.1 : 28.39 : 9.32
STRING SORT : 191.97 : 85.78 : 13.28
BITFIELD : 5.8174e+08 : 99.79 : 20.84
FP EMULATION : 281.38 : 135.02 : 31.16
FOURIER : 18921 : 21.52 : 12.09
ASSIGNMENT : 43.512 : 165.57 : 42.95
IDEA : 3967.7 : 60.68 : 18.02
HUFFMAN : 1810.1 : 50.20 : 16.03
NEURAL NET : 25.056 : 40.25 : 16.93
LU DECOMPOSITION : 673.78 : 34.91 : 25.20
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 77.339
FLOATING-POINT INDEX: 31.151
SRA + full width mov elim + cooled laptop
--------------------:------------------:-------------:------------
NUMERIC SORT : 1205.9 : 30.93 : 10.16
STRING SORT : 208.36 : 93.10 : 14.41
BITFIELD : 5.8431e+08 : 100.23 : 20.94
FP EMULATION : 300.39 : 144.14 : 33.26
FOURIER : 19606 : 22.30 : 12.52
ASSIGNMENT : 45.535 : 173.27 : 44.94
IDEA : 4304.7 : 65.84 : 19.55
HUFFMAN : 1947.8 : 54.01 : 17.25
NEURAL NET : 25.579 : 41.09 : 17.28
LU DECOMPOSITION : 671.91 : 34.81 : 25.14
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 82.326
FLOATING-POINT INDEX: 31.711
libpng decode mainloop translation example -- codegen is starting to look quite optimal in some cases
str x4, [x24, #48]
mov x4, #0x3f51 // #16209
ldr x5, [x28, #88]
str w4, [x5, #8]
b 0xffefde33a2e8
ldur w4, [x26, #-60]
sub x21, x27, x4
b 0xffefde33a9a8
mov x27, x20
mov x21, x25
ldrb w4, [x21]
ldr w5, [x28, #96]
sub w5, w5, #0x3
str x5, [x28, #96]
add x25, x21, #0x3
strb w4, [x27]
ldrb w4, [x21, #1]
strb w4, [x27, #1]
ldrb w4, [x21, #2]
str x4, [x28, #104]
add x20, x27, #0x3
sturb w4, [x20, #-1]
cmp w5, #0x2
b.hi 0xffefde33a9a0 // b.pmore
ldr w4, [x28, #96]
cbz w4, 0xffefde33aa70
ldrb w20, [x21, #3]
ldr w4, [x28, #96]
strb w20, [x27, #3]
cmp w4, #0x2
b.ne 0xffefde33ae64 // b.any
ldrb w21, [x21, #4]
With experimental SRA16 (uses 6 caller-saved regs)
--------------------:------------------:-------------:------------
NUMERIC SORT : 1342.4 : 34.43 : 11.31
STRING SORT : 205.16 : 91.67 : 14.19
BITFIELD : 5.8536e+08 : 100.41 : 20.97
FP EMULATION : 289.51 : 138.92 : 32.06
FOURIER : 18247 : 20.75 : 11.66
ASSIGNMENT : 55.266 : 210.30 : 54.55
IDEA : 4801 : 73.43 : 21.80
HUFFMAN : 3047.4 : 84.51 : 26.99
NEURAL NET : 25.602 : 41.13 : 17.30
LU DECOMPOSITION : 674.25 : 34.93 : 25.22
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 92.386
FLOATING-POINT INDEX: 31.006
The slight drops in some cases are probably because of register pressure with only 9 temps available. I'll investigate further after getting it to pass FTL + UT2004.
With SRA16+16, some frontend improvements
--------------------:------------------:-------------:------------
NUMERIC SORT : 1387.8 : 35.59 : 11.69
STRING SORT : 264.09 : 118.00 : 18.26
BITFIELD : 5.8385e+08 : 100.15 : 20.92
FP EMULATION : 292.71 : 140.45 : 32.41
FOURIER : 32888 : 37.40 : 21.01
ASSIGNMENT : 59.544 : 226.58 : 58.77
IDEA : 4829 : 73.86 : 21.93
HUFFMAN : 3054.2 : 84.69 : 27.05
NEURAL NET : 30.42 : 48.87 : 20.56
LU DECOMPOSITION : 1044.3 : 54.10 : 39.07
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 97.495
FLOATING-POINT INDEX: 46.241
With (skmp/optihacks-6)
- Useless masking + (unstable) relaxed aliasing
- VMOV/VINS backend ops
- (some) VMOV elimination
- Better SRA-prewrite tracking
--------------------:------------------:-------------:------------
NUMERIC SORT : 1359.2 : 34.86 : 11.45
STRING SORT : 305.83 : 136.65 : 21.15
BITFIELD : 5.8103e+08 : 99.67 : 20.82
FP EMULATION : 304.13 : 145.93 : 33.67
FOURIER : 37772 : 42.96 : 24.13
ASSIGNMENT : 57.844 : 220.11 : 57.09
IDEA : 5539.5 : 84.72 : 25.16
HUFFMAN : 3324.6 : 92.19 : 29.44
NEURAL NET : 53.144 : 85.37 : 35.91
LU DECOMPOSITION : 1978.6 : 102.50 : 74.01
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 102.529
FLOATING-POINT INDEX: 72.167
With (skmp/optihacks-6)
- BFE elim
- Flags Load/Store optimizations
- fcmp fastpaths
- OP_FCMP optimization
--------------------:------------------:-------------:------------
NUMERIC SORT : 1407.5 : 36.10 : 11.85
STRING SORT : 358.54 : 160.20 : 24.80
BITFIELD : 5.8023e+08 : 99.53 : 20.79
FP EMULATION : 304.38 : 146.05 : 33.70
FOURIER : 42118 : 47.90 : 26.90
ASSIGNMENT : 58.167 : 221.34 : 57.41
IDEA : 5654.9 : 86.49 : 25.68
HUFFMAN : 3339.4 : 92.60 : 29.57
NEURAL NET : 58.697 : 94.29 : 39.66
LU DECOMPOSITION : 2153 : 111.54 : 80.54
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 105.863
FLOATING-POINT INDEX: 79.565
Merges from skmp/optihacks-6
2nd merge wave ~
Merged
Constprop
Mul to Lshl (#553)
- [x] (57e7a997) ConstProp: Mul x,#powof2 -> Lshl x,clz(#powof2)
CMT around Select (#554)
- [x] (2dd38128) CMT: Move unary ops before selects, if both True & False values are consts
FCMP optimization (#556)
- [x] (df624b07) ConstProp: FCMP Optimization
- [x] (acf426fb) ConstProp: Fix checks on FCMP
Memop imm pooling (#557)
- [x] (958dd617) ConstProp: LDR imm pooling
LSE (#544)
- [x] (8b175880) LSE: Also optimize away partial writes
Improve SS/SD frontend (#543)
- [x] (3b6bba64) OpDisp: Avoid some pointless vinserts
- [x] (d155c08f) x86tables: Mark some SD sse opcodes as 64-bit to reduce overhead
Improve REP frontend (#542)
- [x] (d7ec1ea8) OpDisp: Read DF outside REP loops as it won't change
backend opt (#551)
- [x] (e9bc02ac) arm64: Optimize VIns* to only use temp reg if it must
- [x] (ebb34e1f) arm64: Avoid needless movs in OP_VMOV
Dead FPR Store Elim (#545)
- [x] (6d8913ae) IR: Add a rather ieffective DeadFPRStoreElim
- [x] (b9653006) IR: DeadFPRStoreElim
IR changes
Sext as a special case of Sbfe (#546)
- [x] (9168a1a4) IR: Replace Sext with Sbfe
Move SetFlag logic to IR from backends (#547)
- [x] (63e05cdf) IR: Move BFEs to SetFlag, remove from OP_STOREFLAG
FCMP: Move complexity to backend where it is easier to handle it (#548)
- [x] (f98cd0a3) IR: Change FCMP to return x86-style flags
Already merged to master
- [x] (ce2dda80) OpDisp: Flag calculation bypass
- [x] (74254a1a) IR: Load/Store mem with two args
- [x] (2ff55a04) Add arm64 backend, disable for TSO
- [x] (f0cedab3) Fix unified mem, Add interpreter
- [x] (9fdd2a5b) LoadMem/StoreMem: Add OffsetType, OffsetScale
- [x] (552f5db0) TSO: Only emit if enabled
- [x] (703a025a) IR/Arm64: Inline Constants for MemoryOps
- [x] (b82c8b9e) ConstProp: Fix inline consts for load/storemem
- [x] (0e329f4e) Load/StoreMem: Cleanup Complex address gen
- [x] (d7191891) Load/Store: Use ->Addr and ->Offset in Interpreter
- [x] (f3817991) IR: Swith MemoryOffsetType to a type from uint8_t
- [x] (c5316b1f) IR Loader: Add support for MemOffsetType, Fix %Invalid
- [x] (1073d405) IR Tests: Update LoadMem to use extended syntax
- [x] (4ec41088) Typofix: UnwarpNode -> UnwrapNode
- [x] (92eb1405) OpDisp: Rename TSO helper to _Load/_StoreAutoTSO
- [x] (e228295f) Jit: Remove conditional TSO on backends
Dead flags (#550)
- [x] (d90930aa) DFSE: Use uint64_t for flag bitmask, we use up to bit 47 w/ x87 flags
- [x] (4ec1ac7b) OpDisp, IR: Extend InvalidateFlags to invalidate a bitfield, invalidate PF when not emulated
Debug data (#552)
- [x] (b8fd84ab) DebugData: Add Subblock list, populate it from arm64 backend
Misc (unused)
- [ ] (7095c7eb) IR: Add OP_WEAKREF
IR: Support floats in conditionals (#549)
- [x] (4f55b425) OpDisp, IR, Backends: Add Float compares to Select, CondJump, COND_F*
ConstProp: Masking elimination (#555)
- [x] (10983521) ConstProp: Add RemoveUselessMasking, implement for a few ops
- [x] (d906e0bd) ConstProp: Documented different sub-passes, remove leftover continue; statements
- [x] (f3847074) ConstProp: Special case BFE, AND in RemoveUselessMasking sub-pass
- [x] (8d322751) ConstProp: SBFE useless masking elim
- [x] (acde1468) ConstProp: Elim Bfe on ops that zext by default
- [x] (ed3fea30) ConstProp: Remove some BFEs that do nothing
- [x] (1961f9f0) ConstProp: Fix IsBfeAlreadyDone
- [x] (7315a672) ConstProp: Add some VMOV elimination
- [x] (4a7dcc64) ConstProp: More VMOV Elim
SRA (Draft: #524)
- [x] (2e2585a8) SRA: Initial scaffolding
- [x] (7328a168) SRA: Some changes on RA
- [x] (aae70dda) SRA: initial x86 backend impl
- [x] (86e97930) SRA: Fix should interpret check
- [x] (035ea518) SRA: Somewhat working impl
- [x] (c2c9a0fb) SRA: Basic aarch64 impl
- [x] (83ee6b6c) SRA: MAp 16 regs, arm64 only
- [x] (df978135) SRA: Bytemark working with 10 regs for arm64
- [x] (45e194da) SRA: Towards load-mov elim
- [x] (766b8834) SRA: load-mov elim. allocation fix
- [x] (6cec3778) SRA: load-mov elim, RA integration, runs bytemark
- [x] (d5e902ed) SRA: Fix thunks/callback + SRA spilling
- [x] (fd7ac1dc) SRA: Rework RA-side, add pre-writting, add debug prints
- [x] (bc219525) SRA: RematCost of 1 makes RA think value is OP_CONST
- [x] (5d4d697c) SRA: Experimental use of caller saved regs for SRA16
- [x] (b9aea257) RA: Don't SRA-alias global values
- [x] (e4719d01) arm64: Spill in the correct order in OP_THUNK
- [x] (ffbadfcf) SRA: WIP: Maybe some bugs fixed, more asserty
- [x] (79b987f1) SRA: WIP: More hotfix work
- [x] (306b7dd9) SRA/RA: Add Interference only on related classes
- [x] (1fea040c) SRA/RA: Re-enable pre-writes, remove spill prints
- [x] (e84f97b2) SRA: Cleanup some helpers
- [x] (2e9b7dcb) SRA: Correctly track which StoreRegister clears Prewritten
- [x] (befc3bd1) arm64: Make ror zext when size is 1 or 2
- [x] (335566d5) SRA/RA: Set re-loaded aliased spans to written as we can't track future writes to them
- [x] (3661b9ce) SRA: Wip cleanup
- [x] (3a9cf77e) SRA: Somewhat cleaned up
- [x] (c67ef51e) SRA: Float support arm64, disabled alias/prewrite
- [x] (384b02be) SRA: Enable alias/prewrite for GPR only
- [x] (ee9f83fb) SRA: Multiclass alias & prewrite
- [x] (d237fcfb) SRA: Fix SRA::IsStaticAllocFpr
- [x] (a97fa351) SRA: Cleanups
- [x] (d67efbb4) SRA: Allow some partial register aliasing
- [x] (32af1d21) RA/SRA: Better pre-write invalidation detection
- [x] (9d8cd60c) arm64: Fix OP_LOADREG w/ simd w/ nonzero offset for 4,8
Unsynchronized RIP updates (#637)
L1C (#496)
To rework
Block linking w/ backpatching (#497)
Trying out projects to keep track of this -> https://github.com/FEX-Emu/FEX/projects/2 <-