Perf: <<ByteMark to 80% native>> Roadmap
Bytemark performance tracking: Graph
Based on the systematic profiling/perf work of the past 2 weeks, plus skmp/optihacks-1, skmp/optihacks-2 and skmp/optihacks-3, and some analysis of ByteMark today, I think the following is a good game plan for this goal:
Cleanup changes in skmp/optihacks-3.
- Compat limits: Optional PF disable, Optional InvalidateFlags for ABI crossings.
Lightweight guest branching & dispatch
Targets: Switch tables, indirect functions
Reduce Lookup overhead
The current paged lookup + alias check + validation check is clearly not optimal.
I suggest a 2-layer approach, with the first layer being a cache and the second a tree / paged tree.
For the first layer, I'd use a 24-bit lookup + alias check with a lazily allocated LUT.
Basic Structure
typedef struct { uint64_t guest; uint64_t host; } entry_t;
entry_t *LUT; // possibly mmap + segfault backed for lazy allocation
Indirect Code Lookup
_fast_lookup:
and x0, pc, ((1 << 24) - 1)
add entry_ptr, lookup_base, x0 * 16
ldp x0, x1, [entry_ptr]
cmp x0, pc
b.ne _full_lookup<pc_reg> // we need a detwiddling table here
br x1
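For clarity, here's a C-level sketch of what the fast path above boils down to, with the second layer as the miss fallback. FullLookup is a placeholder name for the layer-2 tree / paged tree, not an existing FEX function.

#include <cstdint>

struct entry_t { uint64_t guest; uint64_t host; };

uint64_t FullLookup(uint64_t GuestPC); // layer 2: tree / paged tree (placeholder)

// Sketch of the lookup logic the emitted fast path implements.
uint64_t Lookup(entry_t *LUT, uint64_t GuestPC) {
  entry_t &Entry = LUT[GuestPC & ((1u << 24) - 1)]; // layer 1: 24-bit index
  if (Entry.guest == GuestPC) {                     // alias check
    return Entry.host;
  }
  uint64_t Host = FullLookup(GuestPC);              // miss: fall back to layer 2
  Entry = {GuestPC, Host};                          // refill the cache entry
  return Host;
}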
Block linking
Blocks that are statically mapped should link to each other. I propose to use indirect branches to implement this, with the branch vectors being allocated near the block.
Basic Structure
struct BlockInfo { /* ... */ uintptr_t *StaticBranchHostPtr; uint64_t StaticBranchGuest; };
// during code emission, emit a _non_mapped_handler: .dq <default_handler addr> slot near the block and initialize StaticBranchHostPtr to point at it
Block ending for blocks that exit with CALL_DIRECT, JUMP_DIRECT
ldr x0, _non_mapped_handler // literal load of the branch slot contents (initially <default_handler>)
blr x0 // BLR is important here: the handler doesn't return, and LR identifies the exit site
Block ending for blocks that exit with RET, CALL_INDIRECT, JUMP_INDIRECT
and x0, pc, ((1 << 24) - 1)
add entry_ptr, lookup_base, x0 * 16
ldp x0, x1, [entry_ptr]
cmp x0, pc
b.ne _full_lookup<pc_reg> // we need a detwiddling table here
br x1
PC-recovery
To reduce overhead, no validation is done on the DIRECT forms, so the default handler needs to cover that case. We can recover the block from the return address (that's why BLR is needed), and then link the block.
Block link metadata
We need to keep lists of which blocks link to which for block invalidation (see the sketch below).
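A rough C++ sketch of how the default handler and link bookkeeping could fit together; FindBlockByExitSite, GetOrCompileBlock and BlockLinks are hypothetical names used for illustration, not existing FEX APIs.

#include <cstdint>
#include <map>
#include <vector>

struct BlockInfo { uintptr_t *StaticBranchHostPtr; uint64_t StaticBranchGuest; };

// Reverse links: guest block -> branch slots that point into it, so invalidating
// that guest block can reset the slots back to the default handler.
std::map<uint64_t, std::vector<uintptr_t*>> BlockLinks;

BlockInfo *FindBlockByExitSite(uintptr_t ReturnAddress); // placeholder
uintptr_t GetOrCompileBlock(uint64_t GuestRIP);          // placeholder

extern "C" uintptr_t DefaultLinkHandler(uintptr_t ReturnAddress) {
  // BLR left the address just past the exit stub in LR, which identifies the
  // exiting block (the PC-recovery step above).
  BlockInfo *Block = FindBlockByExitSite(ReturnAddress);

  // Resolve or compile the statically known guest target.
  uintptr_t HostTarget = GetOrCompileBlock(Block->StaticBranchGuest);

  // Patch the branch slot so the next execution takes the direct path.
  *Block->StaticBranchHostPtr = HostTarget;

  // Remember the link so invalidating the target can undo the patch.
  BlockLinks[Block->StaticBranchGuest].push_back(Block->StaticBranchHostPtr);

  return HostTarget; // dispatcher then jumps here
}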
Static Register Allocation
Allocate 8 or 16 GPRs statically, and do RA for SSA values on the remaining registers. Make sure to support "lifetime sharing", where an SSA value shares the host register of a guest register for as long as that mapping is valid, and generate movs as needed.
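To illustrate the lifetime-sharing rule, a minimal sketch with made-up types, not FEX's actual RA data structures:

#include <cstdint>

// An SSA value that is just a LoadRegister(guest) can live in the guest's
// statically mapped host register instead of a temporary, but only while
// neither side is redefined.
struct SSAValue {
  int GuestSource;   // guest GPR the value was loaded from, or -1
  uint32_t LastUse;  // last IR node that reads this value
};

bool CanShareStaticReg(const SSAValue &V, uint32_t NextGuestWriteNode) {
  // If the guest register is written again before the SSA value's last use,
  // the RA has to emit a mov into a fresh temporary instead of sharing.
  return V.GuestSource >= 0 && NextGuestWriteNode > V.LastUse;
}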
Multiblock
Multiple Entry Points
Right now, only the main entry point is exported to the cache. Big blocks that call other blocks should export secondary entry points at the expected return points, to avoid compiling multiple partial copies of the same function.
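Something along these lines, with hypothetical names (BlockCache, CompiledRegion) just to illustrate the idea:

#include <cstdint>
#include <unordered_map>
#include <vector>

struct EntryPoint { uint64_t GuestAddress; uintptr_t HostAddress; };

struct CompiledRegion {
  EntryPoint Main;                        // main entry of the multiblock region
  std::vector<EntryPoint> ReturnTargets;  // guest addresses where inner calls return
};

struct BlockCache {
  std::unordered_map<uint64_t, uintptr_t> Map;
  void Insert(uint64_t Guest, uintptr_t Host) { Map[Guest] = Host; }
};

// Export the main entry plus one secondary entry per expected return point, so
// returning into the middle of the region hits the cache instead of triggering
// another partial compilation of the same function.
void ExportEntryPoints(BlockCache &Cache, const CompiledRegion &Region) {
  Cache.Insert(Region.Main.GuestAddress, Region.Main.HostAddress);
  for (const auto &Ret : Region.ReturnTargets) {
    Cache.Insert(Ret.GuestAddress, Ret.HostAddress);
  }
}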
PHI nodes (possibly not needed to meet goals)
We need the RA to support PHI nodes
MB-DCLSE (possibly not needed to meet goals)
We need Dead Context Load Store Elim to generate PHI nodes
Address important pathological code gen
Shuffles are one example, and there might be a few more important cases for ByteMark.
@Sonicadvance1 @phire thoughts?
Did a first pass on the dispatch optimizations, see skmp/reduce-dispatch-overhead and skmp/optihacks-4.
Overall, ByteMark perf dropped, though I'm not 100% certain of the numbers yet. The implementation is mostly complete but not optimal, so it generates slightly larger code per OP_EXITFUNCTION.
Further optimizations on that branch:
- Helper function pools every N bytes to be able to use direct jumps
- Dispatch data pools so that structures are more cache friendly
- Patch the code to direct links whenever possible (sketch below).
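On the direct-link point: whether an exit can be patched to a direct branch depends on ARM64's B/BL reach of +/-128 MB, which is also why helper pools every N bytes help. A minimal sketch of that check, with illustrative names:

#include <cstdint>

// Decide whether a branch site can be patched to a direct B/BL, or whether it
// has to go through a nearby helper pool / indirect branch instead.
bool CanUseDirectBranch(uintptr_t BranchSite, uintptr_t Target) {
  constexpr int64_t Reach = 128 * 1024 * 1024; // B/BL immediate range on ARM64
  int64_t Distance = static_cast<int64_t>(Target) - static_cast<int64_t>(BranchSite);
  return Distance >= -Reach && Distance < Reach;
}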
UT2004 and FTL show a ~20% perf win, especially in more complex scenes.
Update: looks like there is a general performance regression, even on native ByteMark. OS issue?
Implemented multiple entry points in skmp/multiple-entry-points (on top of optihacks-4).
Perf results are mixed, with ByteMark slightly slower, FTL gaining a few FPS at complex points, and Metro noticeably slower.
This is likely because the (far) larger code gen makes the L1i issues worse. We'll likely need to hide this behind an option.
The emfloat (FP EMULATION) benchmark hits a pathological case of cmovcc/setcc; fixing it tripled its perf.
Added block sorting in the frontend to avoid out-of-order jumps in the backend.
FEX
NUMERIC SORT : 581.03 : 14.90 : 4.89
STRING SORT : 140.48 : 62.77 : 9.72
BITFIELD : 5.1261e+08 : 87.93 : 18.37
FP EMULATION : 173.49 : 83.25 : 19.21
FOURIER : 15360 : 17.47 : 9.81
ASSIGNMENT : 26.391 : 100.42 : 26.05
IDEA : 2837.5 : 43.40 : 12.89
HUFFMAN : 1348.7 : 37.40 : 11.94
NEURAL NET : 17.319 : 27.82 : 11.70
LU DECOMPOSITION : 554.5 : 28.73 : 20.74
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 52.613
FLOATING-POINT INDEX: 24.078
qemu
NUMERIC SORT : 498.19 : 12.78 : 4.20
STRING SORT : 128.88 : 57.59 : 8.91
BITFIELD : 2.9591e+08 : 50.76 : 10.60
FP EMULATION : 193.78 : 92.99 : 21.46
FOURIER : 3620 : 4.12 : 2.31
ASSIGNMENT : 16.914 : 64.36 : 16.69
IDEA : 2253.9 : 34.47 : 10.24
HUFFMAN : 1074.7 : 29.80 : 9.52
NEURAL NET : 3.8008 : 6.11 : 2.57
LU DECOMPOSITION : 128.52 : 6.66 : 4.81
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 41.976
FLOATING-POINT INDEX: 5.511
native
NUMERIC SORT : 1555.8 : 39.90 : 13.10
STRING SORT : 460.91 : 205.95 : 31.88
BITFIELD : 5.7356e+08 : 98.39 : 20.55
FP EMULATION : 660.29 : 316.84 : 73.11
FOURIER : 89283 : 101.54 : 57.03
ASSIGNMENT : 59.728 : 227.28 : 58.95
IDEA : 10575 : 161.74 : 48.02
HUFFMAN : 3933.6 : 109.08 : 34.83
NEURAL NET : 79.738 : 128.09 : 53.88
LU DECOMPOSITION : 2179.2 : 112.89 : 81.52
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 139.480
FLOATING-POINT INDEX: 113.656
Add --enable-unsafe-pass=<pass name> to allow unsafe optimization passes to be enabled from the command line.
Allow multiple of these (look at how -E is implemented) and pass them to FEXCore for the PassManager to pick up and conditionally enable.
This will be important for per-game optimization passes; a rough sketch of the argument handling is below.
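A minimal sketch of collecting repeated --enable-unsafe-pass arguments into a set the PassManager could query; the types and names here are placeholders, not FEXCore's actual interfaces:

#include <string>
#include <unordered_set>
#include <vector>

struct UnsafePassConfig {
  std::unordered_set<std::string> Enabled;
  bool IsEnabled(const std::string &Name) const { return Enabled.count(Name) != 0; }
};

// Scan the raw argument list and collect every --enable-unsafe-pass=<name>.
UnsafePassConfig ParseUnsafePasses(const std::vector<std::string> &Args) {
  const std::string Prefix = "--enable-unsafe-pass=";
  UnsafePassConfig Config;
  for (const auto &Arg : Args) {
    if (Arg.rfind(Prefix, 0) == 0) {
      Config.Enabled.insert(Arg.substr(Prefix.size()));
    }
  }
  return Config;
}

// The PassManager side would then gate each unsafe pass on Config.IsEnabled("<pass name>").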
Merges from skmp/optihacks-4
Merged
Done in IR improvements (#484)
- [x] 275c390 IR: Fix writer to handle more RA classes
Done in profile improvements (#485)
- [x] be23161 JitSymbols: Append HostAddr to name
Done in #464
- [x] b03c97c DCE: Needs reverse iteration to be effective
Done in #480
- [x] 49d6968 Arm64: Avoid branches to next block of possible
- [x] d21bb5d Useless Branch Elimination: Add OP_JUMP, Improve OP_CONDJUMP
- [x] f42c06a Frontend: Sort blocks for better branching
Done in IR/Select (#488)
- [x] 8d5beb9 Select with size
- [x] 904f358 Select: Add Size for 32/64 bit data selection
Done in Frontend/setcc/cmov (#487, #491)
- [x] ef13285 OpDisp: More efficient handling of setcc and cmov
Done in #481
- [x] f93e69c Dynamic Shifts: use Selects instead of branches for flags
Done in backend/arm64 (#483)
- [x] 1d1054d MUL: remove zext/sexts from 32 bit form
Todo in Select/Inline Consts (#488)
- [x] 7b7aa3f SELECT/arm: inline const cset, bfe & and elim
- [x] 5c9b8d1 Select: Imm for second compare argument
Todo in RCLSE improvements (#482)
- [x] 2fdb20c RCLSE: Also optimize when Access is ACCESS_PARTIAL_READ
- [x] de6d116 Some lse
- [x] 0ddaf47 More LSE fixes
Todo in enhanced CondJump branching (#490)
- [x] 702c4d3 Almost flawless change in JumpCond
- [x] 8f29da4 CondJump: Now with cmp/b.cc support
- [x] ac19576 CondJump: Only optimize for jit
- [x] 6685a28 CondJump/Arm64: Use cbnz/cbz if faster
- [x] 4e12d7b CondJump: Only optimise select if no cond is already used
Todo in #479
- [x] d5b5d5a OptPasses: Add DeadGPRStore
Todo in ABIOpts - invalidate-flags (#486)
- [x] 6261343 IR: InvalidateFlags + ABI heuristics
Todo in ABIOpts - skip-pf (#486)
- [x] 955d6fd OpDisp: Disable PF generation for ALU
Todo in Complex Load/Store Address Generation (#473)
- [x] 0cf9de3 Load/Store mem with two args
- [x] 7d1d1f2 Add arm64 backend, disable for TSO
- [x] 808cf8d Fix unified mem, Add interpreter
- [x] 184610a LoadMem/StoreMem: Add OffsetType, OffsetScale
- [x] dad065c ConstProp: Do inline constants in a separate loop
- [x] 84edced TSO: Only emit if enabled
- [x] 637b2f9 IR/Arm64: Inline Constants for MemoryOps
- [x] d67c2a3 ConstProp: Fix inline consts for load/storemem
Todo in Frontend/CC bypass (#489)
- [x] 13b7d24 Partial cmp forwarding
- [x] b670fa9 More cmp ops
- [x] 37079e0 Improved cmp/jcc forwarding
- [x] 0343f0d Generalize cmpOp to flagsOp, add test variants
- [x] a975bae jcc/cmp forwards: add support for 1,2 bytes
- [x] 3f26e7f opdisp: setcc, cmov fastpaths
Blocked on RA fixes
Todo in IR Constant Pooling (#339)
- [x] 10c9b74 ir-constant-pooling
Pending RFC feedback
Todo in BlockCache / L1C (#496)
- [x] 4b76e48 L1 cache for C++ and x86 jit
- [x] 47bd851 Dispatch on block ends, jmps instead of calls
- [x] 2951758 partial arm64 support
- [x] e3d8589 arm64: working ddisp, l1c
- [x] 9e5ff94 arm64: Remove nop sp movs
Todo in Linking (#497)
- [x] 3dd95dd ExitFunction: Now takes exit address. PC isn't stored to context until it is needed
- [x] 887c24b OP_EXIT: Inline Consts
- [x] 2d4aed6 Basic linking for const ExitFunctions. Doesn't invalidate
TODO
With first SRA impl (skmp/optihacks-5)
--------------------:------------------:-------------:------------
NUMERIC SORT : 793.81 : 20.36 : 6.69
STRING SORT : 168.68 : 75.37 : 11.67
BITFIELD : 4.6885e+08 : 80.42 : 16.80
FP EMULATION : 217.03 : 104.14 : 24.03
FOURIER : 16576 : 18.85 : 10.59
ASSIGNMENT : 31.49 : 119.83 : 31.08
IDEA : 3568.4 : 54.58 : 16.20
HUFFMAN : 1681.4 : 46.62 : 14.89
NEURAL NET : 19.84 : 31.87 : 13.41
LU DECOMPOSITION : 555.38 : 28.77 : 20.78
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 62.953
FLOATING-POINT INDEX: 25.856
SRA + some mov elim + properly cooled laptop
--------------------:------------------:-------------:------------
NUMERIC SORT : 1107.1 : 28.39 : 9.32
STRING SORT : 191.97 : 85.78 : 13.28
BITFIELD : 5.8174e+08 : 99.79 : 20.84
FP EMULATION : 281.38 : 135.02 : 31.16
FOURIER : 18921 : 21.52 : 12.09
ASSIGNMENT : 43.512 : 165.57 : 42.95
IDEA : 3967.7 : 60.68 : 18.02
HUFFMAN : 1810.1 : 50.20 : 16.03
NEURAL NET : 25.056 : 40.25 : 16.93
LU DECOMPOSITION : 673.78 : 34.91 : 25.20
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 77.339
FLOATING-POINT INDEX: 31.151
SRA + full width mov elim + cooled laptop
--------------------:------------------:-------------:------------
NUMERIC SORT : 1205.9 : 30.93 : 10.16
STRING SORT : 208.36 : 93.10 : 14.41
BITFIELD : 5.8431e+08 : 100.23 : 20.94
FP EMULATION : 300.39 : 144.14 : 33.26
FOURIER : 19606 : 22.30 : 12.52
ASSIGNMENT : 45.535 : 173.27 : 44.94
IDEA : 4304.7 : 65.84 : 19.55
HUFFMAN : 1947.8 : 54.01 : 17.25
NEURAL NET : 25.579 : 41.09 : 17.28
LU DECOMPOSITION : 671.91 : 34.81 : 25.14
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 82.326
FLOATING-POINT INDEX: 31.711
libpng decode mainloop translation example -- codegen is starting to look quite optimal in some cases
str x4, [x24, #48]
mov x4, #0x3f51 // #16209
ldr x5, [x28, #88]
str w4, [x5, #8]
b 0xffefde33a2e8
ldur w4, [x26, #-60]
sub x21, x27, x4
b 0xffefde33a9a8
mov x27, x20
mov x21, x25
ldrb w4, [x21]
ldr w5, [x28, #96]
sub w5, w5, #0x3
str x5, [x28, #96]
add x25, x21, #0x3
strb w4, [x27]
ldrb w4, [x21, #1]
strb w4, [x27, #1]
ldrb w4, [x21, #2]
str x4, [x28, #104]
add x20, x27, #0x3
sturb w4, [x20, #-1]
cmp w5, #0x2
b.hi 0xffefde33a9a0 // b.pmore
ldr w4, [x28, #96]
cbz w4, 0xffefde33aa70
ldrb w20, [x21, #3]
ldr w4, [x28, #96]
strb w20, [x27, #3]
cmp w4, #0x2
b.ne 0xffefde33ae64 // b.any
ldrb w21, [x21, #4]
With experimental SRA16 (uses 6 caller-saved regs)
--------------------:------------------:-------------:------------
NUMERIC SORT : 1342.4 : 34.43 : 11.31
STRING SORT : 205.16 : 91.67 : 14.19
BITFIELD : 5.8536e+08 : 100.41 : 20.97
FP EMULATION : 289.51 : 138.92 : 32.06
FOURIER : 18247 : 20.75 : 11.66
ASSIGNMENT : 55.266 : 210.30 : 54.55
IDEA : 4801 : 73.43 : 21.80
HUFFMAN : 3047.4 : 84.51 : 26.99
NEURAL NET : 25.602 : 41.13 : 17.30
LU DECOMPOSITION : 674.25 : 34.93 : 25.22
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 92.386
FLOATING-POINT INDEX: 31.006
The slight drops in some cases are probably because of register pressure with only 9 temps available. I'll investigate further after getting it to pass FTL + UT2004.
With SRA16+16, some frontend improvements
--------------------:------------------:-------------:------------
NUMERIC SORT : 1387.8 : 35.59 : 11.69
STRING SORT : 264.09 : 118.00 : 18.26
BITFIELD : 5.8385e+08 : 100.15 : 20.92
FP EMULATION : 292.71 : 140.45 : 32.41
FOURIER : 32888 : 37.40 : 21.01
ASSIGNMENT : 59.544 : 226.58 : 58.77
IDEA : 4829 : 73.86 : 21.93
HUFFMAN : 3054.2 : 84.69 : 27.05
NEURAL NET : 30.42 : 48.87 : 20.56
LU DECOMPOSITION : 1044.3 : 54.10 : 39.07
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 97.495
FLOATING-POINT INDEX: 46.241
With (skmp/optihacks-6)
- Useless masking + (unstable) relaxed aliasing
- VMOV/VINS backend ops
- (some) VMOV elimination
- Better SRA-prewrite tracking
--------------------:------------------:-------------:------------
NUMERIC SORT : 1359.2 : 34.86 : 11.45
STRING SORT : 305.83 : 136.65 : 21.15
BITFIELD : 5.8103e+08 : 99.67 : 20.82
FP EMULATION : 304.13 : 145.93 : 33.67
FOURIER : 37772 : 42.96 : 24.13
ASSIGNMENT : 57.844 : 220.11 : 57.09
IDEA : 5539.5 : 84.72 : 25.16
HUFFMAN : 3324.6 : 92.19 : 29.44
NEURAL NET : 53.144 : 85.37 : 35.91
LU DECOMPOSITION : 1978.6 : 102.50 : 74.01
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 102.529
FLOATING-POINT INDEX: 72.167
With (skmp/optihacks-6)
- BFE elim
- Flags Load/Store optimizations
- fcmp fastpaths
- OP_FCMP optimization
--------------------:------------------:-------------:------------
NUMERIC SORT : 1407.5 : 36.10 : 11.85
STRING SORT : 358.54 : 160.20 : 24.80
BITFIELD : 5.8023e+08 : 99.53 : 20.79
FP EMULATION : 304.38 : 146.05 : 33.70
FOURIER : 42118 : 47.90 : 26.90
ASSIGNMENT : 58.167 : 221.34 : 57.41
IDEA : 5654.9 : 86.49 : 25.68
HUFFMAN : 3339.4 : 92.60 : 29.57
NEURAL NET : 58.697 : 94.29 : 39.66
LU DECOMPOSITION : 2153 : 111.54 : 80.54
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 105.863
FLOATING-POINT INDEX: 79.565
Merges from skmp/optihacks-6
2nd merge wave ~
Merged
Constprop
Mul to Lshl (#553)
- [x] (57e7a997) ConstProp: Mul x,#powof2 -> Lshl x,clz(#powof2)
CMT around Select (#554)
- [x] (2dd38128) CMT: Move unary ops before selects, if both True & False values are consts
FCMP optimization (#556)
- [x] (df624b07) ConstProp: FCMP Optimization
- [x] (acf426fb) ConstProp: Fix checks on FCMP
Memop imm pooling (#557)
- [x] (958dd617) ConstProp: LDR imm pooling
LSE (#544)
- [x] (8b175880) LSE: Also optimize away partial writes
Improve SS/SD frontend (#543)
- [x] (3b6bba64) OpDisp: Avoid some pointless vinserts
- [x] (d155c08f) x86tables: Mark some SD sse opcodes as 64-bit to reduce overhead
Improve REP frontend (#542)
- [x] (d7ec1ea8) OpDisp: Read DF outside REP loops as it won't change
backend opt (#551)
- [x] (e9bc02ac) arm64: Optimize VIns* to only use temp reg if it must
- [x] (ebb34e1f) arm64: Avoid needless movs in OP_VMOV
Dead FPR Store Elim (#545)
- [x] (6d8913ae) IR: Add a rather ieffective DeadFPRStoreElim
- [x] (b9653006) IR: DeadFPRStoreElim
IR changes
Sext as a special case of Sbfe (#546)
- [x] (9168a1a4) IR: Replace Sext with Sbfe
Move SetFlag logic to IR from backends (#547)
- [x] (63e05cdf) IR: Move BFEs to SetFlag, remove from OP_STOREFLAG
FCMP: Move complexity to backend where it is easier to handle it (#548)
- [x] (f98cd0a3) IR: Change FCMP to return x86-style flags
Already merged to master
- [x] (ce2dda80) OpDisp: Flag calculation bypass
- [x] (74254a1a) IR: Load/Store mem with two args
- [x] (2ff55a04) Add arm64 backend, disable for TSO
- [x] (f0cedab3) Fix unified mem, Add interpreter
- [x] (9fdd2a5b) LoadMem/StoreMem: Add OffsetType, OffsetScale
- [x] (552f5db0) TSO: Only emit if enabled
- [x] (703a025a) IR/Arm64: Inline Constants for MemoryOps
- [x] (b82c8b9e) ConstProp: Fix inline consts for load/storemem
- [x] (0e329f4e) Load/StoreMem: Cleanup Complex address gen
- [x] (d7191891) Load/Store: Use ->Addr and ->Offset in Interpreter
- [x] (f3817991) IR: Swith MemoryOffsetType to a type from uint8_t
- [x] (c5316b1f) IR Loader: Add support for MemOffsetType, Fix %Invalid
- [x] (1073d405) IR Tests: Update LoadMem to use extended syntax
- [x] (4ec41088) Typofix: UnwarpNode -> UnwrapNode
- [x] (92eb1405) OpDisp: Rename TSO helper to _Load/_StoreAutoTSO
- [x] (e228295f) Jit: Remove conditional TSO on backends
Dead flags (#550)
- [x] (d90930aa) DFSE: Use uint64_t for flag bitmask, we use up to bit 47 w/ x87 flags
- [x] (4ec1ac7b) OpDisp, IR: Extend InvalidateFlags to invalidate a bitfield, invalidate PF when not emulated
Debug data (#552)
- [x] (b8fd84ab) DebugData: Add Subblock list, populate it from arm64 backend
Misc (unused)
- [ ] (7095c7eb) IR: Add OP_WEAKREF
IR: Support floats in conditionals (#549)
- [x] (4f55b425) OpDisp, IR, Backends: Add Float compares to Select, CondJump, COND_F*
ConstProp: Masking elimination (#555)
- [x] (10983521) ConstProp: Add RemoveUselessMasking, implement for a few ops
- [x] (d906e0bd) ConstProp: Documented different sub-passes, remove leftover continue; statements
- [x] (f3847074) ConstProp: Special case BFE, AND in RemoveUselessMasking sub-pass
- [x] (8d322751) ConstProp: SBFE useless masking elim
- [x] (acde1468) ConstProp: Elim Bfe on ops that zext by default
- [x] (ed3fea30) ConstProp: Remove some BFEs that do nothing
- [x] (1961f9f0) ConstProp: Fix IsBfeAlreadyDone
- [x] (7315a672) ConstProp: Add some VMOV elimination
- [x] (4a7dcc64) ConstProp: More VMOV Elim
SRA (Draft: #524)
- [x] (2e2585a8) SRA: Initial scaffolding
- [x] (7328a168) SRA: Some changes on RA
- [x] (aae70dda) SRA: initial x86 backend impl
- [x] (86e97930) SRA: Fix should interpret check
- [x] (035ea518) SRA: Somewhat working impl
- [x] (c2c9a0fb) SRA: Basic aarch64 impl
- [x] (83ee6b6c) SRA: MAp 16 regs, arm64 only
- [x] (df978135) SRA: Bytemark working with 10 regs for arm64
- [x] (45e194da) SRA: Towards load-mov elim
- [x] (766b8834) SRA: load-mov elim. allocation fix
- [x] (6cec3778) SRA: load-mov elim, RA integration, runs bytemark
- [x] (d5e902ed) SRA: Fix thunks/callback + SRA spilling
- [x] (fd7ac1dc) SRA: Rework RA-side, add pre-writting, add debug prints
- [x] (bc219525) SRA: RematCost of 1 makes RA think value is OP_CONST
- [x] (5d4d697c) SRA: Experimental use of caller saved regs for SRA16
- [x] (b9aea257) RA: Don't SRA-alias global values
- [x] (e4719d01) arm64: Spill in the correct order in OP_THUNK
- [x] (ffbadfcf) SRA: WIP: Maybe some bugs fixed, more asserty
- [x] (79b987f1) SRA: WIP: More hotfix work
- [x] (306b7dd9) SRA/RA: Add Interference only on related classes
- [x] (1fea040c) SRA/RA: Re-enable pre-writes, remove spill prints
- [x] (e84f97b2) SRA: Cleanup some helpers
- [x] (2e9b7dcb) SRA: Correctly track which StoreRegister clears Prewritten
- [x] (befc3bd1) arm64: Make ror zext when size is 1 or 2
- [x] (335566d5) SRA/RA: Set re-loaded aliased spans to written as we can't track future writes to them
- [x] (3661b9ce) SRA: Wip cleanup
- [x] (3a9cf77e) SRA: Somewhat cleaned up
- [x] (c67ef51e) SRA: Float support arm64, disabled alias/prewrite
- [x] (384b02be) SRA: Enable alias/prewrite for GPR only
- [x] (ee9f83fb) SRA: Multiclass alias & prewrite
- [x] (d237fcfb) SRA: Fix SRA::IsStaticAllocFpr
- [x] (a97fa351) SRA: Cleanups
- [x] (d67efbb4) SRA: Allow some partial register aliasing
- [x] (32af1d21) RA/SRA: Better pre-write invalidation detection
- [x] (9d8cd60c) arm64: Fix OP_LOADREG w/ simd w/ nonzero offset for 4,8
Unsynchronized RIP updates (#637)
L1C (#496)
To rework
Block linking w/ backpatching (#497)
Trying out projects to keep track of this -> https://github.com/FEX-Emu/FEX/projects/2 <-