New Register Allocator
Migrating from SSA to SVD/local-SSA/GVN
Targets:
- Improved Code Gen (#87)
- Vastly reduced complexity
- Linear, parellisable allocation (#85)
- Native support for flexible SRA (#608, #609, #610, #611)
- Support context reconstruction and midway exit and multiple entry points (#607, skmp/multiple-entry-points)
- https://github.com/FEX-Emu/FEX/issues/1717
- https://github.com/FEX-Emu/FEX/issues/1655
Wishes
- Better partial allocations
- Native support for separate vregs and pregs
- #1791
skmp — 06/28/2022
(from #coding)
so for the new reg alloc
the idea is to separate values from live ranges and define "live rangelets" which are spans between uses, essentially places where the value needs to survive essentially, getting out of SSA form the idea is to rewrite the SSA arguments to non-ssa arguments each ssa argument is 32 bits, we can pack a lot of information there for regs (possibly up to 4 registers) which is the max amount under aarch64 (LD4 ops, and those are not actually distinct anyway)
this way, we can handle GPRPairs nicely and natively and we remove one indirect lookup per emission now, on the gen side from SSA, we can keep u32 block[SSA_COUNT]; u32 vregs[SSA_ARG_COUNT] and the span start/end logic for simplicity, we can store it in the current RADATA but after allocation vregs can be copied directly to the IR and IR data can become DebugData for values
now, the nice thing is that the source is SSA so we are tracking values to span, which means no reuse of variables, which enables for distinct spill slots so spills can be transformed to set<SSAid> spillslots; and to spill, we do index = spillslots.insert(value); use sp + index * 16
we can also track the spill size for more optimal stack allocation spill offsets need to be filled later on, in the last pass, so that is easy the rangelets start/stop with existing range start/stop we can also supply rangelet constrains though we might not always be able to satisfy them without extra logic this can be used to implement SRA as well as x86's 2-way addressing
now, on the allocation front we need a better way to handle global values my idea was to introduce another abstraction "phy slots" that are essentially the imports/exports from each block each phy slot has the same id as the ssa that it was created from
when iterating over blocks, we can create rangelets for these opcodes, which start / end around the block (BLOCK_START/END) ops those rangelets will be spilled like all the others the only constrains is that based on the CFG, phys all need to point to the same physical location and, that they are allocated after all local values then, the phi opcode can be modified to generate the right phy mappings and that is all
the main advantage is: much, much simpler logic
I think it will end up being half the size of the current RA, and easier to reason about easily supports phis, constrains, and SRA, and in a unified manner doesn't have to restart after spills as spilling is done in rangelets, not values doesn't need the IR to be compacted and will run in linear time, from nlogn time
There's a lot further thinking on this, braindump from today
What is it all about
We want fully separated "names" from "values". A "name" is any storage location.
Names only can be "global" (registers, memory) they can be volatile or preserved over operations.
A "value" is a result of a tree of operations.
Values can be anonymous, when they have a distinct but unspecified name.
Interworking
Currently, we follow a static inter fragment abi, though it would be interesting to follow a more flexible model in the future. This means, that upon fragment exit the architectural state must reside at predetermined names.
Recovering state mid-execution is also quite important. While we don't need to recover state at every instruction boundary, we do need to do so in instructions that generate exceptions (readm, writem, div, into; floating points, too).
In ir these can be presented as 'branches', with Next and Fault paths.
Recovery can be up to the last synchronizing instruction, usually a writem. There are various models that can be applied there, divided across general vs re-entrant flows and sealed vs meta operation leakage. Recovery trampolines can be either pre-compiled in special sections, materialized on demand from source, interpreted, etc.
Similarly, asynchronous interruption can be of variable granularity, at the expense of meta data leakage through PC quantisation. My guestimate is that loop edge granularity should be enough.
Advanced optimisations wishlist
- Linear Separation of Orthogonal Addressability Domains: Each memory operation can be written as belonging to a single or multiple linear addressability domains, as expressed by the number of base registers during address calculation. We can use this to infer memory operation interdependence, and reduce value storage amplification. Due to x86's memory model, this will need to be tuned though heuristics and compatibility modes.
- Deglobalisation of local memory: Stack could be assumed to be per-thread
- Single Value Naming: We only need to keep a single copy of a value to guarantee forward progress.
- [Relaxed] Name Aliasing: We can choose to spill at the 'most convenient' location. For local memory and re-entrant flows this can be substantially relaxed.
Handling Constrains
For each instruction we need to know
- Any allocation constraints between arguments, results
- Any register volatility
Handling global allocation
interblock allocation will happen only through names.