Use dynamic jump instructions when the offset is too large to fit in the immediate of static jumps
This is a bit more complicated than it seems because the register allocator does not handle dynamic jumps; in fact, I couldn't find anything about that in the literature. Dynamic jumps make "liveness analysis" really difficult to perform because we just don't know, at compile time, what the control flow graph looks like, and therefore we can't build an interference graph. The good news is that we can introduce dynamic jumps after register allocation:
- Modify the assembly pre-allocation to include "organizational ops" (just like we do before realizing jump labels) that look like `jnzi`, `jnei`, and `ji` but allow a `u64` jump offset.
- Run the register allocator as usual. Some minor changes will be required because the organizational ops above won't be like regular virtual ops anymore.
- Replace jumps with either static jumps or dynamic jumps depending on how large the offsets are. We will need extra logic before dynamic jumps to actually store the offsets into registers (see the sketch after this list).
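As a very rough illustration of that last step (all names here are hypothetical, not the real sway-core types), once addresses are realised each organizational jump could be lowered based on whether its target fits the 18-bit limit discussed above; anything larger becomes "materialise the offset in a register, then jump through it":

```rust
// Hypothetical sketch only: `LoweredJump`, `lower_jump`, and the scratch register
// are invented names for illustration, not the actual compiler implementation.

const STATIC_JUMP_IMM_BITS: u32 = 18; // the 18-bit limit discussed above (JNEI is tighter still)

fn fits_in_imm(word_offset: u64, bits: u32) -> bool {
    word_offset < (1u64 << bits)
}

enum LoweredJump {
    /// Target fits the static-jump immediate: keep a plain static jump.
    Static { imm: u32 },
    /// Target is too far: load the offset into a register first (e.g. via an
    /// immediate move, or a data-section load for very large values), then
    /// jump dynamically through that register.
    Dynamic { reg: u8, word_offset: u64 },
}

fn lower_jump(word_offset: u64, scratch_reg: u8) -> LoweredJump {
    if fits_in_imm(word_offset, STATIC_JUMP_IMM_BITS) {
        LoweredJump::Static { imm: word_offset as u32 }
    } else {
        LoweredJump::Dynamic { reg: scratch_reg, word_offset }
    }
}

fn main() {
    // A label at word offset 0x4_0000 is just beyond 18 bits, so it goes dynamic.
    match lower_jump(0x4_0000, 0x10) {
        LoweredJump::Static { imm } => println!("static jump to {imm}"),
        LoweredJump::Dynamic { reg, word_offset } => {
            println!("load {word_offset} into r{reg}, then jump via r{reg}")
        }
    }
}
```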
The above works and will be more efficient than any attempt to do register allocation with dynamic jumps.
An alternative here, which I now think might be better, is to introduce `jmp`/`jne` before allocation BUT add some metadata to those ops that tells us what's actually in the register, and use that information in the allocator. That works for us because we always know at compile time what's in the register (well, not actually always; only in the use case above, where we're only using dynamic jumps because the offset is larger than 18 bits). The benefit of this approach is that we won't have to worry about the extra logic we would otherwise have to add after register allocation in the cases where we need `jmp`/`jne`. Adding logic after allocation is actually not that trivial (what registers should that logic use?).
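To make the metadata idea a bit more concrete, here's a hedged sketch (the names `JmpKnownTarget`, `VirtualOp`, `VirtualReg`, and `Label` are invented for illustration, not the real virtual-op definitions): the pre-allocation dynamic jump carries the label it will eventually point at, so the allocator can derive CFG edges from that metadata exactly as it would for a static jump, while still treating the offset register as a use so it stays live.

```rust
// Illustrative only; not the actual virtual-op types in sway-core.

struct Label(u64);
struct VirtualReg(u32);

enum VirtualOp {
    /// A dynamic jump whose target is nevertheless known at compile time;
    /// the label is carried purely as metadata for the allocator's benefit.
    JmpKnownTarget { offset_reg: VirtualReg, target: Label },
}

impl VirtualOp {
    /// CFG successors: the metadata gives the allocator the same edge it would
    /// get from a static jump, so liveness analysis still works.
    fn successors(&self) -> Vec<Label> {
        match self {
            VirtualOp::JmpKnownTarget { target, .. } => vec![Label(target.0)],
        }
    }

    /// The offset register is a use, so the allocator preserves it into the jump.
    fn uses(&self) -> Vec<&VirtualReg> {
        match self {
            VirtualOp::JmpKnownTarget { offset_reg, .. } => vec![offset_reg],
        }
    }
}

fn main() {
    let op = VirtualOp::JmpKnownTarget {
        offset_reg: VirtualReg(7),
        target: Label(0x4_0000),
    };
    assert_eq!(op.successors().len(), 1);
    assert_eq!(op.uses().len(), 1);
}
```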
OK, so the allocator just treats the dynamic jumps as static jumps based on the metadata and preserves the destination register and we're fine?
> OK, so the allocator just treats the dynamic jumps as static jumps based on the metadata and preserves the destination register and we're fine?
Yeah, at least that's the plan 😄
@otrho is this in progress with your mega PR? Or would this be a follow-on?
I suppose this is deprioritized after your calls PR goes in anyway, since bytecode size will be much smaller.
> I suppose this is deprioritized after your calls PR goes in anyway, since bytecode size will be much smaller.
Hopefully, but this is still a problem and needs addressing. Since jump addresses are absolute, all control flow involving labels residing beyond the 18-bit range will need to be dynamic jumps. (12 bits for JNEI, although we could maybe fudge that by using EQ and JNZI pairings in that case.)
It will be less of a problem after the calls PR because we're switching the order of register allocation and address realisation. So dynamic jumps can be to labels and don't need the metadata approach @mohammadfawaz talks about above. It'll still be tricky working out which labels need to be dynamic targets or not but the register allocator isn't really a factor any more (since it'll be fine to use the 'scratch' register for the offset).
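A rough sketch of how that choice might be made per conditional jump, assuming invented names and the immediate widths mentioned above (12 bits for JNEI, 18 bits assumed for the wider JNZI-style form). The EQ+JNZI pairing is only one possible reading of the "fudge" above and would need the branch sense adjusted, since EQ followed by JNZI branches on equality rather than inequality:

```rust
// Hypothetical helper, not compiler code. Immediate widths follow the comments
// above: 12 bits for JNEI, 18 bits assumed for JNZI / the static-jump form.

const JNEI_IMM_BITS: u32 = 12;
const JNZI_IMM_BITS: u32 = 18;

enum CondJumpLowering {
    /// Target fits 12 bits: a plain `jnei a b imm`.
    Jnei,
    /// Target fits 18 bits: `eq tmp a b` then `jnzi tmp imm`, with the branch
    /// sense flipped (or an extra instruction), since this pairing branches on equality.
    EqJnzi,
    /// Otherwise: put the target in the scratch register and use a dynamic `jne`.
    DynamicJne,
}

fn choose_lowering(target_word_offset: u64) -> CondJumpLowering {
    if target_word_offset < (1u64 << JNEI_IMM_BITS) {
        CondJumpLowering::Jnei
    } else if target_word_offset < (1u64 << JNZI_IMM_BITS) {
        CondJumpLowering::EqJnzi
    } else {
        CondJumpLowering::DynamicJne
    }
}

fn main() {
    assert!(matches!(choose_lowering(0x800), CondJumpLowering::Jnei));
    assert!(matches!(choose_lowering(0x2_0000), CondJumpLowering::EqJnzi));
    assert!(matches!(choose_lowering(0x10_0000), CondJumpLowering::DynamicJne));
}
```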
I have spent time today trying to make a test under sway/test/src/ir_generation and it's not... easy.
Making a binary big enough for absolute addresses to overflow 18-bit or 24-bit offsets in Sway requires padding out the code somehow. My assumption is the only way to do this is literally with a whole lot of source, and the most direct way is with an ASM block full of NOOPs.
This assumption may be incorrect. If you have a better idea let me know. Another technique might be to create a large function, call it hundreds of times, and force it to be inlined to blow up the caller, except we have no way to force inlining presently.
I tried to make a Sway source file with an ASM block with thousands of NOOPs. It ended up being on the order of megabytes in size and took minutes to compile, even with --release. This isn't great for our test suite -- we don't want a single test to sit there for minutes, especially when all it's doing is trying to parse and compile thousands of NOOPs.
I had another idea, and implemented it to see if it works. I created a BOMB instruction which takes an immediate count and during compilation creates a block of NOOPs for us.
```sway
script;

fn main() -> u64 {
    asm(res) {
        bomb i12;
        movi res i42;
        res: u64
    }
}
```
```
: forc build --print-finalized-asm
Finished dev [unoptimized + debuginfo] target(s) in 0.12s
Running `/usr/home/toby/prj/sway/target/debug/forc build --print-finalized-asm`
Compiled library "core".
Compiled library "std".
;; --- FINAL PROGRAM ---
.program:
ji i4
noop
DATA_SECTION_OFFSET[0..32]
DATA_SECTION_OFFSET[32..64]
lw $ds $is 1
add $$ds $$ds $is
bomb i12 ; asm block
movi $r0 i42 ; asm block
ret $r0
.data:
Compiled script "bomb".
Bytecode size is 80 bytes.
Script bytecode hash: 0xb047e9f3ed44fc7016ca151b1d314b064c47c5bc10d93da990d455cbfa3f4bc0
```
```
: forc parse-bytecode out/debug/bomb.bin
Finished dev [unoptimized + debuginfo] target(s) in 0.12s
Running `/usr/home/toby/prj/sway/target/debug/forc parse-bytecode out/debug/bomb.bin`

half-word  byte  op               raw          notes
 0          0    JI(4)            90 00 00 04  jump to byte 16
 1          4    NOOP             47 00 00 00
 2          8    Undefined        00 00 00 00  data section offset lo (0)
 3         12    Undefined        00 00 00 24  data section offset hi (36)
 4         16    LW(63, 12, 1)    5d fc c0 01
 5         20    ADD(63, 63, 12)  10 ff f3 00
 6         24    NOOP             47 00 00 00
 7         28    NOOP             47 00 00 00
 8         32    NOOP             47 00 00 00
 9         36    NOOP             47 00 00 00
10         40    NOOP             47 00 00 00
11         44    NOOP             47 00 00 00
12         48    NOOP             47 00 00 00
13         52    NOOP             47 00 00 00
14         56    NOOP             47 00 00 00
15         60    NOOP             47 00 00 00
16         64    NOOP             47 00 00 00
17         68    NOOP             47 00 00 00
18         72    MOVI(16, 42)     72 40 00 2a
19         76    RET(16)          24 40 00 00
```
It should work for building a giant test binary in this case. The only issue is: is creating an instruction in the compiler just to satisfy a single test justified? The impact to the compiler source is actually fairly minimal, but conceptually it feels a bit heavy-handed. At the same time I don't know of a better solution.
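For what it's worth, the reason the compiler impact is small is that the expansion itself is essentially trivial. Purely as an illustration (invented names; only the raw NOOP word `47 00 00 00` is taken from the dump above):

```rust
// Illustration only, not the actual sway-core change: expand `bomb iN` into N NOOP
// instruction words when the final bytecode is emitted.

const NOOP_WORD: u32 = 0x4700_0000; // NOOP encodes as `47 00 00 00`, per the dump above

fn expand_bomb(count: u32, bytecode: &mut Vec<u32>) {
    // Each NOOP is one instruction word; `count` comes from the immediate operand.
    bytecode.extend(std::iter::repeat(NOOP_WORD).take(count as usize));
}

fn main() {
    let mut bytecode = Vec::new();
    expand_bomb(12, &mut bytecode); // `bomb i12` in the example above pads out 12 words
    assert_eq!(bytecode.len(), 12);
}
```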
Would loop unrolling help here?
Yeah, that would be another option, although we don't have loop unrolling yet 🙂 and this issue is on the current milestone while unrolling doesn't even get a proper mention in our optimisations roadmap.
I'm going to go ahead with this opcode (except I'll call it blob instead of bomb) until we can replace it with something better. I'll make an issue for that.
Also, I think it might be worth allowing individual e2e tests to specify what IR optimisations they want/need turned on or if they want any at all. It will be a bit of a pain if we one day introduce an optimisation which removes code we're trying to test and there's no way to stop it.
I agree with all the above. Certainly something like -opt-args -simplify-cfg could be useful.