Use dynamic jump instructions when the offset is too large to fit in the immediate of static jumps
This is a bit more complicated than it seems because the register allocator does not handle dynamic jumps; in fact, I couldn't find anything about that in the literature. Dynamic jumps make "liveness analysis" really difficult to perform because we just don't know, at compile time, what the control flow graph looks like, and therefore we can't build an interference graph. The good news is that we can introduce dynamic jumps after register allocation:
- Modify the assembly pre-allocation to include "organizational ops" (just like we do before realizing jump labels) that look like `jnzi`, `jnei`, and `ji` but allow a `u64` jump offset.
- Run the register allocator as usual. Some minor changes will be required because the organizational ops above won't be like regular virtual ops anymore.
- Replace jumps with either static jumps or dynamic jumps depending on how large the offsets are. We will need extra logic before dynamic jumps to actually store the offsets into registers (see the sketch after this list).
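As a very rough illustration of that last step (all names here are hypothetical, not the real sway-core types), once addresses are realised each organizational jump could be lowered based on whether its target fits the 18-bit limit discussed above; anything larger becomes "materialise the offset in a register, then jump through it":

```rust
// Hypothetical sketch only: `LoweredJump`, `lower_jump`, and the scratch register
// are invented names for illustration, not the actual compiler implementation.

const STATIC_JUMP_IMM_BITS: u32 = 18; // the 18-bit limit discussed above (JNEI is tighter still)

fn fits_in_imm(word_offset: u64, bits: u32) -> bool {
    word_offset < (1u64 << bits)
}

enum LoweredJump {
    /// Target fits the static-jump immediate: keep a plain static jump.
    Static { imm: u32 },
    /// Target is too far: load the offset into a register first (e.g. via an
    /// immediate move, or a data-section load for very large values), then
    /// jump dynamically through that register.
    Dynamic { reg: u8, word_offset: u64 },
}

fn lower_jump(word_offset: u64, scratch_reg: u8) -> LoweredJump {
    if fits_in_imm(word_offset, STATIC_JUMP_IMM_BITS) {
        LoweredJump::Static { imm: word_offset as u32 }
    } else {
        LoweredJump::Dynamic { reg: scratch_reg, word_offset }
    }
}

fn main() {
    // A label at word offset 0x4_0000 is just beyond 18 bits, so it goes dynamic.
    match lower_jump(0x4_0000, 0x10) {
        LoweredJump::Static { imm } => println!("static jump to {imm}"),
        LoweredJump::Dynamic { reg, word_offset } => {
            println!("load {word_offset} into r{reg}, then jump via r{reg}")
        }
    }
}
```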
The above works and will be more efficient than any attempt to do register allocation with dynamic jumps.
An alternative here, which I now think might be better, is to introduce `jmp`/`jne` before allocation BUT add some metadata to those ops that tells us what's actually in the register, and use that information in the allocator. That works for us because we always know at compile time what's in the register (well, not actually always; only in the use case above, where we're only using dynamic jumps because the offset is larger than 18 bits). The benefit of this approach is that we won't have to worry about the extra logic we would otherwise have to add after register allocation in the cases where we need `jmp`/`jne`. Adding logic after allocation is actually not that trivial (what registers should that logic use?).
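To make the metadata idea a bit more concrete, here's a hedged sketch (the names `JmpKnownTarget`, `VirtualOp`, `VirtualReg`, and `Label` are invented for illustration, not the real virtual-op definitions): the pre-allocation dynamic jump carries the label it will eventually point at, so the allocator can derive CFG edges from that metadata exactly as it would for a static jump, while still treating the offset register as a use so it stays live.

```rust
// Illustrative only; not the actual virtual-op types in sway-core.

struct Label(u64);
struct VirtualReg(u32);

enum VirtualOp {
    /// A dynamic jump whose target is nevertheless known at compile time;
    /// the label is carried purely as metadata for the allocator's benefit.
    JmpKnownTarget { offset_reg: VirtualReg, target: Label },
}

impl VirtualOp {
    /// CFG successors: the metadata gives the allocator the same edge it would
    /// get from a static jump, so liveness analysis still works.
    fn successors(&self) -> Vec<Label> {
        match self {
            VirtualOp::JmpKnownTarget { target, .. } => vec![Label(target.0)],
        }
    }

    /// The offset register is a use, so the allocator preserves it into the jump.
    fn uses(&self) -> Vec<&VirtualReg> {
        match self {
            VirtualOp::JmpKnownTarget { offset_reg, .. } => vec![offset_reg],
        }
    }
}

fn main() {
    let op = VirtualOp::JmpKnownTarget {
        offset_reg: VirtualReg(7),
        target: Label(0x4_0000),
    };
    assert_eq!(op.successors().len(), 1);
    assert_eq!(op.uses().len(), 1);
}
```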
OK, so the allocator just treats the dynamic jumps as static jumps based on the metadata and preserves the destination register and we're fine?
> OK, so the allocator just treats the dynamic jumps as static jumps based on the metadata and preserves the destination register and we're fine?
Yeah, at least that's the plan 😄
@otrho is this in progress with your mega PR? Or would this be a follow-on?
I suppose this is deprioritized after your calls PR goes in anyway, since bytecode size will be much smaller.
> I suppose this is deprioritized after your calls PR goes in anyway, since bytecode size will be much smaller.
Hopefully, but this is still a problem and needs addressing. Since jump addresses are absolute, all control flow involving labels residing beyond the 18-bit range will need to be dynamic jumps. (12 bits for JNEI, although we could maybe fudge that by using EQ and JNZI pairings in that case.)
It will be less of a problem after the calls PR because we're switching the order of register allocation and address realisation. So dynamic jumps can be to labels and don't need the metadata approach @mohammadfawaz talks about above. It'll still be tricky working out which labels need to be dynamic targets or not but the register allocator isn't really a factor any more (since it'll be fine to use the 'scratch' register for the offset).
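A rough sketch of how that choice might be made per conditional jump, assuming invented names and the immediate widths mentioned above (12 bits for JNEI, 18 bits assumed for the wider JNZI-style form). The EQ+JNZI pairing is only one possible reading of the "fudge" above and would need the branch sense adjusted, since EQ followed by JNZI branches on equality rather than inequality:

```rust
// Hypothetical helper, not compiler code. Immediate widths follow the comments
// above: 12 bits for JNEI, 18 bits assumed for JNZI / the static-jump form.

const JNEI_IMM_BITS: u32 = 12;
const JNZI_IMM_BITS: u32 = 18;

enum CondJumpLowering {
    /// Target fits 12 bits: a plain `jnei a b imm`.
    Jnei,
    /// Target fits 18 bits: `eq tmp a b` then `jnzi tmp imm`, with the branch
    /// sense flipped (or an extra instruction), since this pairing branches on equality.
    EqJnzi,
    /// Otherwise: put the target in the scratch register and use a dynamic `jne`.
    DynamicJne,
}

fn choose_lowering(target_word_offset: u64) -> CondJumpLowering {
    if target_word_offset < (1u64 << JNEI_IMM_BITS) {
        CondJumpLowering::Jnei
    } else if target_word_offset < (1u64 << JNZI_IMM_BITS) {
        CondJumpLowering::EqJnzi
    } else {
        CondJumpLowering::DynamicJne
    }
}

fn main() {
    assert!(matches!(choose_lowering(0x800), CondJumpLowering::Jnei));
    assert!(matches!(choose_lowering(0x2_0000), CondJumpLowering::EqJnzi));
    assert!(matches!(choose_lowering(0x10_0000), CondJumpLowering::DynamicJne));
}
```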
I have spent time today trying to make a test under sway/test/src/ir_generation and it's not... easy.
Making a binary big enough for absolute addresses to overflow 18-bit or 24-bit offsets in Sway requires padding out the code somehow. My assumption is the only way to do this is literally with a whole lot of source, and the most direct way is with an ASM block full of NOOPs.
This assumption may be incorrect. If you have a better idea let me know. Another technique might be to create a large function, call it hundreds of times, and force it to be inlined to blow up the caller, except we have no way to force inlining presently.
I tried to make a Sway source file with an ASM block with thousands of NOOPs. It ended up being on the order of megabytes in size and took minutes to compile, even with --release. This isn't great for our test suite -- we don't want a single test to sit there for minutes, especially when all it's doing is trying to parse and compile thousands of NOOPs.
I had another idea, and implemented it to see if it works. I created a BOMB instruction which takes an immediate count and during compilation creates a block of NOOPs for us.
```sway
script;

fn main() -> u64 {
    asm(res) {
        bomb i12;
        movi res i42;
        res: u64
    }
}
```
```
: forc build --print-finalized-asm
Finished dev [unoptimized + debuginfo] target(s) in 0.12s
Running `/usr/home/toby/prj/sway/target/debug/forc build --print-finalized-asm`
Compiled library "core".
Compiled library "std".
;; --- FINAL PROGRAM ---
.program:
ji i4
noop
DATA_SECTION_OFFSET[0..32]
DATA_SECTION_OFFSET[32..64]
lw $ds $is 1
add $$ds $$ds $is
bomb i12 ; asm block
movi $r0 i42 ; asm block
ret $r0
.data:
Compiled script "bomb".
Bytecode size is 80 bytes.
Script bytecode hash: 0xb047e9f3ed44fc7016ca151b1d314b064c47c5bc10d93da990d455cbfa3f4bc0
```
```
: forc parse-bytecode out/debug/bomb.bin
Finished dev [unoptimized + debuginfo] target(s) in 0.12s
Running `/usr/home/toby/prj/sway/target/debug/forc parse-bytecode out/debug/bomb.bin`

half-word  byte  op               raw          notes
 0          0    JI(4)            90 00 00 04  jump to byte 16
 1          4    NOOP             47 00 00 00
 2          8    Undefined        00 00 00 00  data section offset lo (0)
 3         12    Undefined        00 00 00 24  data section offset hi (36)
 4         16    LW(63, 12, 1)    5d fc c0 01
 5         20    ADD(63, 63, 12)  10 ff f3 00
 6         24    NOOP             47 00 00 00
 7         28    NOOP             47 00 00 00
 8         32    NOOP             47 00 00 00
 9         36    NOOP             47 00 00 00
10         40    NOOP             47 00 00 00
11         44    NOOP             47 00 00 00
12         48    NOOP             47 00 00 00
13         52    NOOP             47 00 00 00
14         56    NOOP             47 00 00 00
15         60    NOOP             47 00 00 00
16         64    NOOP             47 00 00 00
17         68    NOOP             47 00 00 00
18         72    MOVI(16, 42)     72 40 00 2a
19         76    RET(16)          24 40 00 00
```
It should work for building a giant test binary in this case. The only issue is: is creating an instruction in the compiler just to satisfy a single test justified? The impact to the compiler source is actually fairly minimal, but conceptually it feels a bit heavy-handed. At the same time I don't know of a better solution.
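For what it's worth, the reason the compiler impact is small is that the expansion itself is essentially trivial. Purely as an illustration (invented names; only the raw NOOP word `47 00 00 00` is taken from the dump above):

```rust
// Illustration only, not the actual sway-core change: expand `bomb iN` into N NOOP
// instruction words when the final bytecode is emitted.

const NOOP_WORD: u32 = 0x4700_0000; // NOOP encodes as `47 00 00 00`, per the dump above

fn expand_bomb(count: u32, bytecode: &mut Vec<u32>) {
    // Each NOOP is one instruction word; `count` comes from the immediate operand.
    bytecode.extend(std::iter::repeat(NOOP_WORD).take(count as usize));
}

fn main() {
    let mut bytecode = Vec::new();
    expand_bomb(12, &mut bytecode); // `bomb i12` in the example above pads out 12 words
    assert_eq!(bytecode.len(), 12);
}
```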
Would loop unrolling help here?
Yeah, that would be another option, although we don't have loop unrolling yet 🙂 and this issue is on the current milestone while unrolling doesn't even get a proper mention in our optimisations roadmap.
I'm going to go ahead with this opcode (except I'll call it blob instead of bomb) until we can replace it with something better. I'll make an issue for that.
Also, I think it might be worth allowing individual e2e tests to specify what IR optimisations they want/need turned on or if they want any at all. It will be a bit of a pain if we one day introduce an optimisation which removes code we're trying to test and there's no way to stop it.
I agree with all the above. Certainly something like -opt-args -simplify-cfg could be useful.