sdk-ng Code generation bug with branch delay slots on ARC

Code generation bug with branch delay slots on ARC

Open andyross opened this issue 1 year ago • 2 comments

(Cut/pasted from https://github.com/zephyrproject-rtos/zephyr/issues/54720)

As of recent commits, qemu_arc_em is failing in the tests/kernel/mem_protect/syscalls test with:

START - test_syscall_torture
Running syscall torture test with 4 threads on 1 cpu(s)
E: ***** Exception vector: 0x2, cause code: 0x1, parameter 0x0
E: Address 0x800014a4
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x804009f0 (unknown)
E: Halting system

The proximate cause was (hilariously) that the patch count since the last release candidate had reached 100. This caused the version string printed by the boot banner to be one byte longer and exposed the bug. (I actually got a test rig created where two Zephyr binaries that differed ONLY in whether the last byte of a fake banner string was a "x" or a newline would differ in crash behavior).

But as it turns out that's all just timing interaction. The real problem happens in the heap code, and is a compiler bug. @ruuddw pointed out that the fault (at 0x800014a4) is actually flagging an illegal instruction in a branch delay slot. And indeed, the disassembly below shows that the faulting instruction is a ENTER_S (one of the forbidden instructions), and that the instruction immediately preceding is indeed an unconditional branch!

This is a clear optimizer bug. The generated code can't be allowed to emit a linker section that ends on an instruction with a branch delay slot, because it can't control the instruction the linker will place next. And indeed, stuffing a NOP instruction at the end of the function fixes the symptom

8000146c <merge_chunks>:
{
8000146c:       c2e8                    enter_s [r13-r16,blink]
8000146e:       4748                    mov_s   r15,r2
80001470:       4508                    mov_s   r13,r0
80001472:       4030                    mov_s   r16,r1
        chunksz_t newsz = chunk_size(h, lc) + chunk_size(h, rc);
80001474:       0e1e ffcf               bl      -484    ;80001290 <chunk_size>
80001478:       41e1                    mov_s   r1,r15
8000147a:       4608                    mov_s   r14,r0
8000147c:       40a1                    mov_s   r0,r13
8000147e:       0e16 ffcf               bl      -492    ;80001290 <chunk_size>
        set_chunk_size(h, lc, newsz);
80001482:       4102                    mov_s   r1,r16
        chunksz_t newsz = chunk_size(h, lc) + chunk_size(h, rc);
80001484:       661e                    add_s   r14,r14,r0
        set_chunk_size(h, lc, newsz);
80001486:       40a1                    mov_s   r0,r13
80001488:       42c1                    mov_s   r2,r14
8000148a:       0e4a ffcf               bl      -440    ;800012d0 <set_chunk_siz
e>
        return c + chunk_size(h, c);
8000148e:       41e1                    mov_s   r1,r15
80001490:       40a1                    mov_s   r0,r13
80001492:       0e02 ffcf               bl      -512    ;80001290 <chunk_size>
        chunk_set(h, c, LEFT_SIZE, size);
80001496:       43c1                    mov_s   r3,r14
80001498:       6719                    add_s   r1,r15,r0
8000149a:       704c                    mov_s   r2,0
8000149c:       40a1                    mov_s   r0,r13
}
8000149e:       c2c8                    leave_s [r13-r16,blink]
800014a0:       05d1 ffcf               b       -560    ;80001270 <chunk_set>

800014a4 <free_list_add>:
{
800014a4:       c2e8                    enter_s [r13-r16,blink]
        return big_heap_chunks(h->end_chunk);
800014a6:       80e2                    ld_s    r15,[r0,0x8]
800014a8:       4628                    mov_s   r14,r1
800014aa:       4508                    mov_s   r13,r0

Feb 10 '23 18:02 andyross

Note that as pointed out be @abrodkin the branch instruction at 0x800014a0 is in fact the non-delay-slot variant that does not execute the next instruction on taken branches (the other variant would disassemble as "b.d").

And if that's the case, then this bug is actually in qemu and not gcc. Though I guess that's still against the toolchain so it should stay open.

[1] Apparently ARC has... both? Not sure how that provides value since the whole idea behind delay slots is that they mean you don't need interlocks in the pipeline. Now ARC needs the pipeline handling and has extra instructions to decode?

Feb 10 '23 19:02 andyross

[1] best of both worlds: performance and code density in case the cycle for the branch cannot be used with a branch delay slot instruction. Saves a nop instruction.

Feb 10 '23 20:02 ruuddw

sdk-ng sdk-ng copied to clipboard

Code generation bug with branch delay slots on ARC

sdk-ng
sdk-ng copied to clipboard