sdk-ng
sdk-ng copied to clipboard
Code generation bug with branch delay slots on ARC
(Cut/pasted from https://github.com/zephyrproject-rtos/zephyr/issues/54720)
As of recent commits, qemu_arc_em is failing in the tests/kernel/mem_protect/syscalls test with:
START - test_syscall_torture
Running syscall torture test with 4 threads on 1 cpu(s)
E: ***** Exception vector: 0x2, cause code: 0x1, parameter 0x0
E: Address 0x800014a4
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x804009f0 (unknown)
E: Halting system
The proximate cause was (hilariously) that the patch count since the last release candidate had reached 100. This caused the version string printed by the boot banner to be one byte longer and exposed the bug. (I actually got a test rig created where two Zephyr binaries that differed ONLY in whether the last byte of a fake banner string was a "x" or a newline would differ in crash behavior).
But as it turns out that's all just timing interaction. The real problem happens in the heap code, and is a compiler bug. @ruuddw pointed out that the fault (at 0x800014a4) is actually flagging an illegal instruction in a branch delay slot. And indeed, the disassembly below shows that the faulting instruction is a ENTER_S (one of the forbidden instructions), and that the instruction immediately preceding is indeed an unconditional branch!
This is a clear optimizer bug. The generated code can't be allowed to emit a linker section that ends on an instruction with a branch delay slot, because it can't control the instruction the linker will place next. And indeed, stuffing a NOP instruction at the end of the function fixes the symptom
8000146c <merge_chunks>:
{
8000146c: c2e8 enter_s [r13-r16,blink]
8000146e: 4748 mov_s r15,r2
80001470: 4508 mov_s r13,r0
80001472: 4030 mov_s r16,r1
chunksz_t newsz = chunk_size(h, lc) + chunk_size(h, rc);
80001474: 0e1e ffcf bl -484 ;80001290 <chunk_size>
80001478: 41e1 mov_s r1,r15
8000147a: 4608 mov_s r14,r0
8000147c: 40a1 mov_s r0,r13
8000147e: 0e16 ffcf bl -492 ;80001290 <chunk_size>
set_chunk_size(h, lc, newsz);
80001482: 4102 mov_s r1,r16
chunksz_t newsz = chunk_size(h, lc) + chunk_size(h, rc);
80001484: 661e add_s r14,r14,r0
set_chunk_size(h, lc, newsz);
80001486: 40a1 mov_s r0,r13
80001488: 42c1 mov_s r2,r14
8000148a: 0e4a ffcf bl -440 ;800012d0 <set_chunk_siz
e>
return c + chunk_size(h, c);
8000148e: 41e1 mov_s r1,r15
80001490: 40a1 mov_s r0,r13
80001492: 0e02 ffcf bl -512 ;80001290 <chunk_size>
chunk_set(h, c, LEFT_SIZE, size);
80001496: 43c1 mov_s r3,r14
80001498: 6719 add_s r1,r15,r0
8000149a: 704c mov_s r2,0
8000149c: 40a1 mov_s r0,r13
}
8000149e: c2c8 leave_s [r13-r16,blink]
800014a0: 05d1 ffcf b -560 ;80001270 <chunk_set>
800014a4 <free_list_add>:
{
800014a4: c2e8 enter_s [r13-r16,blink]
return big_heap_chunks(h->end_chunk);
800014a6: 80e2 ld_s r15,[r0,0x8]
800014a8: 4628 mov_s r14,r1
800014aa: 4508 mov_s r13,r0
Note that as pointed out be @abrodkin the branch instruction at 0x800014a0 is in fact the non-delay-slot variant that does not execute the next instruction on taken branches (the other variant would disassemble as "b.d").
And if that's the case, then this bug is actually in qemu and not gcc. Though I guess that's still against the toolchain so it should stay open.
[1] Apparently ARC has... both? Not sure how that provides value since the whole idea behind delay slots is that they mean you don't need interlocks in the pipeline. Now ARC needs the pipeline handling and has extra instructions to decode?
[1] best of both worlds: performance and code density in case the cycle for the branch cannot be used with a branch delay slot instruction. Saves a nop instruction.