Kernel compiled with GCC13 exhibits `Unexpected kernel BRK exception at EL1` kernel panic with reference to `dwc_otg_hcd_init`
Describe the bug
As reported in the RaspberryMatic project (https://github.com/jens-maus/RaspberryMatic/issues/2780), the latest 6.6.31 (stable_20240529) rpi kernel seems to run into a kernel panic with Unexpected kernel BRK exception at EL1 and PC pointing to dwc_otg_hcd_init when the kernel is compiled with GCC v13.
Switching to dwc2 seems to solve the issue as well as downgrading GCC to version 12.3.0 or lower.
The following screenshot demonstrates the kernel panic:
Steps to reproduce the behaviour
The affected nightly snapshot of RaspberryMatic can be downloaded here: https://github.com/jens-maus/RaspberryMatic/releases/download/snapshots/RaspberryMatic-3.77.2.20240618-a0db30-rpi3.zip
When flashed to an SD card, the kernel crashes immediately after U-Boot outputs `Starting kernel...`, resulting in an endless loop of U-Boot booting the system over and over again.
Please note that you have to modify `cmdline.txt` to contain `console=tty1 loglevel=10` to actually see the kernel boot up and crash.
Device(s)
Raspberry Pi 3 Mod. B+
System
OS: RaspberryMatic 3.77.2.20240618
RPI Firmware: May 24 2024 15:31:28 (version 4942b7633c0ff1af1ee95a51a33b56a9dae47529 (clean) (release) (start))
kernel: Linux version 6.6.31 (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 3.7 5.7.20240601-33-g0864b00c9-dirty) 13.3.0, GNU ld (GNU Binutils) 2.41) #1 SMP PREEMPT Tue Jun 18 20:54:49 CEST 2024
Logs
see Screenshot
Additional context
Please note details in third-party ticket here: https://github.com/jens-maus/RaspberryMatic/issues/2780
Can you build the kernel with debug info, then post an annotated disassembly of the function in question? Comparing gcc 12 vs 13 would also be useful.
If getting debug info is a problem, I'm happy to have a go without...
If you could advise which debug info (kernel options) you would like enabled for more verbose output, I would be happy to provide it. Apart from that, my reference SD card image with the GCC 13 compiled kernel crashes reliably, as I outlined above. So reproducing this issue using that SD card image should be quite easy IMHO.
I think you need to unset CONFIG_DEBUG_INFO_NONE=y and set CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y.
If it's easier, just attach drivers/usb/host/dwc_otg/dwc_otg_hcd.o from your build tree.
> I think you need to unset CONFIG_DEBUG_INFO_NONE=y and set CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y.
I am unfortunately not near my Pi3 setup, so I currently cannot try to reproduce the issue with debug information enabled. I will do that as soon as I am back at my Pi3 environment.
> If it's easier, just attach `drivers/usb/host/dwc_otg/dwc_otg_hcd.o` from your build tree.
I just did a recompile. Here is the object file in a tar.gz archive: dwc_otg_hcd.o.tar.gz. Hopefully it proves usable until I can rebuild everything with kernel debug information enabled.
@pelwell Ok, I found someone who could re-test the crash with the re-compiled kernel with CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y enabled and CONFIG_DEBUG_INFO_NONE unset. See here the new/fresh crash screenshot:
However, I don't see any more detailed debug output in this image compared to my original one. It could be that my build environment (buildroot) strips some debug information, and I don't know yet how to prevent that. Nevertheless, perhaps this already helps you identify the root cause of why a GCC 13 compiled rpi kernel crashes in `dwc_otg_hcd_init` here.
That's good progress. However, the debug information is to allow us to correlate the assembly language in the object file with the C source code. Could you upload the newly compiled kernel somewhere we can get at it?
Just the vmlinuz/Image file or what do you need? I could also upload the whole sdcard image of the debug build if you want to reproduce the crash yourself with a pi3!
Let's start with just the Image/vmlinux.
Sorry - I think it's the vmlinux that we need. Image has had the debug info stripped.
> Sorry - I think it's the `vmlinux` that we need. `Image` has had the debug info stripped.
Ok, then here is the next try. However, as the attachment is about 140 MB, please download it from my nextcloud: [.. deleted ..]
Let me know when you have it so that I can remove it again.
Thanks - I've got it.
I'm struggling to understand exactly what is happening here, but here's what I think I know:
- The `brk 0x3e8` (1000) instructions are breakpoints to stop execution when in a debugger.
- The compiler seems to be inserting them at critical points as a kind of assertion.
- I think your build might have KASAN enabled, which would explain some of the additional checks that have been inserted.
- This is the disassembly of the area around the `brk` in question:
/home/damato/projekte/linux/RaspberryMatic/build-raspmatic_rpi3/build/linux-custom/drivers/usb/host/dwc_otg/dwc_otg_hcd.c:1042
ffffffc080a7eb2c: 2a1303e2 mov w2, w19
ffffffc080a7eb30: 52800001 mov w1, #0x0 // #0
ffffffc080a7eb34: 9400392f bl ffffffc080a8cff0 <DWC_MEMSET>
/home/damato/projekte/linux/RaspberryMatic/build-raspmatic_rpi3/build/linux-custom/drivers/usb/host/dwc_otg/dwc_otg_hcd.c:1045
ffffffc080a7eb38: f941d680 ldr x0, [x20, #936]
/home/damato/projekte/linux/RaspberryMatic/build-raspmatic_rpi3/build/linux-custom/drivers/usb/host/dwc_otg/dwc_otg_hcd.c:1048 (discriminator 1)
ffffffc080a7eb3c: 7100031f cmp w24, #0x0
/home/damato/projekte/linux/RaspberryMatic/build-raspmatic_rpi3/build/linux-custom/drivers/usb/host/dwc_otg/dwc_otg_hcd.c:1045 (discriminator 1)
ffffffc080a7eb40: b900001f str wzr, [x0]
/home/damato/projekte/linux/RaspberryMatic/build-raspmatic_rpi3/build/linux-custom/drivers/usb/host/dwc_otg/dwc_otg_hcd.c:1048 (discriminator 1)
ffffffc080a7eb44: 5400004d b.le ffffffc080a7eb4c <dwc_otg_hcd_init+0x1dc>
/home/damato/projekte/linux/RaspberryMatic/build-raspmatic_rpi3/build/linux-custom/drivers/usb/host/dwc_otg/dwc_otg_hcd.c:1049
ffffffc080a7eb48: d4207d00 brk #0x3e8
/home/damato/projekte/linux/RaspberryMatic/build-raspmatic_rpi3/build/linux-custom/drivers/usb/host/dwc_otg/dwc_otg_hcd.c:1051
ffffffc080a7eb4c: f941d693 ldr x19, [x20, #936]
ffffffc080a7eb50: 52800201 mov w1, #0x10 // #16
ffffffc080a7eb54: aa1703e0 mov x0, x23
ffffffc080a7eb58: 91018262 add x2, x19, #0x60
ffffffc080a7eb5c: 94003bd1 bl ffffffc080a8daa0 <__DWC_DMA_ALLOC_ATOMIC>
- The equivalent bit of C is:
1042: DWC_MEMSET(hcd->fiq_state, 0, (sizeof(struct fiq_state) + (sizeof(struct fiq_channel_state) * num_channels)));
1043:
1044: #ifdef CONFIG_ARM64
1045: spin_lock_init(&hcd->fiq_state->lock);
1046: #endif
1047:
1048: for (i = 0; i < num_channels; i++) {
1049: hcd->fiq_state->channel[i].fsm = FIQ_PASSTHROUGH;
1050: }
1051: hcd->fiq_state->dummy_send = DWC_DMA_ALLOC_ATOMIC(dev, 16,
&hcd->fiq_state->dummy_send_dma);
- From reading other parts of the disassembly and matching them up with the line numbers, I'm pretty certain that `w24` (the lower 32 bits of `x24`) is `num_channels`.
- The instruction at `ffffffc080a7eb44` is a conditional branch over the `brk`, which we would want to take. The condition flags that are tested would be those set by the `cmp` instruction at `ffffffc080a7eb3c` (the `str` instruction between them doesn't change the condition flags - it's the `spin_lock_init` on line 1045).
- It seems reasonable to me that the compiler, possibly instructed by KASAN, might have been able to insert a check for `num_channels` being negative, because that would lead to an overflow pretty quickly.
Here's where it gets weird...
- I can't find any evidence of the initialisation of the `.fsm` fields to `FIQ_PASSTHROUGH`. However, as the value of `FIQ_PASSTHROUGH` is 0, the `DWC_MEMSET(..., 0, ...)` above has already done that.
- If you look at the value of `x24` in your crash dump you'll see it is 8 (which is the correct value, and not negative), so why are we hitting the `brk`?
- The `b.le` is branch-if-less-than-or-equal-to. Reading it in conjunction with `cmp w24, #0x0` (with which it forms a pair), you get branch-if-w24-is-less-than-or-equal-to-0. Since `w24` is 8, it's no surprise that the branch wasn't taken.
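To make the control flow in the points above concrete, here is a minimal C model of the `cmp` / `b.le` / `brk` sequence. It is only a sketch: `reaches_brk` is a made-up helper name, and the function mirrors the three instructions rather than anything in the driver source. For context, GCC on AArch64 emits `brk #0x3e8` for `__builtin_trap()`, which fits the "kind of assertion" reading.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the instructions around the brk:
 *   cmp  w24, #0x0
 *   b.le <skip>       ; taken when w24 <= 0 (signed compare)
 *   brk  #0x3e8       ; reached only when w24 > 0
 *
 * reaches_brk() is a hypothetical helper: it returns true when
 * execution would fall through the b.le and hit the brk.
 */
static bool reaches_brk(int32_t num_channels)
{
	bool branch_taken = (num_channels <= 0); /* b.le after cmp w24, #0 */

	return !branch_taken;
}
```

With `num_channels` equal to 8, as in the crash dump, the model falls through to the `brk`; with 0 or a negative value it skips it. In other words, the trap fires exactly when the loop body would execute at least once.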
Why on earth is 8 an illegal value, but 0 (or less!) is acceptable? How many channels are there in the fiq_state struct?
struct fiq_state {
fiq_lock_t lock;
...
struct fiq_channel_state channel[0];
};
Oh, I see. The compiler/KASAN is objecting to the use of a zero-element array as a way of constructing a variable-length array.
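A small sketch of why a bounds-checking compiler could object: with a `[0]` declaration, the type itself promises zero elements, so any index at all is out of range as far as the declared bound is concerned. The `demo_` types below are stand-ins invented for illustration, not the driver's real definitions, and zero-length arrays are a GNU extension, so this assumes GCC or Clang.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the driver's types; only the layout matters. */
struct demo_channel_state {
	int fsm;
};

struct demo_fiq_state {
	int lock;
	struct demo_channel_state channel[0]; /* GNU zero-length array */
};

/*
 * Number of elements the declared type promises. For a [0] array the
 * member has size 0, so this evaluates to 0.
 */
static size_t declared_channel_count(void)
{
	return sizeof(((struct demo_fiq_state *)0)->channel) /
	       sizeof(struct demo_channel_state);
}
```

If the compiler takes that declared bound literally, `channel[i]` is invalid for every `i`, which would be consistent with a trap that is reached whenever the loop runs at least once.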
@P33M I think we should drop the FIQ_PASSTHROUGH loop here, possibly replacing it with an assertion that FIQ_PASSTHROUGH is 0.
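One way such an assertion could look, sketched in plain C11 so it compiles outside the kernel. The enum and struct here are hypothetical: only the fact that `FIQ_PASSTHROUGH` is 0 comes from this thread, and in-tree code would more likely use the kernel's `BUILD_BUG_ON()` than `static_assert`.

```c
#include <assert.h> /* static_assert (C11) */
#include <string.h>

/* Hypothetical enum: only FIQ_PASSTHROUGH == 0 is taken from the thread. */
enum demo_fsm_state {
	FIQ_PASSTHROUGH = 0,
	DEMO_OTHER_STATE /* placeholder, not a real driver state */
};

/* Fails the build if the "zeroed memory means passthrough" assumption breaks. */
static_assert(FIQ_PASSTHROUGH == 0,
	      "memset(0) must leave channels in FIQ_PASSTHROUGH");

struct demo_channel_state {
	enum demo_fsm_state fsm;
};

/* Zeroing the memory is then enough to initialise every channel. */
static void demo_init_channels(struct demo_channel_state *ch, size_t n)
{
	memset(ch, 0, n * sizeof(*ch));
}
```

The check costs nothing at run time, and it documents why the explicit loop is no longer needed.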
And while you're here, how functional is dwc_otg on arm64 now?
Of course, it's possible that the same kind of check is inserted elsewhere in code that can't simply be deleted. It may be that we have to declare the structure to be a sensible maximum size, then use something like offsetof(..->channel[num_channels]) to allocate just enough space for the used channels.
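A rough sketch of that allocation idea, using user-space `calloc` and invented `demo_` names. The maximum of 16 is an assumed bound chosen for illustration, and the struct layout is not the driver's.

```c
#include <stddef.h>
#include <stdlib.h>

#define DEMO_MAX_CHANNELS 16 /* assumed upper bound, for illustration only */

struct demo_channel_state {
	int fsm;
};

struct demo_fiq_state {
	int lock;
	struct demo_channel_state channel[DEMO_MAX_CHANNELS];
};

/* Bytes needed for the header plus the first num_channels entries. */
static size_t demo_fiq_state_size(size_t num_channels)
{
	return offsetof(struct demo_fiq_state, channel) +
	       num_channels * sizeof(struct demo_channel_state);
}

/* Allocates zeroed storage for just the channels in use; NULL on error. */
static struct demo_fiq_state *demo_fiq_state_alloc(size_t num_channels)
{
	if (num_channels > DEMO_MAX_CHANNELS)
		return NULL;
	return calloc(1, demo_fiq_state_size(num_channels));
}
```

The declared array bound is now a real number, so an index below it is not flagged by its type alone. The trade-off is that the object is smaller than `sizeof(struct demo_fiq_state)`, so code must never copy or zero the struct by its full size.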
This struct can be statically sized - the hardware supports a maximum of 16 channels. BCM283x is configured for 8. However, zero-length variable arrays are used elsewhere in the driver:
- dwc_otg_hcd.h: struct dwc_otg_hcd_urb
- dwc_otg_fiq_fsm.h: struct fiq_dma_blob, struct fiq_state
Those are likely to end up with similar assert-style panics.
Shouldn't we be using flexible array members (`[]`) rather than zero-length arrays (`[0]`)?
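For comparison, a minimal sketch of the `[]` form. The `demo_` names, the `num_channels` field, and the user-space `calloc` are assumptions for illustration; in kernel code the size would more likely be computed with the `struct_size()` helper.

```c
#include <stddef.h>
#include <stdlib.h>

struct demo_channel_state {
	int fsm;
};

struct demo_fiq_state {
	int lock;
	size_t num_channels; /* hypothetical field recording the real length */
	struct demo_channel_state channel[]; /* C99 flexible array member */
};

/* Allocates a zeroed header plus num_channels trailing entries. */
static struct demo_fiq_state *demo_fiq_state_alloc(size_t num_channels)
{
	struct demo_fiq_state *st;

	st = calloc(1, sizeof(*st) + num_channels * sizeof(st->channel[0]));
	if (st)
		st->num_channels = num_channels;
	return st;
}
```

A flexible array member tells the compiler the length is unknown rather than zero, so `channel[i]` is not out of bounds by declaration alone.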
https://www.phoronix.com/news/Linux-5.18-Flexible-Arrays
@P33M @pelwell Thanks for the PR #6252. Please note that I have tested it and commented on it accordingly as it mainly LGTM (cf. https://github.com/raspberrypi/linux/pull/6252#issuecomment-2213245724).
Sounds like this is resolved, so closing. Let us know if anything else is needed and we can reopen.