
Kernel compiled with GCC13 exhibits `Unexpected kernel BRK exception at EL1` kernel panic with reference to `dwc_otg_hcd_init`

Open jens-maus opened this issue 1 year ago • 16 comments

Describe the bug

As reported in the RaspberryMatic project (https://github.com/jens-maus/RaspberryMatic/issues/2780), the latest 6.6.31 (stable_20240529) rpi kernel seems to run into a kernel panic with Unexpected kernel BRK exception at EL1 and PC pointing to dwc_otg_hcd_init when the kernel is compiled with GCC v13.

Switching to dwc2 seems to solve the issue, as does downgrading GCC to version 12.3.0 or lower.

The following screenshot demonstrates the kernel panic:

BFDEF11E-8DBE-4686-8774-DF790F1CF040_1_105_c

Steps to reproduce the behaviour

The affected nightly snapshot of RaspberryMatic can be downloaded here: https://github.com/jens-maus/RaspberryMatic/releases/download/snapshots/RaspberryMatic-3.77.2.20240618-a0db30-rpi3.zip

When flashed to an SD card, the kernel crashes immediately after U-Boot outputs Starting kernel..., resulting in an endless loop of U-Boot booting the system over and over again.

Please note that you would have to modify cmdline.txt to contain console=tty1 loglevel=10 to actually see the kernel bootup and crash.

Device (s)

Raspberry Pi 3 Mod. B+

System

OS: RaspberryMatic 3.77.2.20240618
RPI Firmware: May 24 2024 15:31:28 (version 4942b7633c0ff1af1ee95a51a33b56a9dae47529 (clean) (release) (start))
kernel: Linux version 6.6.31 (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 3.7 5.7.20240601-33-g0864b00c9-dirty) 13.3.0, GNU ld (GNU Binutils) 2.41) #1 SMP PREEMPT Tue Jun 18 20:54:49 CEST 2024

Logs

see Screenshot

Additional context

Please note details in third-party ticket here: https://github.com/jens-maus/RaspberryMatic/issues/2780

jens-maus avatar Jun 18 '24 21:06 jens-maus

Can you build the kernel with debug info, then post an annotated disassembly of the function in question? Comparing gcc 12 vs 13 would also be useful.

P33M avatar Jun 24 '24 19:06 P33M

If getting debug info is a problem, I'm happy to have a go without...

pelwell avatar Jun 24 '24 20:06 pelwell

If you could advise which debug info (kernel options) you would like enabled for more verbose output, I would be happy to provide it. Apart from that, my reference SD card image with the GCC 13 compiled kernel happily crashes as I outlined above, so reproducing this issue using that SD card image should be quite easy IMHO.

jens-maus avatar Jun 25 '24 07:06 jens-maus

I think you need to unset CONFIG_DEBUG_INFO_NONE=y and set CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y.

pelwell avatar Jun 25 '24 12:06 pelwell

If it's easier, just attach drivers/usb/host/dwc_otg/dwc_otg_hcd.o from your build tree.

pelwell avatar Jun 27 '24 08:06 pelwell

> I think you need to unset CONFIG_DEBUG_INFO_NONE=y and set CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y.

I am unfortunately not near my Pi3 setup, so I currently cannot try to reproduce the issue with debug information enabled. I will try to do that as soon as I am back at my Pi3 environment.

> If it's easier, just attach drivers/usb/host/dwc_otg/dwc_otg_hcd.o from your build tree.

I just did a recompile. Here is the object file in a tar.gz archive: dwc_otg_hcd.o.tar.gz. Hopefully it proves usable until I can rebuild everything with kernel debug information enabled.

jens-maus avatar Jun 30 '24 15:06 jens-maus

@pelwell Ok, I found someone who could re-test the crash with the re-compiled kernel with CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y enabled and CONFIG_DEBUG_INFO_NONE unset. See here the new/fresh crash screenshot:

rpi3-crash-20240701

However, I don't see any more detailed debug output in this image compared to my original one. It could be that my build environment (buildroot) strips some debug information, and I don't know yet how to prevent that. Nevertheless, this may already help you identify the root cause of why a GCC 13 compiled rpi kernel crashes in dwc_otg_hcd_init here.

jens-maus avatar Jul 01 '24 08:07 jens-maus

That's good progress. However, the debug information is to allow us to correlate the assembly language in the object file with the C source code. Could you upload the newly compiled kernel somewhere we can get at it?

pelwell avatar Jul 01 '24 08:07 pelwell

Just the vmlinuz/Image file or what do you need? I could also upload the whole sdcard image of the debug build if you want to reproduce the crash yourself with a pi3!

jens-maus avatar Jul 01 '24 09:07 jens-maus

Let's start with just the Image/vmlinux.

pelwell avatar Jul 01 '24 09:07 pelwell

> Let's start with just the Image/vmlinux.

Here you go: Image.tar.gz

jens-maus avatar Jul 01 '24 09:07 jens-maus

Sorry - I think it's the vmlinux that we need. Image has had the debug info stripped.

pelwell avatar Jul 01 '24 09:07 pelwell

> Sorry - I think it's the vmlinux that we need. Image has had the debug info stripped.

Ok, then here is the next try. However, as the attachment is about 140 MB in size, please download it from my nextcloud: [.. deleted ..]

Let me know when you have it so that I can remove it again.

jens-maus avatar Jul 01 '24 10:07 jens-maus

Thanks - I've got it.

pelwell avatar Jul 01 '24 10:07 pelwell

I'm struggling to understand exactly what is happening here, but here's what I think I know:

  1. The brk 0x3e8 (1000) instructions are breakpoints to stop execution when in a debugger.
  2. The compiler seems to be inserting them at critical points as a kind of assertion.
  3. I think your build might have KASAN enabled, which would explain some of the additional checks that have been inserted.
  4. This is the disassembly of the area around the brk in question:
/home/damato/projekte/linux/RaspberryMatic/build-raspmatic_rpi3/build/linux-custom/drivers/usb/host/dwc_otg/dwc_otg_hcd.c:1042
ffffffc080a7eb2c:	2a1303e2 	mov	w2, w19
ffffffc080a7eb30:	52800001 	mov	w1, #0x0                   	// #0
ffffffc080a7eb34:	9400392f 	bl	ffffffc080a8cff0 <DWC_MEMSET>
/home/damato/projekte/linux/RaspberryMatic/build-raspmatic_rpi3/build/linux-custom/drivers/usb/host/dwc_otg/dwc_otg_hcd.c:1045
ffffffc080a7eb38:	f941d680 	ldr	x0, [x20, #936]
/home/damato/projekte/linux/RaspberryMatic/build-raspmatic_rpi3/build/linux-custom/drivers/usb/host/dwc_otg/dwc_otg_hcd.c:1048 (discriminator 1)
ffffffc080a7eb3c:	7100031f 	cmp	w24, #0x0
/home/damato/projekte/linux/RaspberryMatic/build-raspmatic_rpi3/build/linux-custom/drivers/usb/host/dwc_otg/dwc_otg_hcd.c:1045 (discriminator 1)
ffffffc080a7eb40:	b900001f 	str	wzr, [x0]
/home/damato/projekte/linux/RaspberryMatic/build-raspmatic_rpi3/build/linux-custom/drivers/usb/host/dwc_otg/dwc_otg_hcd.c:1048 (discriminator 1)
ffffffc080a7eb44:	5400004d 	b.le	ffffffc080a7eb4c <dwc_otg_hcd_init+0x1dc>
/home/damato/projekte/linux/RaspberryMatic/build-raspmatic_rpi3/build/linux-custom/drivers/usb/host/dwc_otg/dwc_otg_hcd.c:1049
ffffffc080a7eb48:	d4207d00 	brk	#0x3e8
/home/damato/projekte/linux/RaspberryMatic/build-raspmatic_rpi3/build/linux-custom/drivers/usb/host/dwc_otg/dwc_otg_hcd.c:1051
ffffffc080a7eb4c:	f941d693 	ldr	x19, [x20, #936]
ffffffc080a7eb50:	52800201 	mov	w1, #0x10                  	// #16
ffffffc080a7eb54:	aa1703e0 	mov	x0, x23
ffffffc080a7eb58:	91018262 	add	x2, x19, #0x60
ffffffc080a7eb5c:	94003bd1 	bl	ffffffc080a8daa0 <__DWC_DMA_ALLOC_ATOMIC>
  5. The equivalent bit of C is:
1042: 		DWC_MEMSET(hcd->fiq_state, 0, (sizeof(struct fiq_state) + (sizeof(struct fiq_channel_state) * num_channels)));
1043: 
1044: #ifdef CONFIG_ARM64
1045: 		spin_lock_init(&hcd->fiq_state->lock);
1046: #endif
1047: 
1048: 		for (i = 0; i < num_channels; i++) {
1049: 			hcd->fiq_state->channel[i].fsm = FIQ_PASSTHROUGH;
1050: 		}
1051: 		hcd->fiq_state->dummy_send = DWC_DMA_ALLOC_ATOMIC(dev, 16,
							 &hcd->fiq_state->dummy_send_dma);
  6. From reading other parts of the disassembly and matching them up with the line numbers, I'm pretty certain that w24 (the lower 32 bits of x24) is num_channels.
  7. The instruction at ffffffc080a7eb44 is a conditional branch over the brk, which we would want to take. The condition flags that are tested would be those set by the cmp instruction at ffffffc080a7eb3c (the str instruction between them doesn't change the condition flags - it's the spin_lock_init on line 1045).
  8. It seems reasonable to me that the compiler, possibly instructed by KASAN, might have been able to insert a check for num_channels being negative, because that would lead to an overflow pretty quickly.

Here's where it gets weird...

  1. I can't find any evidence of the initialisation of the .fsm fields to FIQ_PASSTHROUGH. However, as the value of FIQ_PASSTHROUGH is 0, the DWC_MEMSET(..., 0, ...) above has already done that.
  2. If you look at the value of x24 in your crash dump you'll see it is 8 (which is the correct value, and not negative), so why are we hitting the brk?
  3. The b.le is branch-if-less-than-or-equal-to. Reading it in conjunction with cmp w24, #0x0 (with which it forms a pair), you get branch-if-w24-is-less-than-or-equal-to-0. Since w24 is 8, it's no surprise that the branch wasn't taken.
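The pairing of cmp and b.le described above can be modelled in a few lines of C. This is only a sketch for illustration: the names (flags_sketch, cmp_w, cond_le, reaches_brk) are made up here and do not come from the driver or the kernel. It computes the N, Z, C and V flags the way a 32-bit AArch64 cmp (a flag-setting subtract) would, then evaluates the LE condition (Z set, or N different from V).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the AArch64 condition flags. */
struct flags_sketch {
    bool n; /* negative */
    bool z; /* zero */
    bool c; /* carry (no borrow on subtract) */
    bool v; /* signed overflow */
};

/* Model of "cmp wA, wB": a 32-bit subtract that only sets flags. */
static struct flags_sketch cmp_w(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t r = ua - ub;
    struct flags_sketch f;

    f.n = (r >> 31) != 0;
    f.z = (r == 0);
    f.c = (ua >= ub);
    /* Overflow: operands differ in sign and the result's sign differs from a. */
    f.v = (((ua ^ ub) & (ua ^ r)) >> 31) != 0;
    return f;
}

/* LE condition: signed less-than-or-equal. */
static bool cond_le(struct flags_sketch f)
{
    return f.z || (f.n != f.v);
}

/* True if execution would fall through the b.le and reach the brk. */
static bool reaches_brk(int32_t w24)
{
    return !cond_le(cmp_w(w24, 0));
}
```

With the value from the crash dump, reaches_brk(8) is true, while 0 and negative values branch over the brk, which matches the behaviour described above.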

Why on earth is 8 an illegal value, but 0 (or less!) is acceptable? How many channels are there in the fiq_state struct?

struct fiq_state {
	fiq_lock_t lock;
...
	struct fiq_channel_state channel[0];
};

Oh, I see. The compiler/KASAN is objecting to the use of a zero-element array as a way of constructing a variable-length array.

@P33M I think we should drop the FIQ_PASSTHROUGH loop here, possibly replacing it with an assertion that FIQ_PASSTHROUGH is 0.
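One way such an assertion could look is a compile-time check next to the zeroing code. The sketch below is hypothetical: fiq_fsm_sketch, FIQ_SKETCH_OTHER, channel_sketch and reset_channels are illustrative names, and only the fact that FIQ_PASSTHROUGH is 0 comes from this thread. It uses C11 _Static_assert; in kernel code, BUILD_BUG_ON() or static_assert() would be the more usual spelling.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the driver's FSM enum. */
enum fiq_fsm_sketch {
    FIQ_PASSTHROUGH = 0,
    FIQ_SKETCH_OTHER = 1
};

/* Hypothetical stand-in for struct fiq_channel_state. */
struct channel_sketch {
    enum fiq_fsm_sketch fsm;
    int other_state;
};

/* Fails the build if zero-filling would no longer mean "passthrough". */
_Static_assert(FIQ_PASSTHROUGH == 0,
               "zeroing the state relies on FIQ_PASSTHROUGH being 0");

/* Zero-fill n channels; every .fsm field ends up as FIQ_PASSTHROUGH. */
static void reset_channels(struct channel_sketch *ch, size_t n)
{
    memset(ch, 0, n * sizeof(*ch));
}
```

The per-channel loop then becomes unnecessary, and the assumption it used to make explicit is still checked at build time.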

And while you're here, how functional is dwc_otg on arm64 now?

pelwell avatar Jul 02 '24 20:07 pelwell

Of course, it's possible that the same kind of check is inserted elsewhere in code that can't simply be deleted. It may be that we have to declare the structure to be a sensible maximum size, then use something like offsetof(..->channel[num_channels]) to allocate just enough space for the used channels.
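A rough sketch of that idea follows, with assumptions stated up front: SKETCH_MAX_CHANNELS (16 here), the struct layouts and the helper names are all invented for illustration and are not the driver's definitions. The size is computed as the offset of the array plus n elements, which is the portable equivalent of offsetof(struct ..., channel[n]) (GCC also accepts the latter form with a runtime index).

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_MAX_CHANNELS 16 /* assumed upper bound, for illustration */

/* Hypothetical per-channel state. */
struct chan_state_sketch {
    int fsm;
    unsigned int scratch[4];
};

/* Declared at its maximum size so that indexing has a real bound. */
struct state_sketch {
    int lock;
    struct chan_state_sketch channel[SKETCH_MAX_CHANNELS];
};

/* Bytes needed for the header plus the first n channels. */
static size_t state_size(size_t n)
{
    return offsetof(struct state_sketch, channel) +
           n * sizeof(struct chan_state_sketch);
}

/* Allocate zeroed storage for just the channels in use. */
static struct state_sketch *state_alloc(size_t n)
{
    if (n > SKETCH_MAX_CHANNELS)
        return NULL;
    return calloc(1, state_size(n));
}
```

Note that the allocation is then smaller than sizeof(struct state_sketch), so code must never copy or zero the whole struct by its declared size; only the first n channels are backed by memory.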

pelwell avatar Jul 02 '24 21:07 pelwell

This struct can be statically sized - the hardware supports a maximum of 16 channels. BCM283x is configured for 8. However, zero-length variable arrays are used elsewhere in the driver:

dwc_otg_hcd.h:
struct dwc_otg_hcd_urb

dwc_otg_fiq_fsm.h:
struct fiq_dma_blob
struct fiq_state

Those are likely to end up with similar assert-style panics.

P33M avatar Jul 05 '24 09:07 P33M

Shouldn't we be using flexible length arrays ([]) rather than zero length ('[0]')? https://www.phoronix.com/news/Linux-5.18-Flexible-Arrays
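For comparison, a minimal sketch of the flexible-array form is shown below. The struct and function names are hypothetical, not the driver's. With channel[] the compiler knows the array has no fixed bound, rather than a bound of zero elements. In kernel code the size calculation would more likely use the struct_size() helper than the open-coded check here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-channel state. */
struct fam_chan_sketch {
    int fsm;
};

/* Header followed by a C99 flexible array member. */
struct fam_state_sketch {
    int lock;
    size_t num_channels;
    struct fam_chan_sketch channel[]; /* flexible, not [0] */
};

/* Allocate a zeroed header plus n trailing channel entries. */
static struct fam_state_sketch *fam_state_alloc(size_t n)
{
    struct fam_state_sketch *s;

    /* Guard against overflow in the size calculation. */
    if (n > (SIZE_MAX - sizeof(*s)) / sizeof(s->channel[0]))
        return NULL;

    s = calloc(1, sizeof(*s) + n * sizeof(s->channel[0]));
    if (s != NULL)
        s->num_channels = n;
    return s;
}
```

Storing num_channels next to the array keeps the bound available to any code that walks the channels.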

popcornmix avatar Jul 05 '24 11:07 popcornmix

@P33M @pelwell Thanks for the PR #6252. Please note that I have tested it and commented on it accordingly as it mainly LGTM (cf. https://github.com/raspberrypi/linux/pull/6252#issuecomment-2213245724).

jens-maus avatar Jul 08 '24 07:07 jens-maus

Sounds like this is resolved, so closing. Let us know if anything else is needed and we can reopen.

popcornmix avatar Jul 08 '24 12:07 popcornmix