Boot hang (with Firmware transaction timeout error) on Raspberry Pi 4B

Open wizeman opened this issue 3 years ago • 7 comments

I have a Raspberry Pi 4B (8 GB model) which hangs at boot with the following errors (apologies for the photo, but I don't have a serial console at hand):

IMG_20210722_180151

This happens 100% of the time with both mainline kernels 5.10.52 and 5.13.4.

The only connected peripherals are an SSD (which is the boot disk) connected over a Startech SATA-USB 3.1 adapter cable, an ethernet cable, HDMI adapter and the official Raspberry Pi power brick. No SD card or other USB devices (such as a keyboard) are connected.

I've tested disconnecting either the network cable or the HDMI adapter but it still hangs, 100% of the time (as far as I can tell - although I'm not 100% certain without the HDMI output).

Interestingly, I have another Raspberry Pi 4B which also has 8 GB of RAM and is the exact same board revision (0xd03114) which boots perfectly fine, 100% of the time, with the exact same power brick and peripherals attached (including the exact same USB disk with the exact same contents).

This would indicate that there is a hardware problem. However (and quite surprisingly!), Raspberry Pi's kernel 5.10.52 does boot without any issues, 100% of the time. I have also booted an official image of Raspberry Pi OS (raspbian?), which I assume uses Raspberry Pi's kernel, and it also booted many times without any problems.

Do you know if there is a quicker way to find out what's going on without having to bisect kernels (which would take a long time given the way kernels are built on NixOS) and without using a serial console?

wizeman avatar Jul 24 '21 19:07 wizeman

@wizeman A firmware transaction timeout indicates an issue with the VideoCore firmware, so I don't believe bisecting the kernel will help to narrow down this issue. I suggest enabling debug symbols for stack traces (CONFIG_KALLSYMS).

Another idea is to build the mainline kernel and use it to replace the kernel on a Raspberry Pi OS image, just to see if the issue is reproducible there.

At least it would be helpful to know the kernel config.

lategoodbye avatar Jul 24 '21 23:07 lategoodbye

Here's a stack trace with CONFIG_KALLSYMS enabled:

IMG_20210726_024800

Here's the full kernel config corresponding to the kernel in the above screenshot:

config.txt

@lategoodbye Note that my main testing setup is the NixOS minimal installation aarch64 image, which is almost 100% reproducible. My latest tests were with Raspberry Pi firmware release 1.20210527 (the latest stable release) and with the latest stable EEPROM image. The wireless firmware is older, since I haven't updated it.

Here's how it went:

  1. Booting installation image with mainline kernel 5.10.52 fails as above.
  2. Booting installation image with mainline kernel 5.13.4 fails as above.
  3. Installation image with Raspberry Pi kernel 5.10.52 boots successfully.

Note that the change from 1 -> 2 and from 2 -> 3 is literally a 1-line code change (to specify which kernel to use). Everything else in the installation image is exactly the same, as well as the hardware. However, some kernel options do change between kernels because of different upstream defaults, no-longer existing kernel config options or new kernel options, depending on which kernel is being used. All kernels are built from scratch by the NixOS build system (to ensure reproducibility).
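Since the kernel options differ between builds, one way to see exactly which ones changed is to diff the generated config files. The following is only a minimal sketch, not something used in this thread: `parse_kconfig` and `diff_kconfig` are hypothetical helper names, and the two config fragments are made up for illustration (they are not the real configs of any kernel mentioned here).

```python
import re

# Lines in a kernel .config look like "CONFIG_FOO=y" or "# CONFIG_FOO is not set".
_SET = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_[A-Za-z0-9_]+) is not set$")


def parse_kconfig(text):
    """Map option name -> value; explicitly unset options map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_kconfig(old, new):
    """Return {option: (old_value, new_value)} for differing options.

    None means the option does not appear in that config at all.
    """
    changed = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changed[name] = (old.get(name), new.get(name))
    return changed


# Made-up fragments, for illustration only.
config_a = parse_kconfig("""
CONFIG_KALLSYMS=y
CONFIG_GPIO_RASPBERRYPI_EXP=y
# CONFIG_HYPOTHETICAL_A is not set
""")
config_b = parse_kconfig("""
# CONFIG_KALLSYMS is not set
CONFIG_GPIO_RASPBERRYPI_EXP=y
CONFIG_HYPOTHETICAL_B=m
""")

for name, (before, after) in diff_kconfig(config_a, config_b).items():
    print(f"{name}: {before} -> {after}")
```

To compare real builds, the same two functions could be fed the contents of each kernel's config file instead of the inline strings.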

Let me know if any more info would be helpful.

Thanks!

wizeman avatar Jul 26 '21 01:07 wizeman

I've corrected my comments to indicate that I'm actually using 5.x.x kernels, not 4.x.x. Sorry for any possible confusion...

wizeman avatar Jul 26 '21 01:07 wizeman

I never worked with NixOS before. According to the RPi 4 instructions there are two possible images (generic or new kernel), which one do you use?

Does the issue also occur on SD card boot?

lategoodbye avatar Jul 26 '21 16:07 lategoodbye

> I never worked with NixOS before. According to the RPi 4 instructions there are two possible images (generic or new kernel), which one do you use?

I was using a customized image, which shares most of the configuration with NixOS's generic image, except it boots directly to the Linux kernel rather than using u-boot or the ARM stub. This is very similar to the configuration I use on my other Raspberry Pi 4s. First I tried using kernel 5.10.52 (which was working fine for me) but then I switched to 5.13.4 for debugging purposes. My kernel has a few config changes, most related to kernel hardening, which I've been using for years on other machines (including the other Raspberry Pis).

I've also tried downloading and booting NixOS's official generic aarch64 SD image (both on a USB disk and on an SD card) but it gets stuck on the rainbow screen even though the exact same disk media boots and works fine on my identical but known-to-be-good RPi, using the exact same peripherals (disk media, HDMI adapter, power brick and ethernet cable).

> Does the issue also occur on SD card boot?

Yes.

Ok, so I've been trying to debug this for days and this is what I found out:

  • First of all, I completely removed the Raspberry Pi wireless firmware package, because it was a bit old, hard to update and I didn't want to risk it interfering with the rest of the Raspberry Pi firmware or kernel. This should only cause Bluetooth and Wifi to stop working, which is fine for me because I'm not using either. This didn't cause any problems when booting the image on a known-to-be-good RPi, but it also didn't seem to help on the troublesome RPi (which is otherwise identical to the known-to-be-good one).

  • The raspberrypi-exp-gpio soc:firmware:gpio: Failed to get GPIO 1 config (-110 81) errors seem to disappear when the CONFIG_GPIO_RASPBERRYPI_EXP kernel config option is disabled.

This allows the kernel to continue booting instead of hanging.

However, the Firmware transaction timeout warning and stack trace still appear, and then the boot process gets stuck waiting for the root partition to appear because:

  1. The USB stack starts getting -110 errors as well, and no USB devices are detected, so when booting from a USB disk it doesn't become visible to the kernel.
  2. No SD card is detected, because apparently (and to me, unintuitively), the sdhci-iproc code (i.e. the emmc / SD card controller driver or what have you) only detects SD cards when CONFIG_GPIO_RASPBERRYPI_EXP is enabled.

I've verified that no SD card is detected when CONFIG_GPIO_RASPBERRYPI_EXP is disabled even on my known-to-be-good RPi, so it doesn't seem to be a problem specific to the troublesome one.

Note that, as long as CONFIG_GPIO_RASPBERRYPI_EXP is left enabled, all of the images I've built and booted either on a USB disk or on an SD card work perfectly fine on my known-to-be-good RPi, but the only images that work on my identical but troublesome RPi are the ones which have the Raspberry Pi kernel, for some mysterious reason.

  • I've installed this project, which is a UEFI firmware image for the RPi 4, on an otherwise freshly formatted SD card. The UEFI firmware boots and works fine on my known-to-be-good Raspberry Pi but it hangs on a black screen (just after the rainbow screen) on my identical but troublesome one.

  • I've also tried using the same boot config (except for some of my customized kernel config options) as the official NixOS aarch64 SD image, i.e. with u-boot and the ARM stub. This image also works fine on my known-to-be-good Raspberry Pi but also gets stuck on the rainbow screen on my identical but troublesome RPi.

  • I've also tried to reflash the EEPROM to factory defaults using the Raspberry Pi imager, but it doesn't seem to help.
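As a side note on the -110 values in the GPIO and USB errors above: kernel drivers usually report failures as negated errno numbers, and 110 is ETIMEDOUT in Linux's generic errno table, which matches the firmware timeout warning. The sketch below is purely illustrative; `decode_kernel_error` is a hypothetical helper, and the table is a small hand-copied subset hardcoded so that the result does not depend on the platform running it.

```python
# Hand-copied subset of Linux's generic errno values (an assumption of this
# sketch; Python's own errno module can differ between operating systems).
LINUX_ERRNO = {
    5: ("EIO", "I/O error"),
    22: ("EINVAL", "Invalid argument"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def decode_kernel_error(code):
    """Turn a negated errno such as -110 into a readable string."""
    number = abs(code)
    name, description = LINUX_ERRNO.get(number, ("UNKNOWN", "not in this table"))
    return f"-{number} = -{name} ({description})"


print(decode_kernel_error(-110))
```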

Since I'm spending way too much of my time on this, at this point I'm just about ready to give up and stop using this recently purchased Raspberry Pi, even though it works with the official Raspberry Pi kernel, because I specifically bought it assuming I would be able to use the mainline kernel just like my other Raspberry Pi 4s...

wizeman avatar Jul 29 '21 02:07 wizeman

FWIW here is a short analysis from my side. This firmware transaction timeout is a warning which should never happen (it means no response from the VideoCore mailbox after 1 second). In most cases it indicates a crash of the VideoCore firmware. So the mainline kernel is not to blame; it only triggers this issue for unknown reasons.
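The pattern behind such a warning can be sketched in userspace as "send a request, then wait for the reply with a deadline". This is only an analogy under stated assumptions, not the kernel's actual code: `transact`, `FirmwareTimeout`, `healthy_firmware` and `crashed_firmware` are all hypothetical names, and the only detail taken from the analysis above is the 1-second default deadline.

```python
import threading


class FirmwareTimeout(Exception):
    """Raised when no reply arrives before the deadline (hypothetical)."""


def transact(send_request, timeout_s=1.0):
    """Send a request and wait up to timeout_s seconds for the reply."""
    done = threading.Event()
    reply = {}

    def on_reply(value):
        # Called by the "firmware" side when it answers.
        reply["value"] = value
        done.set()

    send_request(on_reply)
    if not done.wait(timeout_s):
        # Nobody answered in time: the caller cannot tell why, only that
        # the other side went silent.
        raise FirmwareTimeout("Firmware transaction timeout")
    return reply["value"]


def healthy_firmware(on_reply):
    # Simulated firmware that answers shortly after the request.
    threading.Timer(0.01, on_reply, args=("ok",)).start()


def crashed_firmware(on_reply):
    # Simulated crashed firmware: the request is accepted, no reply follows.
    pass


print(transact(healthy_firmware))
try:
    transact(crashed_firmware, timeout_s=0.05)
except FirmwareTimeout as exc:
    print(exc)
```

The point of the sketch is that the waiting side only observes silence, which is why the timeout by itself does not say whether the request or the firmware was at fault.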

lategoodbye avatar Jul 29 '21 05:07 lategoodbye

I ended up buying a new, identical RPi, which doesn't have this problem anymore (just like my known-to-be-good one).

Eventually, I would like to bisect the kernels and see exactly which commit in the Raspberry Pi kernel seems to work around this issue, but this is very low priority for me and I'm not sure when I'll be able to do that.

Feel free to close this issue if you'd like.

Thanks!

wizeman avatar Jul 29 '21 17:07 wizeman