
crash in rpi_firmware_property

Open 1ace opened this issue 4 months ago • 8 comments

Describe the bug

(see below)

Steps to reproduce the behaviour

N/A, happens randomly

Device(s)

Raspberry Pi 3 Mod. B+

System

package 1.20250430

Logs

[  423.740983] ------------[ cut here ]------------
[  423.745642] Firmware transaction 0x00038042 timeout
[  423.745711] WARNING: CPU: 0 PID: 9 at drivers/firmware/raspberrypi.c:131 rpi_firmware_property_list+0x21c/0x288
[  423.760686] Modules linked in: nbd i2c_bcm2835 i2c_brcmstb snd_soc_hdmi_codec v3d drm_shmem_helper gpu_sched vc4 snd_soc_core snd_pcm_dmaengine snd_compress snd_pcm snd_timer snd drm_dma_helper drm_kms_helper cec drm_display_helper drm backlight drm_panel_orientation_quirks overlay
[  423.785726] CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 6.12.25-v8+ #1875
[  423.793123] Hardware name: Raspberry Pi 3 Model B Plus Rev 1.3 (DT)
[  423.799387] Workqueue: events_freezable mmc_rescan
[  423.804184] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  423.811146] pc : rpi_firmware_property_list+0x21c/0x288
[  423.816374] lr : rpi_firmware_property_list+0x21c/0x288
[  423.821601] sp : ffffffc080063bc0
[  423.824910] x29: ffffffc080063bc0 x28: ffffffe2a1f569c0 x27: ffffff8002b73740
[  423.832055] x26: 00000000ffffff92 x25: ffffffc08069e008 x24: 0000000000001000
[  423.839199] x23: ffffff8008ebc4c0 x22: ffffffe2a20ff490 x21: 0000000000000018
[  423.846343] x20: ffffff8002b73700 x19: ffffffc08069e000 x18: ffffffffffffffff
[  423.853488] x17: 0000000000000000 x16: 0000000000000000 x15: ffffffe2a2195028
[  423.860632] x14: 0000000000000008 x13: ffffffe2a219501a x12: 61736e6172742065
[  423.867776] x11: fffffffffffe0000 x10: ffffffe2a1fd6f40 x9 : ffffffe2a091c3b8
[  423.874920] x8 : 00000000ffffefff x7 : ffffffe2a1fd5398 x6 : 0000000000001ba8
[  423.882064] x5 : ffffff803b1673c8 x4 : 0000000000000000 x3 : 0000000000000027
[  423.889208] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff8002244200
[  423.896352] Call trace:
[  423.898792]  rpi_firmware_property_list+0x21c/0x288
[  423.903673]  rpi_firmware_property+0x78/0xc8
[  423.907945]  bcm2835_set_clock+0x88/0x238
[  423.911955]  bcm2835_set_ios+0x4c/0xb0
[  423.915704]  mmc_set_initial_state+0x90/0xa8
[  423.919974]  mmc_power_up.part.0+0x5c/0x180
[  423.924156]  mmc_rescan+0x184/0x330
[  423.927644]  process_one_work+0x15c/0x3c0
[  423.931657]  worker_thread+0x2e4/0x3f0
[  423.935406]  kthread+0x120/0x130
[  423.938634]  ret_from_fork+0x10/0x20
[  423.942211] ---[ end trace 0000000000000000 ]---

The rest of the log can be found in this CI job: https://gitlab.freedesktop.org/mesa/mesa/-/jobs/81777346#L2136

Additional context

No response

1ace avatar Aug 04 '25 16:08 1ace

The WARNING is a timeout. It could be that the firmware was too slow to respond, or it could be that the firmware has crashed.

After you've seen this WARNING, do any other mailbox calls work? E.g. does `vcgencmd version` respond?

popcornmix avatar Aug 04 '25 18:08 popcornmix

Ah you're right, it's a timeout not a crash, I misread/assumed from seeing too many of those and reading too fast 😅

Another one happened this morning again, in case having more samples helps: https://gitlab.freedesktop.org/mesa/mesa/-/jobs/81927348#L2124

This is in CI though, not an interactive session, so I can't try to run other commands when it seems to hang.
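
Since the same warning shows up in more than one CI job, a small filter can pull the firmware property tag out of each saved job log so the samples can be compared. This is only a minimal sketch: the function name `fw_timeout_tags` and the file name `job.log` are hypothetical, and it assumes the log contains the `Firmware transaction 0x... timeout` line exactly as printed in the trace above.

```shell
#!/bin/sh
# Sketch only: fw_timeout_tags is a hypothetical helper, not part of any tool.
# Reads a kernel/CI log on stdin and prints the property tag of every
# "Firmware transaction <tag> timeout" line it finds, one tag per line.
fw_timeout_tags() {
    sed -n 's/.*Firmware transaction \(0x[0-9a-fA-F]*\) timeout.*/\1/p'
}
```

For example, `fw_timeout_tags < job.log | sort | uniq -c` would show how often each tag timed out in one saved log.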

1ace avatar Aug 06 '25 07:08 1ace

> This is in CI though, not an interactive session, so I can't try to run other commands when it seems to hang.

Can you not ssh in after the failure? If the firmware has crashed, it will never respond until rebooted. Even connecting and running `vcgencmd version` the next day would be fine.

Also, knowing the firmware version would be useful (i.e. run the command before the crash and report the output), just in case it is unexpectedly outdated.

popcornmix avatar Aug 06 '25 09:08 popcornmix

> Can you not ssh in after the failure?

I could, but I would have to see it happen (which so far has been very rare) and be ready to log into the machine before the timeout hits and it gets powered off. Realistically that won't be possible. I could instead run it in a loop and be ready to connect when it happens, but I don't have time to do that right now and I don't know when I'll get to it. I'll try to remember to do it though 👍

As for the firmware version, it's the one in the 1.20250430 package (which btw is pretty old by now given the usual "1-2 release per month" before that, I don't know if you know why it's being delayed?)

1ace avatar Aug 06 '25 10:08 1ace

> As for the firmware version, it's the one in the 1.20250430 package (which btw is pretty old by now given the usual "1-2 release per month" before that, I don't know if you know why it's being delayed?)

We've had a long trend of moving functionality from the (closed source) firmware to (open source) kernel. That just means there is now less code churn in the firmware, so firmware updates are less common.

popcornmix avatar Aug 06 '25 10:08 popcornmix

Oh, I see, so it's kind of good news, except that until the upstream kernel is used by the official rpi distro, we have to continue testing on the downstream one :)

1ace avatar Aug 06 '25 11:08 1ace

I tried the latest commit (https://github.com/raspberrypi/firmware/archive/95be71b8c0f63f03dc06dd0e4c2e5535e6fb4a93.zip) and of the 7 jobs that I tried I already had one hang with the same message (https://gitlab.freedesktop.org/eric/mesa/-/jobs/81964342), so it's definitely still there. I wasn't around when it happened so I didn't get to connect this time, but when I have time I still plan on sitting, waiting for it to start hanging so that I can try what you said.

1ace avatar Aug 06 '25 16:08 1ace

Late answer, but I finally ran the `vcgencmd version` command as you asked. It works fine (and outputs `version a668b6e6edce3274de221324b93cb8741e4a7f7c (clean) (release) (start)`) right up until the kernel message above; after that, no userspace command executes anymore, including `vcgencmd version`.

(To be more specific, I ran `while sleep 10; do date; echo "ERIC: vcgencmd version"; ./vcgencmd version; done &` as a background loop before starting the tests. That is how I know that even a simple `date` doesn't run anymore: it's really a full kernel hang and not just the drm side.)
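
The background loop described above can also be written as a small reusable script. This is a rough sketch, not what was actually run: the names `heartbeat_once`, `heartbeat_loop`, `HEARTBEAT_CMD` and `HEARTBEAT_TIMEOUT` are made up for illustration, and wrapping the probe in coreutils `timeout` is an added assumption, so that a firmware call that blocks forever would show up as a failed heartbeat instead of silently stalling the loop.

```shell
#!/bin/sh
# Heartbeat sketch: periodically show that userspace and the firmware mailbox
# are both still alive. All names here are illustrative assumptions.

# Command used to probe the firmware (can be overridden, e.g. for testing).
HEARTBEAT_CMD="${HEARTBEAT_CMD:-vcgencmd version}"
# Seconds to wait for the probe before giving up (needs coreutils `timeout`).
HEARTBEAT_TIMEOUT="${HEARTBEAT_TIMEOUT:-5}"

# Run the probe once and print a single timestamped status line.
heartbeat_once() {
    # $HEARTBEAT_CMD is left unquoted on purpose so it splits into command + args.
    out=$(timeout "$HEARTBEAT_TIMEOUT" $HEARTBEAT_CMD 2>&1)
    rc=$?
    # Collapse multi-line output so each heartbeat stays on one log line.
    out=$(printf '%s' "$out" | tr '\n' ' ')
    now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    if [ "$rc" -eq 0 ]; then
        printf '%s HEARTBEAT ok: %s\n' "$now" "$out"
    else
        # GNU timeout exits with 124 when the probe did not finish in time.
        printf '%s HEARTBEAT FAILED (exit %s): %s\n' "$now" "$rc" "$out"
    fi
}

# Emit a heartbeat every $1 seconds (default 10) until the loop is killed.
heartbeat_loop() {
    while sleep "${1:-10}"; do
        heartbeat_once
    done
}
```

Starting `heartbeat_loop 10 &` before the tests would give the same signal as the one-liner: if heartbeats stop appearing altogether, userspace as a whole is no longer running, whereas a stream of `FAILED (exit 124)` lines would suggest that only the firmware call is stuck.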

1ace avatar Aug 20 '25 17:08 1ace