linux
linux copied to clipboard
Crash in i2c_bcm2835 in 4.10 due to NULL pointer dereference
After some time of uptime on the Raspberry Pi 3 we're seeing a crash with 4.10 in the i2c_bcm2835 driver. The RPI2 is stable. It happens even on a minimal install with no GUI running.
The crash I captured on the serial console is:
[18589.291098] i2c-bcm2835 3f805000.i2c: i2c transfer timed out [18600.555282] i2c-bcm2835 3f805000.i2c: i2c transfer timed out [18611.819194] i2c-bcm2835 3f805000.i2c: i2c transfer timed out [18623.083248] i2c-bcm2835 3f805000.i2c: i2c transfer timed out [18624.107255] i2c-bcm2835 3f805000.i2c: i2c transfer timed out [18635.371300] i2c-bcm2835 3f805000.i2c: i2c transfer timed out [26274.450689] Unable to handle kernel NULL pointer dereference at virtual address 00000002 [26274.461681] pgd = c0204000 [26274.465918] [00000002] *pgd=00000000 [26274.471137] Internal error: Oops: 5 [#1] SMP ARM [26274.477542] Modules linked in: i2c_dev ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_broute ebtable_nat ip6table_raw ip6table_mangle ip6table_security ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 iptable_raw iptable_mangle iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables ip_set nfnetlink bridge stp llc libcrc32c crc32_arm_ce smsc95xx usbnet mii vc4 drm_kms_helper drm fb_sys_fops syscopyarea dwc2 sysfillrect sysimgblt udc_core bcm2835_wdt bcm2835_rng bcm2835_dma leds_gpio mmc_block sdhci_iproc sdhci_pltfm sdhci mmc_core pwm_bcm2835 i2c_bcm2835 [last unloaded: ip6_tables] [26274.562827] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-0.rc8.git0.1.fc26.armv7hl #1 [26274.578954] Hardware name: Generic DT based system [26274.587990] task: c0dc7a80 task.stack: c0dbe000 [26274.596870] PC is at bcm2835_i2c_isr+0x34/0x168 [i2c_bcm2835] [26274.607158] LR is at __handle_irq_event_percpu+0xbc/0x208 [26274.617195] pc : [] lr : [ ] psr: 000f0193 [26274.617195] sp : c0dbfea8 ip : 00000015 fp : ffffe000 [26274.638400] r10: c0dbfeec r9 : ef084200 r8 : 00000000 [26274.648584] r7 : 00000014 r6 : 00000053 r5 : ef084200 r4 : ef3ee018 [26274.660256] r3 : 00000052 r2 : 00000000 r1 : ef3ee018 r0 : 00000053 [26274.672004] Flags: nzcv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none [26274.684649] Control: 10c5383d Table: 2d9d406a DAC: 00000051 [26274.695916] Process swapper/0 (pid: 0, stack limit = 0xc0dbe220) [26274.707595] Stack: (0xc0dbfea8 to 0xc0dc0000) [26274.717695] fea0: ee518fc0 ef084200 00000053 c02b0064 ef084200 c0dbfeec [26274.737688] fec0: ef084200 ef084200 ef084200 c0dbb298 00000014 ef02d880 c0dbe000 00000000 [26274.757801] fee0: 00000000 c02b01d0 ef084200 00000000 ef084200 ef084200 ef084260 c02b0258 [26274.778567] ff00: ef084200 ef084210 c0dbb298 c02b3bc0 c0dc4300 00000000 c0dbb298 c02af330 [26274.799527] ff20: c0dc4300 c055fe7c 00000000 c02af330 00000000 c02af914 c02108d8 600f0013 [26274.820453] ff40: ffffffff c0dbff84 c0dc2ddc c091c9cc 00000001 00000000 00000000 c0224360 [26274.841356] ff60: 00000000 00000000 ffffe000 c0dc2d74 c0dc2ddc 410fd034 00000000 00000000 [26274.862230] ff80: 00000000 c0dbffa0 c02108e8 c02108d8 600f0013 ffffffff 00000051 00000000 [26274.883081] ffa0: 00000000 c0299cc4 000000bb 00000001 c0dc2d40 ffffffff c0d35e30 c0299ff4 [26274.904270] ffc0: c0ed4cc0 c0c9ac38 ffffffff ffffffff 00000000 c0c9a690 00000000 c0d35e30 [26274.925889] ffe0: c0ed5694 c0dc2d5c c0d35e2c c0dc97cc 0020406a 0020807c 00000000 00000000 [26274.947902] [ ] (bcm2835_i2c_isr [i2c_bcm2835]) from [ ] (__handle_irq_event_percpu+0xbc/0x208) [26274.972652] [ ] (__handle_irq_event_percpu) from [ ] (handle_irq_event_percpu+0x20/0x5c) [26274.997275] [ ] (handle_irq_event_percpu) from [ ] (handle_irq_event+0x4c/0x70) [26275.021345] [ ] (handle_irq_event) from [ ] (handle_level_irq+0xf0/0x124) [26275.045175] [ ] (handle_level_irq) from [ ] (generic_handle_irq+0x20/0x30) [26275.069247] [ ] (generic_handle_irq) from [ ] (bcm2836_chained_handle_irq+0x38/0x40) [26275.094256] [ ] (bcm2836_chained_handle_irq) from [ ] (generic_handle_irq+0x20/0x30) [26275.119292] [ ] (generic_handle_irq) from [ ] (__handle_domain_irq+0xa8/0xbc) [26275.143715] [ ] (__handle_domain_irq) from [ ] (__irq_svc+0x6c/0x90) [26275.159678] [ ] (__irq_svc) from [ ] (arch_cpu_idle+0x24/0x40) [26275.175051] [ ] (arch_cpu_idle) from [ ] (do_idle+0xfc/0x1d8) [26275.190263] [ ] (do_idle) from [ ] (cpu_startup_entry+0x20/0x24) [26275.205692] [ ] (cpu_startup_entry) from [ ] (start_kernel+0x364/0x3dc) [26275.229119] Code: 1a000011 e3130002 0a00001c e5912290 (e1d220b2) [26275.250980] ---[ end trace be30787d8deafdda ]--- [26275.263100] Kernel panic - not syncing: Fatal exception in interrupt [26275.276975] CPU1: stopping [26275.286950] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G D 4.10.0-0.rc8.git0.1.fc26.armv7hl #1 [26275.310857] Hardware name: Generic DT based system [26275.322801] [ ] (unwind_backtrace) from [ ] (show_stack+0x18/0x1c) [26275.337806] [ ] (show_stack) from [ ] (dump_stack+0x80/0xa0) [26275.352203] [ ] (dump_stack) from [ ] (handle_IPI+0x158/0x2a4) [26275.366654] [ ] (handle_IPI) from [ ] (__irq_svc+0x6c/0x90) [26275.380719] [ ] (__irq_svc) from [ ] (arch_cpu_idle+0x24/0x40) [26275.394992] [ ] (arch_cpu_idle) from [ ] (do_idle+0xfc/0x1d8) [26275.409108] [ ] (do_idle) from [ ] (cpu_startup_entry+0x20/0x24) [26275.423439] [ ] (cpu_startup_entry) from [] (0x209a2c) [26275.436869] CPU2: stopping [26275.445968] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D 4.10.0-0.rc8.git0.1.fc26.armv7hl #1 [26275.468107] Hardware name: Generic DT based system [26275.479153] [ ] (unwind_backtrace) from [ ] (show_stack+0x18/0x1c) [26275.493255] [ ] (show_stack) from [ ] (dump_stack+0x80/0xa0) [26275.506750] [ ] (dump_stack) from [ ] (handle_IPI+0x158/0x2a4) [26275.520298] [ ] (handle_IPI) from [ ] (__irq_svc+0x6c/0x90) [26275.533469] [ ] (__irq_svc) from [ ] (arch_cpu_idle+0x24/0x40) [26275.546848] [ ] (arch_cpu_idle) from [ ] (do_idle+0xfc/0x1d8) [26275.560125] [ ] (do_idle) from [ ] (cpu_startup_entry+0x20/0x24) [26275.573677] [ ] (cpu_startup_entry) from [] (0x209a2c) [26275.586404] CPU3: stopping [26275.594848] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G D 4.10.0-0.rc8.git0.1.fc26.armv7hl #1 [26275.615761] Hardware name: Generic DT based system [26275.626320] [ ] (unwind_backtrace) from [ ] (show_stack+0x18/0x1c) [26275.640021] [ ] (show_stack) from [ ] (dump_stack+0x80/0xa0) [26275.653192] [ ] (dump_stack) from [ ] (handle_IPI+0x158/0x2a4) [26275.666501] [ ] (handle_IPI) from [ ] (__irq_svc+0x6c/0x90) [26275.679472] [ ] (__irq_svc) from [ ] (arch_cpu_idle+0x24/0x40) [26275.692761] [ ] (arch_cpu_idle) from [ ] (do_idle+0xfc/0x1d8) [26275.705926] [ ] (do_idle) from [ ] (cpu_startup_entry+0x20/0x24) [26275.719374] [ ] (cpu_startup_entry) from [] (0x209a2c) [26275.732064] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
If I boot with cpuidle.off=1 to see if it's cpu idle coming into play I get a very similar crash after about 30 mins of uptime too:
[ 1819.171351] Unable to handle kernel NULL pointer dereference at virtual address 00000002 [ 1819.179873] pgd = c0204000 [ 1819.182713] [00000002] *pgd=00000000 [ 1819.186475] Internal error: Oops: 5 [#1] SMP ARM [ 1819.191322] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6tab le_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c crc32_arm_ce iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip 6table_filter ip6_tables smsc95xx usbnet mii vc4 drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt dwc2 udc_core bcm2835_wdt bcm2835_rng bcm2835_dma leds_gpio i2c_dev mmc_block sdhci_iproc sdhci_p ltfm sdhci mmc_core pwm_bcm2835 i2c_bcm2835 [ 1819.253338] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-0.rc8.git0.1.fc26.armv7hl #1 [ 1819.261928] Hardware name: Generic DT based system [ 1819.266958] task: c0dc7a80 task.stack: c0dbe000 [ 1819.271725] PC is at bcm2835_i2c_isr+0x34/0x168 [i2c_bcm2835] [ 1819.283681] LR is at __handle_irq_event_percpu+0xbc/0x208 [ 1819.295326] pc : [] lr : [ ] psr: 000f0193 [ 1819.295326] sp : c0dbfe18 ip : 00000015 fp : ffffe000 [ 1819.319345] r10: c0dbfe5c r9 : ef084200 r8 : 00000000 [ 1819.330647] r7 : 00000014 r6 : 00000053 r5 : ef084200 r4 : ee932818 [ 1819.343199] r3 : 00000052 r2 : 00000000 r1 : ee932818 r0 : 00000053 [ 1819.355637] Flags: nzcv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none [ 1819.368883] Control: 10c5383d Table: 2d0b806a DAC: 00000051 [ 1819.380618] Process swapper/0 (pid: 0, stack limit = 0xc0dbe220) [ 1819.392596] Stack: (0xc0dbfe18 to 0xc0dc0000) [ 1819.402860] fe00: eeb58900 ef084200 [ 1819.422574] fe20: 00000053 c02b0064 ef084200 c0dbfe5c ef084200 ef084200 ef084200 c0dbb298 [ 1819.442330] fe40: 00000014 ef02d880 c0dbe000 c09178dc c0dbff4c c02b01d0 ef084200 00000000 [ 1819.462737] fe60: ef084200 ef084200 ef084260 c02b0258 ef084200 ef084210 c0dbb298 c02b3bc0 [ 1819.483322] fe80: c0dc4300 00000000 c0dbb298 c02af330 c0dc4300 c055fe7c 00000000 c02af330 [ 1819.503893] fea0: 00000000 c02af914 c027b4f8 600f0013 ffffffff c0dbfef4 ef7010c0 c091c9cc [ 1819.524449] fec0: ef303a00 00000002 c0dbe000 0000ed88 ef303a00 ee93cc40 c0dbd0c0 c0dc7a80 [ 1819.544987] fee0: ef7010c0 2e944000 c09178dc c0dbff4c 00000000 c0dbff10 c09178dc c027b4f8 [ 1819.565526] ff00: 600f0013 ffffffff 00000051 00000000 00000001 c0275790 ffffffff 00000000 [ 1819.586413] ff20: 00000002 00000002 eeb8a018 c0dc7a80 ef303a00 ee93cc40 00000000 ef7010c0 [ 1819.607730] ff40: ef7010c0 c09261e0 c0dbffa4 c09178dc 000001a7 c0dc2ddc 410fd034 00000000 [ 1819.629411] ff60: 00000000 c02d3e74 c0917aac 00000000 8f0ef8f4 000001a7 ef6fdb00 8b14da52 [ 1819.651639] ff80: 000001a7 ffffe000 00000001 c0dc2d40 ffffffff c0d35e30 410fd034 00000000 [ 1819.674348] ffa0: c0dbffb4 c0917aac 000000bb 00000001 c0dbffbc c0917df0 00000000 c0299ff4 [ 1819.697300] ffc0: c0ed4cc0 c0c9ac38 ffffffff ffffffff 00000000 c0c9a690 00000000 c0d35e30 [ 1819.720615] ffe0: c0ed5694 c0dc2d5c c0d35e2c c0dc97cc 0020406a 0020807c 00000000 00000000 [ 1819.744346] [ ] (bcm2835_i2c_isr [i2c_bcm2835]) from [ ] (__handle_irq_event_percpu+0xbc/0x208) [ 1819.770266] [ ] (__handle_irq_event_percpu) from [ ] (handle_irq_event_percpu+0x20/0x5c) [ 1819.795564] [ ] (handle_irq_event_percpu) from [ ] (handle_irq_event+0x4c/0x70) [ 1819.820056] [ ] (handle_irq_event) from [ ] (handle_level_irq+0xf0/0x124) [ 1819.844022] [ ] (handle_level_irq) from [ ] (generic_handle_irq+0x20/0x30) [ 1819.868100] [ ] (generic_handle_irq) from [ ] (bcm2836_chained_handle_irq+0x38/0x40) [ 1819.893115] [ ] (bcm2836_chained_handle_irq) from [ ] (generic_handle_irq+0x20/0x30) [ 1819.918149] [ ] (generic_handle_irq) from [ ] (__handle_domain_irq+0xa8/0xbc) [ 1819.942571] [ ] (__handle_domain_irq) from [ ] (__irq_svc+0x6c/0x90) [ 1819.958545] [ ] (__irq_svc) from [ ] (finish_task_switch+0x104/0x21c) [ 1819.974554] [ ] (finish_task_switch) from [ ] (__schedule+0x614/0x74c) [ 1819.998167] [ ] (__schedule) from [ ] (schedule+0x98/0xbc) [ 1820.013104] [ ] (schedule) from [ ] (schedule_preempt_disabled+0x14/0x18) [ 1820.036858] [ ] (schedule_preempt_disabled) from [ ] (cpu_startup_entry+0x20/0x24) [ 1820.061448] [ ] (cpu_startup_entry) from [ ] (start_kernel+0x364/0x3dc) [ 1820.085057] Code: 1a000011 e3130002 0a00001c e5912290 (e1d220b2) [ 1820.107028] ---[ end trace b95fea60a12456d3 ]--- [ 1820.119235] Kernel panic - not syncing: Fatal exception in interrupt [ 1820.133194] CPU3: stopping [ 1820.143265] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G D 4.10.0-0.rc8.git0.1.fc26.armv7hl #1 [ 1820.167332] Hardware name: Generic DT based system [ 1820.179347] [ ] (unwind_backtrace) from [ ] (show_stack+0x18/0x1c) [ 1820.194425] [ ] (show_stack) from [ ] (dump_stack+0x80/0xa0) [ 1820.208896] [ ] (dump_stack) from [ ] (handle_IPI+0x158/0x2a4) [ 1820.223416] [ ] (handle_IPI) from [ ] (__irq_svc+0x6c/0x90) [ 1820.237567] [ ] (__irq_svc) from [ ] (arch_cpu_idle+0x24/0x40) [ 1820.251917] [ ] (arch_cpu_idle) from [ ] (do_idle+0xfc/0x1d8) [ 1820.266102] [ ] (do_idle) from [ ] (cpu_startup_entry+0x20/0x24) [ 1820.280493] [ ] (cpu_startup_entry) from [] (0x209a2c) [ 1820.293997] CPU2: stopping [ 1820.303163] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D 4.10.0-0.rc8.git0.1.fc26.armv7hl #1 [ 1820.325433] Hardware name: Generic DT based system [ 1820.336563] [ ] (unwind_backtrace) from [ ] (show_stack+0x18/0x1c) [ 1820.350811] [ ] (show_stack) from [ ] (dump_stack+0x80/0xa0) [ 1820.364489] [ ] (dump_stack) from [ ] (handle_IPI+0x158/0x2a4) [ 1820.378227] [ ] (handle_IPI) from [ ] (__irq_svc+0x6c/0x90) [ 1820.391621] [ ] (__irq_svc) from [ ] (arch_cpu_idle+0x24/0x40) [ 1820.405268] [ ] (arch_cpu_idle) from [ ] (do_idle+0xfc/0x1d8) [ 1820.418749] [ ] (do_idle) from [ ] (cpu_startup_entry+0x20/0x24) [ 1820.432444] [ ] (cpu_startup_entry) from [] (0x209a2c) [ 1820.445238] CPU1: stopping [ 1820.453696] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G D 4.10.0-0.rc8.git0.1.fc26.armv7hl #1 [ 1820.474640] Hardware name: Generic DT based system [ 1820.485210] [ ] (unwind_backtrace) from [ ] (show_stack+0x18/0x1c) [ 1820.498929] [ ] (show_stack) from [ ] (dump_stack+0x80/0xa0) [ 1820.512112] [ ] (dump_stack) from [ ] (handle_IPI+0x158/0x2a4) [ 1820.525440] [ ] (handle_IPI) from [ ] (__irq_svc+0x6c/0x90) [ 1820.538431] [ ] (__irq_svc) from [ ] (arch_cpu_idle+0x24/0x40) [ 1820.551731] [ ] (arch_cpu_idle) from [ ] (do_idle+0xfc/0x1d8) [ 1820.564903] [ ] (do_idle) from [ ] (cpu_startup_entry+0x20/0x24) [ 1820.578345] [ ] (cpu_startup_entry) from [] (0x209a2c) [ 1820.591031] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
Hi Peter, recently i posted an untested (i don't have a RPi 3) fix for this on linux-rpi-kernel. Unfortunately the message got blocked because of a suspicious header. Btw this issue is directly related to the "i2c-bcm2835 3f805000.i2c: i2c transfer timed out" messages.
Could you please try this patch?
Building a test kernel with it now
So the "i2c-bcm2835 3f805000.i2c: i2c transfer timed out" errors seem to be gone but I get this crash on boot (this is a minimal install so no X/Wayland GUI). The system is still usable. The only difference is the patch.
[ 52.462794] ------------[ cut here ]------------ [ 52.471797] ------------[ cut here ]------------ [ 52.471829] WARNING: CPU: 1 PID: 841 at lib/dma-debug.c:612 debug_dma_assert_idle+0x148/0x1e0 [ 52.471838] vc4-drm soc:gpu: DMA-API: cpu touching an active dma mapped cacheline [cln=0x00c421c0] [ 52.471842] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_raw ip6table_mangle ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_security iptable_raw iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c crc32_arm_ce iptable_security ebtable_filter ebtables ip6table_filter ip6_tables smsc95xx usbnet mii vc4 drm_kms_helper drm fb_sys_fops syscopyarea dwc2 sysfillrect sysimgblt udc_core leds_gpio bcm2835_dma bcm2835_rng bcm2835_wdt mmc_block sdhci_iproc sdhci_pltfm sdhci mmc_core pwm_bcm2835 i2c_bcm2835 i2c_dev [ 52.472058] CPU: 1 PID: 841 Comm: systemd-udevd Not tainted 4.10.0-0.rc8.git0.1.rpi1.fc25.armv7hl #1 [ 52.472062] Hardware name: Generic DT based system [ 52.472092] [] (unwind_backtrace) from [ ] (show_stack+0x18/0x1c) [ 52.472109] [ ] (show_stack) from [ ] (dump_stack+0x9c/0xd0) [ 52.472129] [ ] (dump_stack) from [ ] (__warn+0xe4/0x104) [ 52.472148] [ ] (__warn) from [ ] (warn_slowpath_fmt+0x3c/0x4c) [ 52.472167] [ ] (warn_slowpath_fmt) from [ ] (debug_dma_assert_idle+0x148/0x1e0) [ 52.472186] [ ] (debug_dma_assert_idle) from [ ] (wp_page_copy+0xfc/0x644) [ 52.472201] [ ] (wp_page_copy) from [ ] (do_wp_page+0x3a4/0x5ac) [ 52.472214] [ ] (do_wp_page) from [ ] (handle_mm_fault+0xd78/0xe3c) [ 52.472232] [ ] (handle_mm_fault) from [ ] (do_page_fault+0x240/0x3b4) [ 52.472250] [ ] (do_page_fault) from [ ] (do_DataAbort+0x40/0xc0) [ 52.472265] [ ] (do_DataAbort) from [ ] (__dabt_usr+0x40/0x60) [ 52.472271] Exception stack(0xec53ffb0 to 0xec53fff8) [ 52.472282] ffa0: 8142a9d8 81423580 81423660 00000000 [ 52.472296] ffc0: 8142a9d8 bed507f4 bed507f0 b6f4ac88 b6f4ac88 8142c2b8 8142bf50 8142c2b8 [ 52.472307] ffe0: 00000000 bed507e0 7f5ab3fc 7f5e05a4 a00f0010 ffffffff [ 52.472401] ---[ end trace e2930fa58afe35f3 ]--- [ 52.472405] Mapped at: [ 52.472482] vc4_bo_create+0x44/0x108 [vc4] [ 52.472546] vc4_overflow_mem_work+0x20/0x170 [vc4] [ 52.472555] process_one_work+0x324/0x84c [ 52.472563] worker_thread+0x294/0x3f4 [ 52.472572] kthread+0x130/0x14c [ 52.896654] WARNING: CPU: 3 PID: 465 at lib/dma-debug.c:612 debug_dma_assert_idle+0x148/0x1e0 [ 52.917240] vc4-drm soc:gpu: DMA-API: cpu touching an active dma mapped cacheline [cln=0x00c421c0] [ 52.938505] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_raw ip6table_mangle ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_security iptable_raw iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c crc32_arm_ce iptable_security ebtable_filter ebtables ip6table_filter ip6_tables smsc95xx usbnet mii vc4 drm_kms_helper drm fb_sys_fops syscopyarea dwc2 sysfillrect sysimgblt udc_core leds_gpio bcm2835_dma bcm2835_rng bcm2835_wdt mmc_block sdhci_iproc sdhci_pltfm sdhci mmc_core pwm_bcm2835 i2c_bcm2835 i2c_dev [ 53.050163] CPU: 3 PID: 465 Comm: systemd-udevd Tainted: G W 4.10.0-0.rc8.git0.1.rpi1.fc25.armv7hl #1 [ 53.073586] Hardware name: Generic DT based system [ 53.084817] [ ] (unwind_backtrace) from [ ] (show_stack+0x18/0x1c) [ 53.099196] [ ] (show_stack) from [ ] (dump_stack+0x9c/0xd0) [ 53.113038] [ ] (dump_stack) from [ ] (__warn+0xe4/0x104) [ 53.126566] [ ] (__warn) from [ ] (warn_slowpath_fmt+0x3c/0x4c) [ 53.140765] [ ] (warn_slowpath_fmt) from [ ] (debug_dma_assert_idle+0x148/0x1e0) [ 53.162452] [ ] (debug_dma_assert_idle) from [ ] (wp_page_copy+0xfc/0x644) [ 53.184062] [ ] (wp_page_copy) from [ ] (do_wp_page+0x3a4/0x5ac) [ 53.198517] [ ] (do_wp_page) from [ ] (handle_mm_fault+0xd78/0xe3c) [ 53.213312] [ ] (handle_mm_fault) from [ ] (do_page_fault+0x240/0x3b4) [ 53.235632] [ ] (do_page_fault) from [ ] (do_DataAbort+0x40/0xc0) [ 53.250651] [ ] (do_DataAbort) from [ ] (__dabt_usr+0x40/0x60) [ 53.265603] Exception stack(0xc2291fb0 to 0xc2291ff8) [ 53.277933] 1fa0: 8142a518 00000009 00000024 00000000 [ 53.300414] 1fc0: b6df17a4 00000020 00020338 7f5abb04 00000000 8142a518 813d5cf4 8142c2b8 [ 53.322682] 1fe0: bed507bc bed507f0 00000003 b6d0f184 20070010 ffffffff [ 53.336720] ---[ end trace e2930fa58afe35f4 ]--- [ 53.348371] Mapped at: [ 53.357587] vc4_bo_create+0x44/0x108 [vc4] [ 53.368574] vc4_overflow_mem_work+0x20/0x170 [vc4] [ 53.380095] process_one_work+0x324/0x84c [ 53.390631] worker_thread+0x294/0x3f4 [ 53.400620] kthread+0x130/0x14c
And after a while of running I've got this one (which I expected given the warning in the patch) but posting it for reference:
[ 2168.372043] ------------[ cut here ]------------ [ 2168.382726] WARNING: CPU: 0 PID: 317 at drivers/i2c/busses/i2c-bcm2835.c:199 bcm2835_i2c_isr+0x80/0x190 [i2c_bcm2835] [ 2168.405576] Got interrupt, but curr_msg is NULL [ 2168.416141] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_raw ip6table_mangle ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_security iptable_raw iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c crc32_arm_ce iptable_security ebtable_filter ebtables ip6table_filter ip6_tables smsc95xx usbnet mii vc4 drm_kms_helper drm fb_sys_fops syscopyarea dwc2 sysfillrect sysimgblt udc_core leds_gpio bcm2835_dma bcm2835_rng bcm2835_wdt mmc_block sdhci_iproc sdhci_pltfm sdhci mmc_core pwm_bcm2835 i2c_bcm2835 i2c_dev [ 2168.526466] CPU: 0 PID: 317 Comm: kworker/0:2 Tainted: G W 4.10.0-0.rc8.git0.1.rpi1.fc25.armv7hl #1 [ 2168.549331] Hardware name: Generic DT based system [ 2168.560469] Workqueue: events output_poll_execute [drm_kms_helper] [ 2168.572952] [] (unwind_backtrace) from [ ] (show_stack+0x18/0x1c) [ 2168.587086] [ ] (show_stack) from [ ] (dump_stack+0x9c/0xd0) [ 2168.600643] [ ] (dump_stack) from [ ] (__warn+0xe4/0x104) [ 2168.613905] [ ] (__warn) from [ ] (warn_slowpath_fmt+0x3c/0x4c) [ 2168.627668] [ ] (warn_slowpath_fmt) from [ ] (bcm2835_i2c_isr+0x80/0x190 [i2c_bcm2835]) [ 2168.649538] [ ] (bcm2835_i2c_isr [i2c_bcm2835]) from [ ] (__handle_irq_event_percpu+0x118/0x4a8) [ 2168.672485] [ ] (__handle_irq_event_percpu) from [ ] (handle_irq_event_percpu+0x20/0x5c) [ 2168.695293] [ ] (handle_irq_event_percpu) from [ ] (handle_irq_event+0x40/0x64) [ 2168.717857] [ ] (handle_irq_event) from [ ] (handle_level_irq+0xe4/0x118) [ 2168.740506] [ ] (handle_level_irq) from [ ] (generic_handle_irq+0x20/0x30) [ 2168.763426] [ ] (generic_handle_irq) from [ ] (bcm2836_chained_handle_irq+0x3c/0x44) [ 2168.787525] [ ] (bcm2836_chained_handle_irq) from [ ] (generic_handle_irq+0x20/0x30) [ 2168.811673] [ ] (generic_handle_irq) from [ ] (__handle_domain_irq+0xa8/0xbc) [ 2168.835207] [ ] (__handle_domain_irq) from [ ] (__irq_svc+0x70/0x98) [ 2168.850739] [ ] (__irq_svc) from [ ] (_raw_spin_unlock_irqrestore+0x60/0x64) [ 2168.874053] [ ] (_raw_spin_unlock_irqrestore) from [ ] (__debug_object_init+0x2ac/0x374) [ 2168.898587] [ ] (__debug_object_init) from [ ] (init_timer_on_stack_key+0x24/0x5c) [ 2168.922919] [ ] (init_timer_on_stack_key) from [ ] (schedule_timeout+0x84/0x4e0) [ 2168.947281] [ ] (schedule_timeout) from [ ] (wait_for_common+0x10c/0x1a4) [ 2168.971395] [ ] (wait_for_common) from [ ] (bcm2835_i2c_xfer+0x104/0x190 [i2c_bcm2835]) [ 2168.996971] [ ] (bcm2835_i2c_xfer [i2c_bcm2835]) from [ ] (__i2c_transfer+0x3dc/0x9cc) [ 2169.022448] [ ] (__i2c_transfer) from [ ] (i2c_transfer+0x90/0xd8) [ 2169.038630] [ ] (i2c_transfer) from [ ] (drm_do_probe_ddc_edid+0xa8/0xf8 [drm]) [ 2169.063870] [ ] (drm_do_probe_ddc_edid [drm]) from [ ] (drm_probe_ddc+0x1c/0x2c [drm]) [ 2169.089550] [ ] (drm_probe_ddc [drm]) from [ ] (vc4_hdmi_connector_detect+0x4c/0x74 [vc4]) [ 2169.115441] [ ] (vc4_hdmi_connector_detect [vc4]) from [ ] (output_poll_execute+0xfc/0x1a4 [drm_kms_helper]) [ 2169.142904] [ ] (output_poll_execute [drm_kms_helper]) from [ ] (process_one_work+0x324/0x84c) [ 2169.168964] [ ] (process_one_work) from [ ] (worker_thread+0x294/0x3f4) [ 2169.192909] [ ] (worker_thread) from [ ] (kthread+0x130/0x14c) [ 2169.208354] [ ] (kthread) from [ ] (ret_from_fork+0x14/0x24) [ 2169.223530] ---[ end trace e2930fa58afe35f5 ]---
This patch doesn't fix the timeout issue. I've heard they aren't reliable to reproduce.
Could you please attach your kernel config? Is this issue reproducible with a stock kernel?
@notro @msperl What's your opinion about my workaround patch above?
@lategoodbye define a "stock kernel". This is the Fedora kernel which is pretty close to upstream vanilla 4.10 linus kernel. We carry a couple of patches for the RPi but nothing around i2c and only a single patch for vc4 (to fix a OOPs which anolt has already sent for upstream).
The kernel config is currently https://pbrobinson.fedorapeople.org/config-4.10.0-0.rc8.git0.1.fc26.armv7hl
This config works fine with <= 4.9 and has only appeared in 4.10. I did pull back some of the i2c patches to 4.9 to try and fix the timeout issue but we saw the same lock ups on the RPi3 so I dropped them as it was the quickest way to move it back to stability.
In terms of reproducible, it occurs fairly regularly on a Fedora 25 minimal (or GUI based) install with the 4.10 kernel. In the case of a minimal install it's basically just sitting there doing nothing. With that it'll basically crash within 2 hours, generally a lot less and often within a few minutes.
With the i2c patches I back ported to 4.9 (https://pbrobinson.fedorapeople.org/bcm283x-fixes-i2c.patch) we had dozens of reports of issues so I would say at least for Fedora it's very reproducible.
What is the procedure under which this happens - i.e. what is the driver that is producing this i2c message? Can the issue be reproduced with an upstream kernel?
What strikes me as strange is that curr_msg is never set to NULL in the code so besides when initializing the I2C driver Looks to me as if it is some memory corruption - a use after free? Or the driver is never used and hence bcm2835_i2c_xfer is never called and still the interrupt is triggered.
Does the 4.8 version of the driver work on 4.10 without any issues when you replace the 4.10 version?
Well the difference between 4.9 and 4.10 are the following patches. If you revert them I fully expect it to operate fine. When I applied these patches to 4.9 we did see the issue on 4.9:
9446f62e8e18057fceb179d947508df2f7253b26 i2c: bcm2835: Add support for dynamic clock e13e19e12f66401ce1e21ab491715835852babe7 i2c: bcm2835: Support i2c-dev ioctl I2C_TIMEOUT ee05fea21b017b81a9477569b6a0c2d8e2946ac9 i2c: bcm2835: Add support for Repeated Start Condition 8d2cc5cc6ee5c0fc48a96bb29af55fc700f66fdf i2c: bcm2835: Can't support I2C_M_IGNORE_NAK 23c9540b3ad1d7473fe40df80074d0fb0bf04869 i2c: bcm2835: Use dev_dbg logging on transfer errors d4030d75c7cbb434b2a3e5f6af5065879d2615a5 i2c: bcm2835: Protect against unexpected TXW/RXR interrupts e2474541032db65d02bf88b6a8c2f954654b443f i2c: bcm2835: Fix hang for writing messages larger than 16 bytes
@msperl The issue is only reproducible with RPi 3. I've the suspicion that the i2c driver is never used and the interrupt is triggered by the faulty pin configuration in DT in possible combination with the firmware. Please compare this and that.
Looking at the traces some more - it seems to be related to hdmi scanning:
[ 2169.115441] [] (vc4_hdmi_connector_detect [vc4]) from [] (output_poll_execute+0xfc/0x1a4 [drm_kms_helper])
So I wonder if HDMI i2c gets "probed" and later "removed" in this sequence somehow? Add a dev_warn() or similar to log that this happens...
Maybe we need to clean BCM2835_I2C_S and Fifos during initialization.
I believe that i2c is used on the Pi3 for EDID detection via an expander and that might explain the issue happening on the RPi3 and not the RPi2. From the commit message of 9d44abbbb8d530e8cc97d71ffcbc0ff3b5553c62
"On Pi0/1/2, we use an external GPIO line for hotplug detection, since the HDMI_HOTPLUG register isn't connected to anything. However, with the Pi3 the HPD GPIO line has moved off to a GPIO expander that will be tricky to get to (the firmware is constantly polling the expander using i2c0, so we'll need to coordinate with it)."
So it might be using i2c0 but @anholt might be able to comment further with more details
A long time ago Gerd was submitting the pi3 patch series with the fxl6408 gpio expander driver present, which would fight over i2c0 (iirc) with the firmware. I think that patch isn't in your tree, though, given that you've got that series of 10s apart i2c timeout log messages (which we still need to delete) that look like our fallback pi3 monitor detection.
So, a curr_msg = NULL means that we haven't even started a transfer on the i2c that's catching the interrupt, since we set it early in bcm2835_i2c_xfer, and don't ever NULL it back out. Here's my guess for what's going on: The firmware is doing activity on i2c0 talking to fxl6408 (undervoltage detection). Generally, the firmware's irq chain is much shorter than Linux's, so it tends to win the race to process the interrupt for i2c0. Occasionally, though, something else is going on and the firmware loses, at which point Linux gets into its irq handler and looks for the current message on i2c0.
@popcornmix has told me that the firmware will hit i2c0 on all boards at various times: "e.g. following under-voltage, over-temperature, a turbo request from arm, or a hard GPU job (e.g. MVC video decode triggers turbo mode)". Some of those don't apply to upstream Linux today (VCHI not enabled means no video decode, and we don't have turbo requests because the cpufreq driver never landed), but I think this also forgot instances like "the 7" display is attached and the firmware is hitting i2c0 for touchscreen".
I think we need to keep i2c0 status="disabled" in the bcm2835-rpi.dtsi, so that we don't ever try to play on their bus. It would be nice if we could get some confirmation that i2c0 is really the bus that's crashing -- do you think you could get some printk in here for figuring out which one it is?
If you've got a patch to add the printk to the places you want I'm happy to do a build and provide any information it outputs.
I think this should do it. Without the curr_msg check, it gave me a bunch of nicely formatted DONE messages.
@anholt with it running with that patch for 90 mins I've had the following two debugs. I'll leave it running over night (or longer if it's useful). Do they provide confirmation as to whether it's i2c0 or not?
[ 4752.959711] i2c-bcm2835 3f205000.i2c: Processing DONE interrupt with no msg [ 4855.360126] i2c-bcm2835 3f205000.i2c: Processing DONE interrupt with no msg
3f205000.i2c is i2c0.
Maybe we should disable the Irq while i2c is not used to avoid such race issues? (Enable_irq on Line 170 and disable_irq on line 178 and 297) Problem is that the irq is shared between all 3 i2c busses, so it may disable irq for all of them
I guess the only safe way is disabling i2c0 on the rpi3 in the dt.
I think we should fix the pin configuration for i2c and pwm on RPi 3 before we play with those irqs.
Is there a patch proposed for the i2c and pwm issues on the RPi3?
I'll prepare a patch. Please give me some time.
So much for me thinking that I had hardened the isr against problems... I didn't catch that i2c0 was enabled in mainline, and I was the one who restructured the downstream dts base files for 4.9 and did study those files in depth...
Well well, yes, disabling i2c0 in mainline is a first step. Maybe something like this at the top of the isr
/* Some comment about the firmware and i2c0 */
if (WARN_ON_ONCE(!i2c_dev->curr_msg))
return IRQ_NONE;
The downstream 2708 driver handles this situation: https://github.com/raspberrypi/linux/blob/rpi-4.9.y/drivers/i2c/busses/i2c-bcm2708.c#L192
We should probably set i2c_dev->curr_msg = NULL
after the transfer as well to increase the chance of getting an error message if someone uses i2c0 together with the firmware.
I think we need a comment somehwere that explains exactly in which situations the firmware uses i2c0.
And I guess this means that pre 4.10 the driver did occasionally acknowledge the i2c0 interrupts which the firmware was waiting for, silently timing out it's transfers.
Thanks for fixing it up @lategoodbye.
@notro Thanks for your suggestion, so my patch wasn't quite correct and in case curr_msg is NULL we should return with IRQ_NONE immediately? But i like Eric's idea of using a dev_warn instead of WARN.
I prefer Eric's too, the log warning is much less scary for the average user (which means less people filing bug reports). I've pushed Eric's patch into the Fedora 4.10 kernel for a work around to the hard crash while we work out the best course of action in terms of DT, irq or what ever the final outcome is.
I think we should bail out at once if we get an unexpected interrupt, since that means the firmware is using the device. And if I'm not mistaken, the interrupt will fire again since we don't acknowledge it and the firmware will probably catch it first the next time. But that means that we will get an interrupt storm if the firmware isn't responsible for the interrupt, since no one will acknowledge it. I don't know what the firmware does with unexpected interrupts. So what's most important, not break the firmware's use of i2c0 or to avoid an unlikely (famous last words) interrupt storm? Maybe your solution is best after all.
dev_warn is ok by me, maybe dev_warn_ratelimited if we don't ack the interrupt. Or rather dev_warn_ratelimited if we don't touch anything, and dev_err if we ack the interrupt since we most likely steal the firmware's interrupt by doing that.
We shouldn't try to harden our isr against spurious IRQs, we should just disable the controller in the DT. We don't get to drive it, and if we ever do anything with it from Linux, only suffering will come of it.
Okay, i send a first patch series to fix the NULL pointer dereference and the invalid pin configuration (without disabling i2c0). If the i2c timeouts still occurs i would add an additional patch to disable i2c0. Since the issue would occur with 4.9 DT blobs and a 4.10 kernel, we also need to fix i2c-bcm2835.