Kernel 6.15+ may have network connectivity issues on the Pi4/CM4

Open ZakKemble opened this issue 4 months ago • 17 comments

Describe the bug

FYI: since kernel 6.15 there may be a lot of network connectivity issues (transmit queue lockup) on the Pi4/CM4, triggered by traffic coming from both the kernel and a user space application at the same time, for example if the Pi is set up as a router and file server.

I don't have time to look into it myself, but I have notified the maintainers of the bcmgenet driver and as far as I know it has not yet been fixed - https://lists.openwall.net/netdev/2025/06/27/261

Steps to reproduce the behaviour

  • Configure the Pi as a router (or something where the kernel sends packets)
  • Set up a file server or iperf3 (some user space application that can send packets)
  • Transmit kernel and user space traffic out of the BCM interface at gigabit speed

Device(s)

Raspberry Pi CM4, Raspberry Pi CM4 Lite, Raspberry Pi 4 Mod. B

System

Raspberry Pi reference 2024-11-19 Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 891df1e21ed2b6099a2e6a13e26c91dea44b34d4, stage2

Mar 19 2025 18:24:21 Copyright (c) 2012 Broadcom version ca6e8171a80ea46924ffaa629250bfb482f3a02c (clean) (release) (start)

Linux router.localnet 6.12.25-v8-ZAK+ #1 SMP PREEMPT Sat Apr 26 13:37:08 BST 2025 aarch64 GNU/Linux

Logs

Jun 26 14:32:06 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 2004 ms
Jun 26 14:32:08 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 1: transmit queue 4 timed out 2004 ms
Jun 26 14:32:09 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 2: transmit queue 3 timed out 2004 ms
Jun 26 14:32:10 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 1: transmit queue 3 timed out 2892 ms
Jun 26 14:32:11 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 2: transmit queue 3 timed out 3884 ms
Jun 26 14:32:12 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 0: transmit queue 1 timed out 2208 ms
Jun 26 14:32:13 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 1: transmit queue 1 timed out 3232 ms
Jun 26 14:32:14 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 1: transmit queue 1 timed out 4224 ms
Jun 26 14:32:15 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 2: transmit queue 1 timed out 5216 ms

Additional context

No response

ZakKemble avatar Aug 30 '25 09:08 ZakKemble

Kernel 6.15+ may have network connectivity issues

Linux router.localnet 6.12.25-v8-ZAK+ #1 SMP PREEMPT Sat Apr 26 13:37:08 BST 2025 aarch64 GNU/Linux

Does the issue only affect 6.15 and later? You seem to have reported you are running 6.12?

popcornmix avatar Aug 30 '25 10:08 popcornmix

Ah sorry for the confusion. Yes the issue is only in 6.15 and later. I've just backported the driver code from 6.16 to 6.12 with some other changes for my system.

ZakKemble avatar Aug 30 '25 10:08 ZakKemble

Oh and backporting the 6.16 driver code to 6.12 also brings along the transmit queue lockup problem, but it's not an issue for my system since it's not running any user space applications that send network traffic on the BCM interface.

ZakKemble avatar Aug 30 '25 17:08 ZakKemble

@ZakKemble Since there were a lot of changes since 6.15, are you able to bisect the issue with a mainline (torvalds) kernel?

lategoodbye avatar Aug 31 '25 09:08 lategoodbye

@ZakKemble Since there were a lot of changes since 6.15, are you able to bisect the issue with a mainline (torvalds) kernel?

As mentioned, I don't have time to do this.

ZakKemble avatar Aug 31 '25 09:08 ZakKemble

Edit: I'm able to reproduce this issue with mainline kernel 6.15 (arm64/defconfig) using iperf3.

Setup: Notebook (Server) --- 1 Gigabit --- Raspberry Pi 4 B (Client)

Running 10 parallel clients on Raspberry Pi side seems to trigger this.

lategoodbye avatar Aug 31 '25 10:08 lategoodbye
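
For anyone repeating this test, here is a minimal sketch of the client-side invocation. It assumes iperf3 is installed on the Pi and that a server (`iperf3 -s`) is already running on the notebook. The address `192.168.1.10` and the 60-second duration are placeholders, not values from the test above; `-P` is the iperf3 option for the number of parallel client streams.

```python
def iperf3_client_cmd(server, parallel=10, seconds=60):
    """Build the iperf3 client command line for `parallel` simultaneous streams."""
    return ["iperf3", "-c", server, "-P", str(parallel), "-t", str(seconds)]


# 192.168.1.10 is a placeholder for the notebook running `iperf3 -s`.
cmd = iperf3_client_cmd("192.168.1.10")
print(" ".join(cmd))
# To actually start the test, pass the list to subprocess.run(cmd).
```

Note that, going by the original report, user space traffic alone may not be enough on every setup; the kernel may need to be sending packets on the same interface at the same time.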

@ffainelli I didn't have the time to analyze this issue properly, but I want to share my observations here. The following commit list shows how many parallel iperf clients are necessary to trigger at least 1 transmit queue timeout on Raspberry Pi 4 (8 GB RAM, arm64/defconfig):

  • 0ff41df1cb26: parallel clients 2 and above
  • d2b41068056b: parallel clients 2 and above
  • 64fdb808660d: parallel clients 3 and above
  • 38fec10eb60d: parallel clients 4 and above

The fact that v6.14 also shows transmit queue timeouts suggests that no single commit introduced a regression and that, in the worst case, it never worked properly before.

lategoodbye avatar Sep 09 '25 16:09 lategoodbye

I think that problem has always been there, I have seen it for as long as I had a Raspberry Pi 4 in my home, which is circa 5 years. Queues 0-3 are configured with 32 descriptors available, which is very few, while queue 16 is configured with 128 descriptors. As long as the timeouts are recoverable, I don't necessarily consider that a bug, but an annoyance.

ffainelli avatar Sep 09 '25 17:09 ffainelli

The problem I've been having with 6.15+ is that the transmit timeouts do not recover. Notice the ever-increasing millisecond timer in the kernel output.

Jun 26 14:32:21 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 2: transmit queue 1 timed out 11200 ms
Jun 26 14:32:22 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 2: transmit queue 1 timed out 12224 ms
Jun 26 14:32:23 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 2: transmit queue 1 timed out 13216 ms
Jun 26 14:32:24 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 3: transmit queue 1 timed out 14208 ms
Jun 26 14:32:25 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 1: transmit queue 1 timed out 15200 ms

ZakKemble avatar Sep 09 '25 17:09 ZakKemble
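
The recoverable/non-recoverable distinction can be checked mechanically. The following is a rough sketch, not part of any existing tool: it parses watchdog lines in the format shown above and flags queues whose reported timeout keeps growing. The function names and the "last 3 reports strictly increasing" rule are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Matches the watchdog message format seen in the logs in this thread.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: CPU: (\d+): transmit queue (\d+) timed out (\d+) ms"
)


def timeouts_by_queue(lines):
    """Map queue number -> list of reported timeout durations (ms), in log order."""
    result = defaultdict(list)
    for line in lines:
        m = WATCHDOG_RE.search(line)
        if m:
            result[int(m.group(2))].append(int(m.group(3)))
    return dict(result)


def stuck_queues(lines, min_reports=3):
    """Return queues whose last `min_reports` timeouts are strictly increasing."""
    stuck = []
    for queue, values in sorted(timeouts_by_queue(lines).items()):
        tail = values[-min_reports:]
        if len(tail) == min_reports and all(a < b for a, b in zip(tail, tail[1:])):
            stuck.append(queue)
    return stuck


LOG = """\
Jun 26 14:32:21 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 2: transmit queue 1 timed out 11200 ms
Jun 26 14:32:22 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 2: transmit queue 1 timed out 12224 ms
Jun 26 14:32:23 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 2: transmit queue 1 timed out 13216 ms
Jun 26 14:32:24 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 3: transmit queue 1 timed out 14208 ms
Jun 26 14:32:25 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 1: transmit queue 1 timed out 15200 ms
"""

print(stuck_queues(LOG.splitlines()))  # prints [1]
```

A queue that recovers would be expected to show its timeout value dropping back (or the messages stopping), so it would not be flagged by this check.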

OK, we will try to reproduce and fix it.

ffainelli avatar Sep 09 '25 17:09 ffainelli

The problem I've been having with 6.15+ is that the transmit timeouts do not recover. Notice the ever-increasing millisecond timer in the kernel output.

Jun 26 14:32:21 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 2: transmit queue 1 timed out 11200 ms
Jun 26 14:32:22 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 2: transmit queue 1 timed out 12224 ms
Jun 26 14:32:23 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 2: transmit queue 1 timed out 13216 ms
Jun 26 14:32:24 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 3: transmit queue 1 timed out 14208 ms
Jun 26 14:32:25 router.localnet kernel: bcmgenet fd580000.ethernet lan0: NETDEV WATCHDOG: CPU: 1: transmit queue 1 timed out 15200 ms

Sorry, I didn't want to negate your observations regarding recoverability, and thanks for your feedback. In my eyes there are two issues:

  • userspace is able to trigger netdev watchdog (currently this feature is not really helpful in this driver)
  • it is possible to get bcmgenet in a non-recoverable state

lategoodbye avatar Sep 09 '25 18:09 lategoodbye

I have been experiencing this, but can't trigger it manually. Currently, it just seems like the forwarding stops at "random" times. But it could be days or weeks between issues. I tried iperf3 in multiple different ways, but haven't been successful.

Is there some extra tracing on the bcmgenet driver which might be helpful?

It's definitely unrecoverable and reboot is the only option.

aplund avatar Nov 03 '25 03:11 aplund

Just to add to my above comment. I was testing with 6.12.55, and after running that for over a week, I haven't come across this transmit queue timeout message. However, the last boot, with the same kernel, had the error pop up after an uptime of about 2.5 days. So I'm at a bit of a loss to recreate it.

aplund avatar Nov 09 '25 22:11 aplund

@aplund The kernel must also send packets at the same time as iperf3. When both the kernel and a user space application are sending packets and saturating the BCM interface the TX queue lockup happens almost immediately.

There is a module called pktgen which can generate and send packets from within the kernel, but I've not used it before. Otherwise, the Pi can be set up as a router or bridge with a second USB-ethernet interface.

Also this only affects 6.15+

ZakKemble avatar Nov 09 '25 23:11 ZakKemble

@aplund The kernel must also send packets at the same time as iperf3. When both the kernel and a user space application are sending packets and saturating the BCM interface the TX queue lockup happens almost immediately.

OK. So I was using iperf3 only on the bcm interface and the packets weren't being forwarded. Is this what you mean?

There is a module called pktgen which can generate and send packets from within the kernel, but I've not used it before. Otherwise, the Pi can be set up as a router or bridge with a second USB-ethernet interface.

This is how this Pi4 is being used. I have 'end0' for a LAN and 'enp1s0u2u3' via USB-ethernet interface for WAN.

Also this only affects 6.15+

Has there been a backport of something to the 6.12 branch?

aplund avatar Nov 10 '25 22:11 aplund

OK. So I was using iperf3 only on the bcm interface and the packets weren't being forwarded. Is this what you mean?

Yea, when the TX lockup occurs nothing is sent out of the BCM interface. That includes traffic from iperf3 and forwarded traffic.

This is how this Pi4 is being used. I have 'end0' for a LAN and 'enp1s0u2u3' via USB-ethernet interface for WAN.

Sounds right

Has there been a backport of something to the 6.12 branch?

No. (apart from what I did for myself)

From what I can tell, the rpi kernel maintainers only release LTS kernel versions, and since the upstream kernel does an LTS release around this time of year I thought I should create this issue to warn you guys of the problem.

Try the bcm2711_build kernel from here https://github.com/raspberrypi/linux/actions/runs/19230176416

ZakKemble avatar Nov 10 '25 23:11 ZakKemble

I'm seeing the issue described here. I haven't tried the kernel linked to above yet.

Details:

  • rpi 4 w/ 2GiB memory
  • Linux mypi 6.12.25+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.12.25-1+rpt1 (2025-04-30) aarch64 GNU/Linux

Lots of the following error messages:

  • Dec 08 09:11:49 mypi kernel: bcmgenet fd580000.ethernet end0: NETDEV WATCHDOG: CPU: 2: transmit queue 0 timed out 2004 ms

I eventually have to reboot from the console. I have a hacky script that reboots when it sees this error in journalctl.

e4jet avatar Dec 08 '25 22:12 e4jet
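
As a stopgap along the lines of the reboot script mentioned above, here is a minimal sketch (not e4jet's actual script). It follows the kernel log with `journalctl -k -f -o cat` and runs an action once a bcmgenet watchdog timeout passes a threshold. The 10000 ms threshold, the function names, and the default `systemctl reboot` action are assumptions; the threshold is meant to skip short timeouts that may still recover.

```python
import re
import subprocess

# Matches bcmgenet watchdog lines in the format seen in this thread.
WATCHDOG_RE = re.compile(
    r"bcmgenet .*NETDEV WATCHDOG: .*transmit queue (\d+) timed out (\d+) ms"
)


def exceeds_threshold(line, threshold_ms=10000):
    """True if `line` is a bcmgenet watchdog message at or above the threshold."""
    m = WATCHDOG_RE.search(line)
    return bool(m) and int(m.group(2)) >= threshold_ms


def watch(threshold_ms=10000, action=("systemctl", "reboot")):
    """Follow the kernel log and run `action` once a timeout passes the threshold."""
    proc = subprocess.Popen(
        ["journalctl", "-k", "-f", "-o", "cat"],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        for line in proc.stdout:
            if exceeds_threshold(line, threshold_ms):
                subprocess.run(list(action), check=False)
                break
    finally:
        proc.terminate()


# watch() is deliberately not called here; it needs root and will reboot the machine.
```

To use it, call `watch()` from a process running as root, for example a small systemd service. This only works around the lockup; it does not address the driver problem itself.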