Onboard Ethernet on Raspberry Pi 5 experiences packet loss in LAN environment

Open iory123 opened this issue 7 months ago • 19 comments

Describe the bug

When using the onboard Ethernet interface (eth0) on Raspberry Pi 5 connected to a standard LAN (local area network), noticeable packet loss occurs—even under normal traffic conditions. This happens intermittently but consistently, even when CPU load is low and IPv6/multicast-related services are disabled. Using a USB Ethernet adapter in the same environment shows no such issue.

Steps to reproduce the behaviour

1. Connect Raspberry Pi 5 to a typical home or office LAN via onboard Ethernet (eth0), using a known-good CAT5e/6 cable.

2. Ensure other LAN devices (PCs, printers, routers) are present and active on the same network.

3. Assign the Pi a static IP (or use DHCP) and verify the link is up at 1000 Mbps full duplex.

4. From another LAN host, run:

 ping <raspberrypi-ip> -i 0.2  

5. On the Raspberry Pi, monitor packet stats via:

ip -s link show dev eth0  

6. Optionally capture traffic using tcpdump or tshark for later analysis.

7. Observe packet loss (e.g., >10%) and dropped RX packets.

8. Replace onboard Ethernet with a USB3-to-GbE adapter and repeat the test; the issue no longer occurs.
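
Steps 5 and 7 could be scripted rather than read by eye. The following is a minimal sketch, assuming the standard Linux sysfs counters under /sys/class/net/<iface>/statistics/ are available; the function names `read_counter` and `sample_rx` are made up for this illustration and are not part of any tool mentioned in the report.

```python
import time
from pathlib import Path


def read_counter(iface, name, root="/sys/class/net"):
    """Read one interface counter (e.g. rx_dropped) from sysfs as an int."""
    return int((Path(root) / iface / "statistics" / name).read_text().strip())


def sample_rx(iface="eth0", interval=5.0, root="/sys/class/net"):
    """Return how much each RX counter grew over `interval` seconds."""
    names = ("rx_packets", "rx_dropped", "rx_errors")
    before = {n: read_counter(iface, n, root) for n in names}
    time.sleep(interval)
    after = {n: read_counter(iface, n, root) for n in names}
    return {n: after[n] - before[n] for n in names}
```

Calling `sample_rx("eth0", 5.0)` on the Pi while the ping from step 4 is running would give the per-interval growth of the same counters that `ip -s link show dev eth0` prints. The `root` parameter exists only so the sketch can be pointed at a test directory.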

Device(s)

Raspberry Pi 5

System

Raspberry Pi reference 2025-05-13 Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 5dabc7dc940059dfbc46af5d97b60a1e812523dd, stage4

Copyright (c) 2012 Broadcom version 26826259 (release) (embedded)

Linux RPi5 6.12.25+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.12.25-1+rpt1 (2025-04-30) aarch64 GNU/Linux

Logs

dmesg.log

Additional context

Issue seems related to onboard NIC driver or hardware handling of LAN broadcast/multicast. Even without high CPU usage or large traffic volume, onboard eth0 shows RX drops and packet loss, reproducible in both routed and switched LAN environments. USB adapters do not show the problem.

iory123 avatar May 15 '25 14:05 iory123

What does sudo ethtool -S eth0 report?

nbuchwitz avatar May 18 '25 15:05 nbuchwitz

What does sudo ethtool -S eth0 report?

NIC statistics:
     tx_octets: 34927
     tx_frames: 238
     tx_broadcast_frames: 3
     tx_multicast_frames: 24
     tx_pause_frames: 0
     tx_64_byte_frames: 22
     tx_65_127_byte_frames: 144
     tx_128_255_byte_frames: 62
     tx_256_511_byte_frames: 3
     tx_512_1023_byte_frames: 4
     tx_1024_1518_byte_frames: 3
     tx_greater_than_1518_byte_frames: 0
     tx_underrun: 0
     tx_single_collision_frames: 0
     tx_multiple_collision_frames: 0
     tx_excessive_collisions: 0
     tx_late_collisions: 0
     tx_deferred_frames: 0
     tx_carrier_sense_errors: 0
     rx_octets: 81238
     rx_frames: 684
     rx_broadcast_frames: 430
     rx_multicast_frames: 30
     rx_pause_frames: 0
     rx_64_byte_frames: 157
     rx_65_127_byte_frames: 348
     rx_128_255_byte_frames: 144
     rx_256_511_byte_frames: 30
     rx_512_1023_byte_frames: 1
     rx_1024_1518_byte_frames: 4
     rx_greater_than_1518_byte_frames: 0
     rx_undersized_frames: 0
     rx_oversize_frames: 0
     rx_jabbers: 0
     rx_frame_check_sequence_errors: 0
     rx_length_field_frame_errors: 0
     rx_symbol_errors: 0
     rx_alignment_errors: 0
     rx_resource_errors: 0
     rx_overruns: 0
     rx_ip_header_checksum_errors: 0
     rx_tcp_checksum_errors: 0
     rx_udp_checksum_errors: 0
     q0_rx_packets: 684
     q0_rx_bytes: 68926
     q0_rx_dropped: 0
     q0_tx_packets: 236
     q0_tx_bytes: 34121
     q0_tx_dropped: 0
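
For anyone comparing several boards, a listing like the one above can be filtered down to the counters that matter. This is a rough sketch that assumes the `name: value` layout shown here; the helper names and the `SUSPECT` keyword list are illustrative choices, not an official classification of macb counters.

```python
import re

# Substrings that usually mark an error-type counter (an assumption for this sketch).
SUSPECT = ("error", "dropped", "overrun", "underrun",
           "collision", "jabber", "undersized", "oversize")


def parse_counters(text):
    """Turn `ethtool -S` text into {counter_name: int}."""
    return {name: int(value) for name, value in re.findall(r"(\w+):\s+(\d+)", text)}


def nonzero_suspects(counters):
    """Keep only error-type counters whose value is not zero."""
    return {k: v for k, v in counters.items()
            if v and any(s in k for s in SUSPECT)}
```

Fed the statistics above, `nonzero_suspects(parse_counters(...))` returns an empty dict: every error-type counter reads zero even though the ping below loses packets.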

PING 192.168.33.1 (192.168.33.1) 56(84) bytes of data.
64 bytes from 192.168.33.1: icmp_seq=1 ttl=64 time=3.69 ms
64 bytes from 192.168.33.1: icmp_seq=2 ttl=64 time=2.01 ms
64 bytes from 192.168.33.1: icmp_seq=3 ttl=64 time=2.19 ms
64 bytes from 192.168.33.1: icmp_seq=4 ttl=64 time=2.39 ms
64 bytes from 192.168.33.1: icmp_seq=5 ttl=64 time=2.09 ms
64 bytes from 192.168.33.1: icmp_seq=6 ttl=64 time=2.08 ms
64 bytes from 192.168.33.1: icmp_seq=7 ttl=64 time=2.51 ms
64 bytes from 192.168.33.1: icmp_seq=9 ttl=64 time=1.38 ms
64 bytes from 192.168.33.1: icmp_seq=10 ttl=64 time=2.16 ms
64 bytes from 192.168.33.1: icmp_seq=11 ttl=64 time=2.28 ms
64 bytes from 192.168.33.1: icmp_seq=12 ttl=64 time=2.35 ms
64 bytes from 192.168.33.1: icmp_seq=13 ttl=64 time=2.08 ms
64 bytes from 192.168.33.1: icmp_seq=15 ttl=64 time=2.14 ms
64 bytes from 192.168.33.1: icmp_seq=16 ttl=64 time=2.26 ms
64 bytes from 192.168.33.1: icmp_seq=17 ttl=64 time=2.27 ms
64 bytes from 192.168.33.1: icmp_seq=18 ttl=64 time=2.06 ms
64 bytes from 192.168.33.1: icmp_seq=19 ttl=64 time=2.25 ms
64 bytes from 192.168.33.1: icmp_seq=21 ttl=64 time=1.77 ms
64 bytes from 192.168.33.1: icmp_seq=22 ttl=64 time=2.28 ms
64 bytes from 192.168.33.1: icmp_seq=23 ttl=64 time=6.19 ms
64 bytes from 192.168.33.1: icmp_seq=24 ttl=64 time=2.43 ms
64 bytes from 192.168.33.1: icmp_seq=25 ttl=64 time=2.03 ms
64 bytes from 192.168.33.1: icmp_seq=27 ttl=64 time=2.46 ms
64 bytes from 192.168.33.1: icmp_seq=28 ttl=64 time=2.43 ms
64 bytes from 192.168.33.1: icmp_seq=29 ttl=64 time=1.96 ms
64 bytes from 192.168.33.1: icmp_seq=30 ttl=64 time=2.12 ms
64 bytes from 192.168.33.1: icmp_seq=31 ttl=64 time=2.27 ms
64 bytes from 192.168.33.1: icmp_seq=33 ttl=64 time=1.96 ms
64 bytes from 192.168.33.1: icmp_seq=34 ttl=64 time=2.61 ms
64 bytes from 192.168.33.1: icmp_seq=35 ttl=64 time=1.88 ms
^C
--- 192.168.33.1 ping statistics ---
35 packets transmitted, 30 received, 14.2857% packet loss, time 34091ms
rtt min/avg/max/mdev = 1.377/2.351/6.185/0.799 ms
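
The loss percentage alone hides which probes went missing. The sketch below lists the gaps in the `icmp_seq` numbers of a capture like the one above; `missing_sequences` is a name invented for this example, and it can only see gaps between the first and last reply received, so losses after the final reply go unnoticed.

```python
import re


def missing_sequences(ping_output):
    """Return the icmp_seq values absent between the first and last reply."""
    seen = {int(s) for s in re.findall(r"icmp_seq=(\d+)", ping_output)}
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]
```

For the capture above this gives `[8, 14, 20, 26, 32]`, i.e. the five lost probes are evenly spaced six sequence numbers apart. Whether that spacing is meaningful or a coincidence of a short run is not established in this thread.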

iory123 avatar May 18 '25 22:05 iory123

Short update: We have started to see similar, if not the same, issues with CM5-based devices in our production: the rp1 ethernet adapter shows huge packet loss, whereas other adapters such as the lan7430 on pcie do not, and the error counters show nothing suspicious. I haven't found a smoking gun so far, but I will investigate further so I can provide more details.

nbuchwitz avatar Jun 11 '25 22:06 nbuchwitz

On my last visit to our factory, I took some of the affected devices with me in order to investigate further:

The cadence rp1 / phy combo doesn't advertise itself as EEE compatible, so let's assume that's correct (the BCM54213PE phy does support EEE). Interestingly, if I connect an EEE-capable adapter directly to an affected rp1 port, the adapter reports an EEE-capable partner with active status:

$ ethtool --show-eee enx00e04c68030c                                          
EEE settings for enx00e04c68030c:                                                             
        EEE status: enabled - active                                                          
        Tx LPI: disabled                                                                      
        Supported EEE link modes:  100baseT/Full                                              
                                   1000baseT/Full                                             
        Advertised EEE link modes:  100baseT/Full                                             
                                    1000baseT/Full                                            
        Link partner advertised EEE link modes:  100baseT/Full                                
                                                 1000baseT/Full   

In every case where the link is detected as EEE active, I see a very high packet loss rate. In most cases it even fails to get an IP via DHCP. If I force the partner to disable EEE, the loss is gone and everything works as expected:

ethtool --show-eee enx00e04c68030c                                          
EEE settings for enx00e04c68030c:                                                             
        EEE status: enabled - inactive                                                        
        Tx LPI: disabled                                                                      
        Supported EEE link modes:  100baseT/Full                                              
                                   1000baseT/Full
        Advertised EEE link modes:  Not reported
        Link partner advertised EEE link modes:  Not reported

Interestingly, on some boots / reboots the partner device reports the link as not EEE capable and everything works too. To make sure that this is not related to a particular partner, I used different USB adapters (RTL8153 and AX88179 chipsets). Our devices also have other LAN7430-based ports which work flawlessly with the same partner device (the LAN7430 is advertised as EEE capable, though).

There are also reports from other users in the forums: https://forums.raspberrypi.com/viewtopic.php?t=360924

So my current guess is that something is off in either the driver or the MAC/PHY. This issue doesn't show on all devices (about 10-20% of all tested CM5 in our current production, most of them the wireless variant). When the CM5 is swapped, it works in most cases, even without manually forcing EEE off on the partner device. As this happens regularly in our EOL tests, we should be able to provide you with a device for testing if wanted.

nbuchwitz avatar Jun 14 '25 20:06 nbuchwitz
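
The check described in the comment above (does the link partner see EEE as active, and is the peer advertising EEE modes?) can be read out of `ethtool --show-eee` text mechanically. This is a minimal sketch that assumes the output layout shown in this thread, which may differ between ethtool versions; `eee_state` is a hypothetical helper name.

```python
import re


def eee_state(text):
    """Summarise `ethtool --show-eee` output as a small dict."""
    m = re.search(r"EEE status:[ \t]*(.*\S)", text)
    status = m.group(1) if m else None
    # "enabled - active" vs "enabled - inactive": compare the part after " - ".
    active = status is not None and status.split(" - ")[-1] == "active"
    p = re.search(r"Link partner advertised EEE link modes:[ \t]*(.*\S)", text)
    partner = p is not None and p.group(1) != "Not reported"
    return {"status": status, "active": active, "partner_advertises": partner}
```

Run against the two listings in the comment above, the first yields `active` and `partner_advertises` both true (the lossy case) and the second yields both false (the working case).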

From what I can find on the cadence page the IP block should / could support EEE: https://www.cadence.com/en_US/home/tools/silicon-solutions/protocol-ip/interface-ip/ethernet/ethernet-controller.html

@pelwell, maybe you can check internally with the right people if this is also the case for the block in RP1?

nbuchwitz avatar Jun 14 '25 20:06 nbuchwitz

Pi Towers is a bit of a desert when it comes to EEE - it's not present/active on the house networks I've tried. Fortunately other Pis can be used as link peers, except that when trying it using two Pi 4s one indicated that EEE was active and enabled, while the other end said it was disabled. It just so happened that the "active" end was running an old kernel, and somewhere along the line it got broken - enough to show EEE as being disabled, but not enough for the other end to notice.

The change happened with the jump from 6.8 to 6.9, and so now the aim is to find the commit(s) responsible and determine the extent of the damage. Once I have a known-good peer for testing I'll swap the DUT for a Pi 5 and repeat.

pelwell avatar Jun 16 '25 10:06 pelwell

Shot in the dark (currently on the train so can't have a good look):

"Major overhaul of the Energy Efficient Ethernet internals to support new link modes (2.5GE, 5GE), share more code between drivers (especially those using phylib), and encourage more uniform behavior. Convert and clean up drivers"

This was merged in 6.9

Also, genet calls the eee handlers but macb doesn't. So with rp1 this means that the broadcom phy falls back to its "legacy" eee mode (called AutogrEEEn). I wasn't able to find any details on how this legacy mode works, though.

nbuchwitz avatar Jun 16 '25 10:06 nbuchwitz

Thanks, Nicolai - this is definitely one of those rare cases of an upstream framework change breaking something.

pelwell avatar Jun 16 '25 11:06 pelwell

Forgot to mention: The cm5 issue also happens with 6.6, so the changes in 6.9 probably only broke eee support for (cm|pi)4

nbuchwitz avatar Jun 16 '25 11:06 nbuchwitz

This is the first bad commit for Pi 4:

commit 522605b4c506faf3f110705f72b04a74822be401
Author: Andrew Lunn <[email protected]>
Date:   Sat Mar 2 20:53:02 2024 +0100

    net: phy: Keep track of EEE configuration
    
    Have phylib keep track of the EEE configuration. This simplifies the
    MAC drivers, in that they don't need to store it.
    
    Future patches to phylib will also make use of this information to
    further simplify the MAC drivers.
    
    Reviewed-by: Russell King (Oracle) <[email protected]>
    Signed-off-by: Andrew Lunn <[email protected]>
    Reviewed-by: Florian Fainelli <[email protected]>
    Signed-off-by: Oleksij Rempel <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Jakub Kicinski <[email protected]>

 drivers/net/phy/phy.c | 7 +++++--
 include/linux/phy.h   | 3 +++
 2 files changed, 8 insertions(+), 2 deletions(-)

Given that this is common code, it suggests that some driver has not kept up with the change.

[ Time passes ]

No, it's just that the patch that fixes the issue (49168d1980e2), despite its Fixes: tag, did not get back-ported to 6.12. Now done: 4994911f0c4f7704b7ad156f1061d38fed0fd3c6

Now on to the real problem.

pelwell avatar Jun 17 '25 13:06 pelwell

I'm currently flood pinging (sudo ping -f ...) a Pi 5 from a Pi 4 over a direct link using link local addresses, and so far I've not had a single packet loss:

~$ ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 169.254.137.143  netmask 255.255.0.0  broadcast 169.254.255.255
        inet6 fe80::9156:d709:143b:5012  prefixlen 64  scopeid 0x20<link>
        ether 2c:cf:67:70:73:4d  txqueuelen 1000  (Ethernet)
        RX packets 438874  bytes 36867322 (35.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 439009  bytes 44790836 (42.7 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 112

This is with no kernel changes on the Pi 5. The peer is reporting that EEE is active:

$ ethtool --show-eee eth0
EEE settings for eth0:
        EEE status: enabled - active
        Tx LPI: disabled
        Supported EEE link modes:  100baseT/Full
                                   1000baseT/Full
        Advertised EEE link modes:  100baseT/Full
                                    1000baseT/Full
        Link partner advertised EEE link modes:  100baseT/Full
                                                 1000baseT/Full

It would be helpful if I could reproduce the problem...

pelwell avatar Jun 17 '25 14:06 pelwell

As written above, this issue doesn't appear on all pi5/cm5. In production we see a failure rate of about 10-20% with CM5. I have a defective device with me which reproduces the issue reliably. Common to all affected devices is that the problem appears to occur during link auto-negotiation (rarely it works; most times there is high loss or no communication), so I really suspect this to be an issue with the phy.

To save you from running through pi towers and checking each and every bcm2712 device for these issues, I can ask production to send an affected device to you. In the meantime, I'm happy to test things with the device I have with me.

nbuchwitz avatar Jun 17 '25 15:06 nbuchwitz

I can ask production to send an affected device to you.

That would be really helpful.

I've established that the MACB hardware implements EEE, but the Linux driver doesn't include the necessary support. I can imagine how having a PHY mistakenly thinking that the MAC is EEE-aware could cause problems, so it might be best to just merge your DTS patch until we've sorted things out properly.

pelwell avatar Jun 17 '25 15:06 pelwell

I'll make it so.

If you can source a datasheet or at least some insights on the registers of the cadence mac and broadcom phy we can have a look how / if EEE can be implemented.

nbuchwitz avatar Jun 17 '25 15:06 nbuchwitz

Fortunately it looks as though the relevant information has been published by somebody else: https://onlinedocs.microchip.com/oxy/GUID-2ACDA668-0A87-46A1-B7FC-9DC74A5461AD-en-US-2/GUID-F191ED65-94C9-46C4-BF5F-13A9D7FE6E29.html

The relevant registers/bits are:

  • GMAC_NCR: TXLPIEN
  • GMAC_NSR: RXLPIS
  • GMAC_ISR: RXLPISBC
  • GMAC_IER: RXLPISBC
  • GMAC_IDR: RXLPISBC
  • GMAC_IMR: RXLPISBC
  • GMAC_RXLPI
  • GMAC_RXLPITIME
  • GMAC_TXLPI
  • GMAC_TXLPITIME

There are some useful descriptions of the operation in the sections:

  • 40.6.18 Energy Efficient Ethernet Support (https://onlinedocs.microchip.com/oxy/GUID-2ACDA668-0A87-46A1-B7FC-9DC74A5461AD-en-US-2/GUID-DB06487A-C43E-4C0B-A07F-2D7005911EAF.html) and
  • 40.6.19 LPI Operation in the GMAC (https://onlinedocs.microchip.com/oxy/GUID-2ACDA668-0A87-46A1-B7FC-9DC74A5461AD-en-US-2/GUID-8B886730-68A6-475B-9223-1D6E6C8BAAE0.html).

pelwell avatar Jun 17 '25 16:06 pelwell

Is the gmac based on the same cadence ip block used by rp1?

Found a datasheet for the cadence gem ([1]), but unfortunately the register details seem to be part of the non-public user manual.

[1] https://pix-server-sorel.luoss.fr/Manual/Pi/GigabitEthernetMAC%28GEM%29-TechnicalDataSheet-Cadence.pdf

nbuchwitz avatar Jun 17 '25 18:06 nbuchwitz

I have no knowledge of the GMAC, but the register definitions and the wording of the datasheet match, so draw your own conclusions.

pelwell avatar Jun 17 '25 18:06 pelwell

Conclusions have been drawn, and I have the feeling that we're heading in the right direction:

pi@RevPi148195:~$ cat /sys/class/net/eth0/device/modalias 
of:NethernetT(null)Craspberrypi,rp1-gemCcdns,macb
pi@RevPi148195:~$ sudo ethtool --show-eee eth0
EEE settings for eth0:
        EEE status: enabled - active
        Tx LPI: disabled
        Supported EEE link modes:  100baseT/Full
                                   1000baseT/Full
        Advertised EEE link modes:  100baseT/Full
                                    1000baseT/Full
        Link partner advertised EEE link modes:  100baseT/Full
                                                 1000baseT/Full

nbuchwitz avatar Jun 17 '25 21:06 nbuchwitz

While working on the upstream patch, I stumbled over something while debugging a strange LPI error. The phy statistics reported by ethtool show that phy_local_rcvr_nok is steadily incrementing on systems which seem to be affected by the initially reported issue. Other systems show a constant 1 or less.

@iory123 can you please share the output of sudo ethtool --phy-statistics eth0 on your affected Pi? I want to confirm whether this is a red herring or some hint at what's actually broken.

nbuchwitz avatar Jun 20 '25 21:06 nbuchwitz

While working on the upstream patch, I stumbled over something while debugging a strange LPI error. The phy statistics reported by ethtool show that phy_local_rcvr_nok is steadily incrementing on systems which seem to be affected by the initially reported issue. Other systems show a constant 1 or less.

@iory123 can you please share the output of sudo ethtool --phy-statistics eth0 on your affected Pi? I want to confirm whether this is a red herring or some hint at what's actually broken.

PHY statistics:
     phy_receive_errors: 0
     phy_serdes_ber_errors: 0
     phy_false_carrier_sense_errors: 0
     phy_local_rcvr_nok: 510
     phy_remote_rcv_nok: 0
     phy_lpi_count: 7789

iory123 avatar Jun 30 '25 17:06 iory123
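
A single snapshot such as the one above shows that phy_local_rcvr_nok (510) is far above the "constant 1 or less" seen on unaffected systems, but it cannot show whether the counter is still climbing. A rough sketch for comparing several snapshots of `ethtool --phy-statistics` output follows; the helper names are invented for this example and the parsing assumes the `phy_name: value` layout shown here.

```python
import re


def parse_phy_stats(text):
    """Turn `ethtool --phy-statistics` text into {counter_name: int}."""
    return {k: int(v) for k, v in re.findall(r"(phy_\w+):\s+(\d+)", text)}


def is_incrementing(samples, counter="phy_local_rcvr_nok"):
    """True if `counter` grows between every pair of consecutive samples."""
    values = [s[counter] for s in samples]
    return len(values) >= 2 and all(b > a for a, b in zip(values, values[1:]))
```

Collecting the output a few times, some seconds apart, and passing the parsed dicts to `is_incrementing` would separate a counter that keeps rising from one that jumped once at link-up and then stayed put.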