[BUG] MTU larger than 1500 bytes does not work over Linux bridges

Open ipspace opened this issue 10 months ago • 3 comments

We're wasting tons of time arguing about MTU settings and ignoring the elephant in the room: it won't work unless we adjust the underlying virtualization infrastructure.

FWIW, it would be "great fun" troubleshooting BGP sessions or large OSPF networks :((

Lab topology

You can use any two devices as long as one of them is a host (forcing the Linux bridge to be used).

nodes:
  r:
    device: iosv
  h:
    device: linux

links:
- r:
  h:
  mtu: 1600

To Reproduce

Start the lab and try to run ping -s 1500 r from h :(

The "oversized" packets get dropped at the ingress interface of the Linux bridge. Once that MTU is adjusted, ping works.
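For reference, the arithmetic behind the failing ping can be sketched in a few lines of Python. This is only an illustration: it assumes plain IPv4 without IP options (20-byte header) and an 8-byte ICMP echo header, and the function names are made up for this example.

```python
# Sketch: how big is the IPv4 packet produced by `ping -s <payload>`?
# Assumptions: IPv4 without options, standard ICMP echo header.
ICMP_HEADER = 8   # bytes
IPV4_HEADER = 20  # bytes

def ipv4_packet_size(ping_payload: int) -> int:
    """Size of the IPv4 packet generated by `ping -s <ping_payload>`."""
    return ping_payload + ICMP_HEADER + IPV4_HEADER

def fits(ping_payload: int, mtu: int) -> bool:
    """True if the packet fits into the given MTU without fragmentation."""
    return ipv4_packet_size(ping_payload) <= mtu

if __name__ == "__main__":
    print(ipv4_packet_size(1500))  # 1528
    print(fits(1500, 1500))        # False: too big for a 1500-byte bridge MTU
    print(fits(1500, 1600))        # True: fits the 1600-byte link MTU
```

So the test packet fits the MTU configured on the link, but not a bridge that is still at 1500 bytes.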

It looks like we have to set the ":libvirt__mtu" setting in Vagrantfile. I have no idea what happens with vrnetlab-based containers.

ipspace • Feb 26 '25 17:02

It looks like we have to set the ":libvirt__mtu" setting in Vagrantfile. I have no idea what happens with vrnetlab-based containers.

See https://github.com/srl-labs/containerlab/blob/main/docs/manual/network.md#link-mtu

MTU defaults to 9500, bridge "should" inherit the minimum MTU
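A quick way to check whether a bridge really ended up with the expected MTU is to compare it with the MTUs of its member ports. The following is a minimal Python sketch, assuming a Linux host with the usual sysfs layout (/sys/class/net/<dev>/mtu and /sys/class/net/<bridge>/brif/). The function names and the bridge name used below are hypothetical.

```python
from pathlib import Path

SYSFS_NET = Path("/sys/class/net")  # assumed standard sysfs location

def read_mtu(dev: str, sysfs: Path = SYSFS_NET) -> int:
    """Read the MTU of a network device from sysfs."""
    return int((sysfs / dev / "mtu").read_text().strip())

def bridge_mtu_report(bridge: str, sysfs: Path = SYSFS_NET) -> dict:
    """Return the bridge MTU, the MTU of every bridge port, and the minimum port MTU."""
    ports = sorted(p.name for p in (sysfs / bridge / "brif").iterdir())
    port_mtus = {port: read_mtu(port, sysfs) for port in ports}
    return {
        "bridge": read_mtu(bridge, sysfs),
        "ports": port_mtus,
        "min_port": min(port_mtus.values(), default=None),
    }
```

Calling bridge_mtu_report("virbr1") (hypothetical bridge name) on the lab host would show whether the bridge MTU is lower than the MTU of the attached interfaces.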

jbemmel • Feb 26 '25 17:02

Update(s):

  • Libvirt LAN links definitely have a problem
  • UDP tunnels (libvirt P2P links) seem to be OK
  • Pure containers are OK as the change in MTU changes the MTU of the underlying Linux interface
  • vrnetlab containers are OK -- containerlab changes the MTU to 9500, and the corresponding QEMU tap interface has an MTU of 65000

The truly bizarre part: it seems like the Linux bridge is doing IPv4 fragmentation while bridging packets. I can't decide whether to be amazed or disgusted.

ipspace • Feb 26 '25 17:02

The truly bizarre part: it seems like the Linux bridge is doing IPv4 fragmentation while bridging packets. I can't decide whether to be amazed or disgusted.

https://www.spinics.net/lists/netdev/msg596072.html

When the "/proc/sys/net/bridge/bridge-nf-call-iptables" is on, bridge will do defragment at PREROUTING and re-fragment at POSTROUTING. At the re-fragment bridge will check if the max frag size is larger than the bridge's MTU in  br_nf_ip_fragment(), if it is true packets will be dropped. And this patch use the outdev's MTU instead of the bridge's MTU to do the br_nf_ip_fragment.

Could it be br_netfilter doing the (de)fragmentation, in support of firewall filters? See https://github.com/torvalds/linux/blob/master/net/bridge/br_netfilter_hooks.c#L807
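To test that theory on a lab host, one could check whether the sysctl mentioned in the quoted message is enabled. A minimal Python sketch follows; it assumes the sysctl is exposed at the /proc path quoted above and that the file is missing when br_netfilter is not loaded. The function name is made up for this example.

```python
from pathlib import Path
from typing import Optional

# Path taken from the quoted netdev message
NF_CALL_IPTABLES = Path("/proc/sys/net/bridge/bridge-nf-call-iptables")

def bridge_nf_call_iptables(path: Path = NF_CALL_IPTABLES) -> Optional[bool]:
    """True/False if the sysctl is on/off, None if the file does not exist
    (assumed to mean that br_netfilter is not loaded)."""
    try:
        return path.read_text().strip() == "1"
    except FileNotFoundError:
        return None

if __name__ == "__main__":
    print(bridge_nf_call_iptables())
```

If this returns True on the host running the lab, the defragment/re-fragment behavior described in the quoted message would be a plausible explanation.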

jbemmel • Feb 26 '25 18:02