bridged networking does not inherit mtu (or provide option to do so)

Open fingon opened this issue 2 years ago • 14 comments

Issue Description

When running in Google Cloud on the new c3 instance types, the eth0 MTU is 1460. However, with rootful (bridged) networking, the podman default is 1500. It would be convenient if the MTU were autodetected, or if it could be updated using podman network update.

Steps to reproduce the issue

Steps to reproduce the issue (given fresh installation):

  1. sudo ifconfig eth0 mtu 1460 (only needed if not running on e.g. Google Cloud, where 1460 is already the default)
  2. sudo podman run --rm -it alpine ifconfig eth0

The second step will show that the rootful interface mtu is 1500.

Describe the results you received

1500 mtu.

Describe the results you expected

1460 mtu.

podman info output

host:                            
  arch: amd64          
  buildahVersion: 1.31.2
  cgroupControllers:     
  - cpu                                          
  - memory
  - pids             
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc38.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 99.6
    systemPercent: 0.16
    userPercent: 0.24
  cpus: 22
  databaseBackend: boltdb
  distribution:
    distribution: fedora      
    variant: cloud            
    version: "38"
  eventLogger: journald
  freeLocks: 2048
  hostname: mstenber-jumpbox                                
  idMappings:    
    gidmap:  
    - container_id: 0
      host_id: 1002
      size: 1 
    - container_id: 1     
      host_id: 524288
      size: 65536                                          
    uidmap:                        
    - container_id: 0       
      host_id: 1002
      size: 1                
    - container_id: 1          
      host_id: 524288      
      size: 65536          
  kernel: 6.3.12-200.aiven1.fc38.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 66659168256              
  memTotal: 92774498304
  networkBackend: netavark                                          
  networkBackendInfo:
    backend: netavark
    dns:           
      package: aardvark-dns-1.7.0-1.fc38.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.7.0
    package: netavark-1.7.0-1.fc38.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.7.0
  ociRuntime:
    name: crun                 
    package: crun-1.9-1.fc38.x86_64           
    path: /usr/bin/crun                            
    version: |-                               
      crun version 1.9                                            
      commit: a538ac4ea1ff319bcfe2bf81cb5c6f687e2dc9d3                                                 
      rundir: /run/user/1002/crun  
      spec: 1.0.0                
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux                 
  pasta:                               
    executable: /usr/bin/pasta                            
    package: passt-0^20230908.g05627dc-1.fc38.x86_64                 
    version: |                                                           
      pasta 0^20230908.g05627dc-1.fc38.x86_64               
      Copyright Red Hat                               
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.                                            
  remoteSocket:                             
    path: /run/user/1002/podman/podman.sock
  security:                                         
    apparmorEnabled: false                     
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT                                                 
    rootless: true                                 
    seccompEnabled: true                      
    seccompProfilePath: /usr/share/containers/seccomp.json        
    selinuxEnabled: true                                                                               
  serviceIsRemote: false                   
  slirp4netns:                            
    executable: /usr/bin/slirp4netns                                                                   
    package: slirp4netns-1.2.1-1.fc38.x86_64
    version: |-
      slirp4netns version 1.2.1                                                                        
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.7.0                     
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 101351153664
  swapTotal: 101351153664
  uptime: 2h 49m 36.00s (Approximately 0.08 days)
plugins:  
  authorization: null
  log:                  
  - k8s-file       
  - none 
  - passthrough                        
  - journald             
  network:                                   
  - bridge       
  - macvlan          
  - ipvlan             
  volume:            
  - local 
registries:              
  search:      
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io    
  - quay.io            
store:           
  configFile: /home/mstenber/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0        
    running: 0     
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}   
  graphRoot: /home/mstenber/.local/share/containers/storage
  graphRootAllocated: 1055719055360
  graphRootUsed: 21407711232
  graphStatus:     
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /stmp               
  imageStore:      
    number: 2        
  runRoot: /run/user/1002/containers
  transientStore: false
  volumePath: /home/mstenber/.local/share/containers/storage/volumes
version:             
  APIVersion: 4.6.2  
  Built: 1693251588
  BuiltTime: Mon Aug 28 19:39:48 2023          
  GitCommit: ""                             
  GoVersion: go1.20.7            
  Os: linux                              
  OsArch: linux/amd64                 
  Version: 4.6.2

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

Yes

Additional environment details

Google Cloud c3 family instance (but should really apply everywhere)

Additional information

Additional information like issue happens only occasionally or issue happens with a particular architecture or on a particular setting

fingon avatar Sep 18 '23 09:09 fingon

We do not auto-detect the MTU, and I don't think it makes sense to do so; it would be hard to guess what the right MTU is.

You can create networks with the proper mtu with --opt mtu=<NUM>. You can edit the default network config with something like this: https://github.com/containers/podman/discussions/13488#discussioncomment-2358133
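As a minimal sketch of the first suggestion, the helper below creates a network with an explicit MTU via --opt mtu=<NUM>. The network name lowmtu, the value 1460, and the PODMAN override variable are illustrative assumptions of this sketch, not anything podman defines.

```shell
# Create a podman network with an explicit MTU.
# $1: network name (hypothetical, e.g. "lowmtu")
# $2: MTU in bytes (e.g. 1460 to match the host's eth0)
# PODMAN may be overridden, e.g. PODMAN="sudo podman" for rootful networks
# or PODMAN=echo for a dry run; this variable is an assumption of the sketch.
create_mtu_network() {
    name=$1
    mtu=$2
    case "$mtu" in
        ''|*[!0-9]*) echo "MTU must be a number: '$mtu'" >&2; return 1 ;;
    esac
    ${PODMAN:-podman} network create --opt "mtu=$mtu" "$name"
}
```

For example, PODMAN="sudo podman" create_mtu_network lowmtu 1460 followed by sudo podman run --rm -it --network lowmtu alpine ifconfig eth0 should then report the chosen MTU instead of 1500.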

podman network update could be an option, but I would rather not implement that because it would mean updating all running container interfaces with a new MTU, which is far more complex.

Luap99 avatar Sep 18 '23 11:09 Luap99

Unfortunately, there is currently no good way to recreate the default network, as e.g. podman network rm [-f] does not work on it.

Assuming you don't want to reboot, what is the way to update default network configuration?

fingon avatar Sep 18 '23 13:09 fingon

We do not auto-detect the MTU, and I don't think it makes sense to do so; it would be hard to guess what the right MTU is.

A conservative guess (the smallest non-loopback MTU) is pretty safe: an MTU that is too small leads at worst to performance problems, whereas, given how badly PMTU discovery works, an MTU that is too large leads to TCP not working.
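A rough sketch of that conservative guess, assuming the MTUs are read from sysfs. The function name and the optional directory argument are made up for illustration (the argument exists only so the logic can be tried against a fake tree). Note that it also counts virtual interfaces such as existing bridges, which a real implementation would probably want to skip.

```shell
# Print the smallest MTU among all non-loopback interfaces.
# $1: sysfs network directory (defaults to /sys/class/net)
smallest_mtu() {
    dir=${1:-/sys/class/net}
    min=""
    for f in "$dir"/*/mtu; do
        # Skip unreadable entries (also covers an unmatched glob).
        [ -r "$f" ] || continue
        iface=$(basename "$(dirname "$f")")
        if [ "$iface" = "lo" ]; then
            continue
        fi
        mtu=$(cat "$f")
        if [ -z "$min" ] || [ "$mtu" -lt "$min" ]; then
            min=$mtu
        fi
    done
    if [ -z "$min" ]; then
        return 1
    fi
    echo "$min"
}
```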

fingon avatar Sep 18 '23 13:09 fingon

The problem is that this has the potential to break existing workloads. If one interface has a lower MTU, this does not mean the user wants us to use this MTU. I mean, yes, it is probably fine in almost all cases, but there is still some risk.

And then when would we check what the lowest MTU is? Given we are not a daemon, this likely means looking it up for each container setup, which adds additional overhead I would like to avoid.


Assuming you don't want to reboot, what is the way to update default network configuration?

As written above: You can edit the default network config with something like this: https://github.com/containers/podman/discussions/13488#discussioncomment-2358133

Luap99 avatar Sep 18 '23 13:09 Luap99

And then when would we check what the lowest MTU is? Given we are not a daemon, this likely means looking it up for each container setup, which adds additional overhead I would like to avoid.

Probably just using the MTU of the interface the default route points to would be enough.
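A hedged sketch of that lookup without jq, assuming the usual text output of ip -4 route show default (lines such as "default via 10.0.0.1 dev eth0 metric 100"). Both function names are hypothetical. A route without a metric is treated as metric 0.

```shell
# Read "ip -4 route show default" output on stdin and print the device
# of the default route with the lowest metric.
default_route_dev() {
    awk '
    $1 == "default" {
        dev = ""; metric = 0
        for (i = 2; i < NF; i++) {
            if ($i == "dev") dev = $(i + 1)
            if ($i == "metric") metric = $(i + 1) + 0
        }
        if (dev != "" && (best == "" || metric < bestmetric)) {
            best = dev; bestmetric = metric
        }
    }
    END { if (best != "") print best; else exit 1 }'
}

# Print the MTU of the interface carrying the default route.
default_route_mtu() {
    dev=$(ip -4 route show default | default_route_dev) || return 1
    cat "/sys/class/net/$dev/mtu"
}
```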

fingon avatar Sep 18 '23 17:09 fingon

The following Makefile target does what I'd propose as a sane default (possibly doing min(1500, detected mtu) if worried about jumbo frames, PMTU and the internet):

# default podman network bridge mtu (1500) is too high on Google C3
# instance types' network which is 1460; this can be used to fix that
# ( do it before starting any nodes )
#
# This will also work the other way around - if jumboframes are
# supported, it will increase the MTU (but potentially introduce
# problems if PMTU is broken)
.PHONY: fix-mtu
fix-mtu:
	sudo mkdir -p /etc/containers/networks
	podman network inspect podman | \
		jq '.[] + {options: {mtu: "'`ip -j addr show dev \`ip -j -4 route show default | jq -r 'sort_by(.metric) | .[] | .dev' | head -1\` | jq -r '.[0].mtu'`'"}}' | \
		sudo tee /etc/containers/networks/podman.json
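The min(1500, detected mtu) variant mentioned above could look roughly like this as plain shell. This is only a sketch: clamp_mtu and write_default_network_mtu are hypothetical names, and the second function mirrors the Makefile target (same paths, same shallow replacement of the options key) but passes the MTU to jq via --arg instead of nested backticks.

```shell
# Print the smaller of the detected MTU and a cap (default 1500).
# $1: detected MTU, $2: optional cap
clamp_mtu() {
    detected=$1
    cap=${2:-1500}
    if [ "$detected" -lt "$cap" ]; then
        echo "$detected"
    else
        echo "$cap"
    fi
}

# Write the default network definition with the given MTU.
# Requires podman, jq and sudo; .[0] selects the single inspected network,
# like .[] in the Makefile version.
write_default_network_mtu() {
    mtu=$1
    sudo mkdir -p /etc/containers/networks
    podman network inspect podman |
        jq --arg mtu "$mtu" '.[0] + {options: {mtu: $mtu}}' |
        sudo tee /etc/containers/networks/podman.json
}
```

With a detected value of 9000 on a jumbo-frame host, write_default_network_mtu "$(clamp_mtu 9000)" would write 1500 rather than 9000.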

fingon avatar Sep 19 '23 09:09 fingon

Maybe we should add a default_mtu option to containers.conf; this MTU would then be used for all networks unless --opt mtu=... was explicitly set. Given we support drop-in config files with .d there, that would make changes like this much easier. I think that would be a good middle ground for the time being.

Also, it looks like for the macvlan and ipvlan drivers the kernel already uses the proper MTU from the connected host device instead of the default 1500. So this is likely a problem only affecting the bridge driver.

Luap99 avatar Sep 19 '23 11:09 Luap99

Yes, the other two drivers are fine (I started this issue when I noticed the default case not working while my macvlan was fine). A config file option could be a good compromise (and better than overriding only the default network, as I am doing in the Makefile above).

fingon avatar Sep 20 '23 10:09 fingon

sudo mkdir -p /etc/containers/networks
podman network inspect podman | \
	jq '.[] + {options: {mtu: "'`ip -j addr show dev \`ip -j -4 route show default | jq -r 'sort_by(.metric) | .[] | .dev' | head -1\` | jq -r '.[0].mtu'`'"}}' | \
	sudo tee /etc/containers/networks/podman.json

I was having exactly the same issue under KubeVirt: the MTU of the VM's interface is 1400 and the default podman network was 1500, so connections sometimes failed or hung.

I applied the workaround and it helped. What do you think about:

  1. Detecting the issue on podman run and warning with the possible fix (2), maybe suggesting the default route MTU? We know this is not always the right one, but it would be right in roughly 99% of cases.

  2. Providing an option to change the default network mtu?

I understand there is no golden path here, but MTU issues are hard to diagnose and frustrating to users because they are not evident. In most cases the result would be users running away because "it hangs", "it doesn't work", "it's unstable".
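One way to make such a mismatch visible is a ping with fragmentation prohibited, sized to fill the MTU exactly. The helper below is only a sketch (its name is made up) and assumes IPv4 without header options, i.e. 20 bytes of IP header plus 8 bytes of ICMP header.

```shell
# ICMP echo payload size that exactly fills an IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header (no options) minus 8 bytes of ICMP header.
# $1: MTU in bytes
ping_payload_for_mtu() {
    echo $(( $1 - 28 ))
}
```

From inside a container, something like ping -M do -c 3 -s "$(ping_payload_for_mtu 1500)" HOST (HOST being any reachable placeholder address, and -M do being the iputils option that sets the don't-fragment bit) would be expected to fail on a path limited to 1460, while the same command with 1460 should get through.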

What do you think?

mangelajo avatar Nov 12 '24 07:11 mangelajo

I think using the default route MTU certainly makes sense in almost all cases. In the meantime, the opposite use case was also reported (wanting a higher default MTU to improve performance, https://github.com/containers/podman/issues/23883); if we detect the default gateway MTU it will just work in both directions. Doing this by default would be a net benefit for users IMO; throwing warnings is always a bit odd, because if we know what is wrong, why don't we just fix it?

For cases where our logic might not choose the right default, we still have the per-network mtu option to override it. Adding a second default_mtu option may be useful as well, but it would need to take https://github.com/containers/podman/issues/23883#issuecomment-2338491674 into account, so the implementation is not as simple as I would like.


I don't have much time to work on this anytime soon, though. If you need/want this prioritized, please file an RFE in the Red Hat Jira against podman (linking the upstream issue is fine) so our team can consider this during planning.

Luap99 avatar Nov 12 '24 09:11 Luap99

@Luap99 what would be the project in Jira? (I see PODMAND but not sure if it's that one) thanks

mangelajo avatar Nov 12 '24 13:11 mangelajo

@mangelajo use the RHEL project and select podman as component

Luap99 avatar Nov 12 '24 14:11 Luap99

@mangelajo use the RHEL project and select podman as component

done https://issues.redhat.com/browse/RHEL-67298

Thank you!

mangelajo avatar Nov 13 '24 09:11 mangelajo

Popping in here to say that three senior engineers just spent four hours debugging this exact issue, eventually ending up here. The RHEL issue has been closed as "Cannot Reproduce" on the 26th of Feb without comment or elaboration, despite clear reproduction instructions in the report. (Well, they're clear to me anyway - @Luap99 is there more information that could be provided that would help the Red Hat folks reproduce?)

For what it's worth, @fingon's workaround works perfectly, but one has to debug as far as finding this issue page first :-/

meredydd avatar Mar 21 '25 18:03 meredydd