Default 1500 MTU on managed bridge networks isn't always appropriate
- Distribution: Ubuntu
- Distribution version: 22.04.3
- The output of "lxc info":
rajan_patel@landscapebeta:~$ lxc info
config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- vsock_api
- storage_volumes_all_projects
- projects_networks_restricted_access
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- cpu_hotplug
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
addresses: []
architectures:
- x86_64
- i686
certificate: |
-----BEGIN CERTIFICATE-----
MIICRjCCAcugAwIBAgIQJsIOsPR7+u14DifZdk3UIzAKBggqhkjOPQQDAzBKMRww
GgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMSowKAYDVQQDDCFyb290QGxhbmRz
Y2FwZWJldGEucmFqYW5wYXRlbC5jb20wHhcNMjQwMjAxMjI1MzMwWhcNMzQwMTI5
MjI1MzMwWjBKMRwwGgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMSowKAYDVQQD
DCFyb290QGxhbmRzY2FwZWJldGEucmFqYW5wYXRlbC5jb20wdjAQBgcqhkjOPQIB
BgUrgQQAIgNiAASQ01+gAzc6jOx6RjJXK5XHOagQ08AhxW5MqwT8/tpFILs+tzAh
UT8SLeEYkz8xCQ+DSq1XX3p/zUqFa2ThUapJLR4gB6EZOrHkOV0loZGrCz345Qs3
j7glW6RpMg7NZEqjdjB0MA4GA1UdDwEB/wQEAwIFoDATBgNVHSUEDDAKBggrBgEF
BQcDATAMBgNVHRMBAf8EAjAAMD8GA1UdEQQ4MDaCHGxhbmRzY2FwZWJldGEucmFq
YW5wYXRlbC5jb22HBH8AAAGHEAAAAAAAAAAAAAAAAAAAAAEwCgYIKoZIzj0EAwMD
aQAwZgIxAOYBuknbQggFBVkitjDP6p13RZ1cfReY9YOQ46/gOgkMad+EUua3f2c+
U4iuC3T0rQIxAPz7s6btNdRpdxQkGk0FjFCMox6JN2hkFfA9/ulpCjgoErZAkiq9
8EpV+qljsaWcWw==
-----END CERTIFICATE-----
certificate_fingerprint: 878f691c2fbf2b7263913be5961cde77acb8f9cc1e4b98d31b896305d2cf60f6
driver: lxc
driver_version: 5.0.2
firewall: nftables
kernel: Linux
kernel_architecture: x86_64
kernel_features:
idmapped_mounts: "true"
netnsid_getifaddrs: "true"
seccomp_listener: "true"
seccomp_listener_continue: "true"
shiftfs: "false"
uevent_injection: "true"
unpriv_fscaps: "true"
kernel_version: 6.2.0-1019-gcp
lxc_features:
cgroup2: "true"
core_scheduling: "true"
devpts_fd: "true"
idmapped_mounts_v2: "true"
mount_injection_file: "true"
network_gateway_device_route: "true"
network_ipvlan: "true"
network_l2proxy: "true"
network_phys_macvlan_mtu: "true"
network_veth_router: "true"
pidfd: "true"
seccomp_allow_deny_syntax: "true"
seccomp_notify: "true"
seccomp_proxy_send_notify_fd: "true"
os_name: Ubuntu
os_version: "22.04"
project: default
server: lxd
server_clustered: false
server_event_mode: full-mesh
server_name: landscapebeta.rajanpatel.com
server_pid: 2190
server_version: 5.0.2
storage: dir
storage_version: "1"
storage_supported_drivers:
- name: ceph
version: 15.2.17
remote: true
- name: cephfs
version: 15.2.17
remote: true
- name: cephobject
version: 15.2.17
remote: true
- name: dir
version: "1"
remote: false
- name: lvm
version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.47.0
remote: false
- name: zfs
version: 2.1.9-2ubuntu1.1
remote: false
- name: btrfs
version: 5.4.1
remote: false
Issue description
When running lxd init --auto, an lxdbr0 interface is created and all LXD containers get their network interfaces on lxdbr0. LXD sets the MTU on lxdbr0 to 1500 by default, which is sensible in most environments but causes problems or inefficiencies when the host machine's default network adapter uses a different MTU. On Oracle Cloud the network adapter is configured for jumbo frames and has an MTU of 9000; on Google Cloud the default adapter's MTU is 1460.
rajan_patel@landscapebeta:~$ lxd init --auto
rajan_patel@landscapebeta:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP group default qlen 1000
link/ether 42:01:0a:80:00:20 brd ff:ff:ff:ff:ff:ff
inet 10.128.0.32/32 metric 100 scope global dynamic ens4
valid_lft 86295sec preferred_lft 86295sec
inet6 fe80::4001:aff:fe80:20/64 scope link
valid_lft forever preferred_lft forever
3: lxdbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:16:3e:50:44:e6 brd ff:ff:ff:ff:ff:ff
inet 10.145.247.1/24 scope global lxdbr0
valid_lft forever preferred_lft forever
inet6 fd42:f052:4168:f2c2::1/64 scope global
valid_lft forever preferred_lft forever
As a result, LXD containers are unable to access the Internet. This breaks apt update, snap install, and effectively creates an airgapped container. Worse, the issue is completely undocumented, so figuring out the solution takes a lot of digging and a working understanding of networking.
Anybody using LXD on Google Cloud is adversely impacted by this. Google has a page dedicated to explaining the issues caused by mismatched MTUs here: https://cloud.google.com/vpc/docs/mtu
Steps to reproduce
- Step one: lxd init --auto
- Step two: ip a
- Step three: observe that the lxdbr0 interface has an MTU of 1500, regardless of the MTU configured on the physical NIC the host routes out through. A container-level reproducer is sketched below.
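A minimal container-level reproducer, assuming a host whose default NIC has an MTU below 1500 (as on Google Cloud); the container name is illustrative and the failure mode is the one reported above:
# launch a container attached to lxdbr0 (the default profile's bridge)
lxc launch ubuntu:22.04 mtu-test
# small packets (DNS lookups, TCP handshakes) typically still get through
lxc exec mtu-test -- ping -c 1 archive.ubuntu.com
# transfers that need full-size 1500-byte packets stall, so on an affected
# host this reportedly hangs or times out
lxc exec mtu-test -- apt update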
Information to attach
lxd init --auto should create the lxdbr0 interface with an MTU that matches the default network adapter on the host machine. This can be achieved manually today with two commands:
# identify the default network adapter on the machine; the next command checks the MTU configuration on this adapter
read -r INTERFACE < <(ip route | awk '$1=="default"{print $5; exit}')
# if your network uses jumbo frames (MTU 9000), or an MTU smaller than 1500 (as found on Google Cloud VMs), use a matching MTU on lxdbr0 (which is created by lxd init --auto)
lxc network set lxdbr0 bridge.mtu="$(ip link show "$INTERFACE" | awk '/mtu/ {print $5; exit}')"
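An equivalent sketch that reads the MTU from sysfs instead of parsing ip link output, assuming the host has a single default route:
# /sys/class/net/<iface>/mtu exposes the MTU directly, with no awk parsing
INTERFACE=$(ip route | awk '$1=="default"{print $5; exit}')
lxc network set lxdbr0 bridge.mtu="$(cat "/sys/class/net/$INTERFACE/mtu")"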
When using LXD as a stepping stone for trying other Canonical software, we have to explain these MTU pitfalls. Adding these steps to a "quickstart" or "getting started" how-to makes LXD look unrefined and unnecessarily complex: https://gist.github.com/rajannpatel/cdc43b30a863824b139fb7a18f2e99a5
> This breaks apt update, snap install, and effectively creates an airgapped container.
This is interesting, because most home internet connections use PPPoE tunneling, which also reduces the MTU towards the internet to less than 1500, and yet the internal network still commonly uses a 1500 MTU on all devices and those internet connections still work.
Normally this is because the ISP's router or network performs TCP MSS clamping:
https://www.cloudflare.com/en-gb/learning/network-layer/what-is-mss/
So presumably in these environments no such clamping is being applied.
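For reference, this is roughly what manual clamping looks like on a Linux host using nftables (which this host runs, per firewall: nftables above); the table and chain names here are illustrative, and LXD does not install these rules itself:
# clamp the MSS of forwarded TCP SYN packets to the route's path MTU
nft add table inet mssclamp
nft add chain inet mssclamp forward '{ type filter hook forward priority mangle ; }'
nft add rule inet mssclamp forward tcp flags syn tcp option maxseg size set rt mtu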
Additionally, it may also be because these provider networks (or your particular local firewall setup) are blocking Path MTU Discovery (PMTUD):
https://en.wikipedia.org/wiki/Path_MTU_Discovery
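A quick way to check whether PMTUD is working from inside a container is to ping with the Don't Fragment bit set at sizes around the suspected path MTU (the destination is illustrative):
# 1432 bytes of ICMP payload + 28 bytes of headers = 1460 on the wire
ping -M do -s 1432 -c 1 archive.ubuntu.com
# 1472 + 28 = 1500 on the wire; with working PMTUD this reports
# "Frag needed" instead of silently timing out
ping -M do -s 1472 -c 1 archive.ubuntu.com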
In principle your proposal sounds like a good idea; I will consider whether making this change would introduce any downsides or possible regressions.
@rajannpatel interestingly the GCP doc you linked to says:
> For TCP SYN and SYN-ACK packets, Google Cloud performs MSS clamping if necessary, changing the MSS to ensure packets fit within the MTU.
So this makes me wonder why you are experiencing these issues?
Can you advise further what the specific problem is? Is it UDP traffic (DNS perhaps) that is causing the problem, given that MSS clamping only affects TCP?
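One way to test that theory from inside a container would be something like the following sketch; the resolver and query are illustrative, and whether a given response actually exceeds the path MTU depends on the zone being queried:
# a small response over UDP should work regardless of MTU
dig @8.8.8.8 ubuntu.com +short
# advertise a large EDNS buffer so the reply can arrive as a single UDP
# datagram bigger than the path MTU; a timeout here, while the query
# above succeeds, would point at MTU-sensitive UDP traffic
dig @8.8.8.8 . DNSKEY +bufsize=4096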
Is it possible to provide reproducer steps using just LXD without Landscape, i.e. lxc launch ... and then lxc exec with commands that demonstrate the problem?