
Default 1500 MTU on managed bridge networks isn't always appropriate

Open rajannpatel opened this issue 1 year ago • 3 comments

  • Distribution: Ubuntu
  • Distribution version: 22.04.3
  • The output of "lxc info" or if that fails:
rajan_patel@landscapebeta:~$ lxc info
config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- vsock_api
- storage_volumes_all_projects
- projects_networks_restricted_access
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- cpu_hotplug
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIICRjCCAcugAwIBAgIQJsIOsPR7+u14DifZdk3UIzAKBggqhkjOPQQDAzBKMRww
    GgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMSowKAYDVQQDDCFyb290QGxhbmRz
    Y2FwZWJldGEucmFqYW5wYXRlbC5jb20wHhcNMjQwMjAxMjI1MzMwWhcNMzQwMTI5
    MjI1MzMwWjBKMRwwGgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMSowKAYDVQQD
    DCFyb290QGxhbmRzY2FwZWJldGEucmFqYW5wYXRlbC5jb20wdjAQBgcqhkjOPQIB
    BgUrgQQAIgNiAASQ01+gAzc6jOx6RjJXK5XHOagQ08AhxW5MqwT8/tpFILs+tzAh
    UT8SLeEYkz8xCQ+DSq1XX3p/zUqFa2ThUapJLR4gB6EZOrHkOV0loZGrCz345Qs3
    j7glW6RpMg7NZEqjdjB0MA4GA1UdDwEB/wQEAwIFoDATBgNVHSUEDDAKBggrBgEF
    BQcDATAMBgNVHRMBAf8EAjAAMD8GA1UdEQQ4MDaCHGxhbmRzY2FwZWJldGEucmFq
    YW5wYXRlbC5jb22HBH8AAAGHEAAAAAAAAAAAAAAAAAAAAAEwCgYIKoZIzj0EAwMD
    aQAwZgIxAOYBuknbQggFBVkitjDP6p13RZ1cfReY9YOQ46/gOgkMad+EUua3f2c+
    U4iuC3T0rQIxAPz7s6btNdRpdxQkGk0FjFCMox6JN2hkFfA9/ulpCjgoErZAkiq9
    8EpV+qljsaWcWw==
    -----END CERTIFICATE-----
  certificate_fingerprint: 878f691c2fbf2b7263913be5961cde77acb8f9cc1e4b98d31b896305d2cf60f6
  driver: lxc
  driver_version: 5.0.2
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 6.2.0-1019-gcp
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "22.04"
  project: default
  server: lxd
  server_clustered: false
  server_event_mode: full-mesh
  server_name: landscapebeta.rajanpatel.com
  server_pid: 2190
  server_version: 5.0.2
  storage: dir
  storage_version: "1"
  storage_supported_drivers:
  - name: ceph
    version: 15.2.17
    remote: true
  - name: cephfs
    version: 15.2.17
    remote: true
  - name: cephobject
    version: 15.2.17
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.47.0
    remote: false
  - name: zfs
    version: 2.1.9-2ubuntu1.1
    remote: false
  - name: btrfs
    version: 5.4.1
    remote: false

Issue description

When running lxd init --auto, a lxdbr0 interface is created and all LXD containers get their network interfaces on lxdbr0. LXD hardcodes the MTU on lxdbr0 at 1500, which is a sensible default, but it causes problems or inefficiencies when the MTU of the host machine’s default network adapter is different. On Oracle Cloud the network adapter is configured for jumbo frames and has an MTU of 9000; on Google Cloud the default adapter’s MTU is 1460.

rajan_patel@landscapebeta:~$ lxd init --auto
rajan_patel@landscapebeta:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:80:00:20 brd ff:ff:ff:ff:ff:ff
    inet 10.128.0.32/32 metric 100 scope global dynamic ens4
       valid_lft 86295sec preferred_lft 86295sec
    inet6 fe80::4001:aff:fe80:20/64 scope link 
       valid_lft forever preferred_lft forever
3: lxdbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 00:16:3e:50:44:e6 brd ff:ff:ff:ff:ff:ff
    inet 10.145.247.1/24 scope global lxdbr0
       valid_lft forever preferred_lft forever
    inet6 fd42:f052:4168:f2c2::1/64 scope global 
       valid_lft forever preferred_lft forever

As a result, LXD containers are unable to reach the Internet. This breaks apt update and snap install, effectively creating an air-gapped container. Worse, the issue is completely undocumented, so figuring out the solution takes a lot of digging and a working understanding of networking.

Anybody using LXD on Google Cloud is adversely impacted by this. Google has a page dedicated to explaining the issues caused by mismatched MTUs: https://cloud.google.com/vpc/docs/mtu

Steps to reproduce

  1. Step one: lxd init --auto
  2. Step two: ip a
  3. Step three: observe that the lxdbr0 interface has an MTU of 1500, regardless of the MTU configured on the host's default network adapter (see the quick check below).
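
A quick way to compare the two MTUs from step three (a sketch; it assumes ens4 is the host's default adapter, as in the output above):

# print the host NIC's and lxdbr0's MTUs side by side
ip -o link show | awk '/ens4:|lxdbr0:/ {print $2, $4, $5}'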

Information to attach

lxd init --auto should create the lxdbr0 interface with an MTU that matches the host machine's default network adapter. Today this has to be done manually with two commands:

# identify the default network adapter on the machine; the next command checks the MTU configured on this adapter
read -r INTERFACE < <(ip route | awk '$1=="default"{print $5; exit}')

# if your network uses jumbo frames (MTU 9000), or an MTU smaller than 1500 (as found on Google Cloud VMs), use a matching MTU on lxdbr0 (which is created by lxd init --auto)
lxc network set lxdbr0 bridge.mtu=$(ip link show $INTERFACE | awk '/mtu/ {print $5}')
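
Alternatively, the bridge can be given the right MTU at creation time by replacing lxd init --auto with a preseed. A rough sketch (the 1460 value is Google Cloud's default and would need to match your own uplink; the storage and profile sections mirror what --auto creates with the dir driver):

cat <<'EOF' | lxd init --preseed
networks:
- name: lxdbr0
  type: bridge
  config:
    ipv4.address: auto
    ipv6.address: auto
    bridge.mtu: "1460"
storage_pools:
- name: default
  driver: dir
profiles:
- name: default
  devices:
    root:
      path: /
      pool: default
      type: disk
    eth0:
      name: eth0
      network: lxdbr0
      type: nic
EOF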

When using LXD as a stepping stone for trying other Canonical software, we have to explain these MTU pitfalls. Adding these steps to a “quickstart” or “getting started” how-to makes LXD look unrefined and unnecessarily complex: https://gist.github.com/rajannpatel/cdc43b30a863824b139fb7a18f2e99a5

rajannpatel avatar Feb 01 '24 23:02 rajannpatel

This breaks apt update, snap install, and effectively creates an airgapped container.

This is interesting, because most home internet connections use PPPoE tunneling, which also reduces the MTU towards the internet to less than 1500, and yet the internal network still commonly uses a 1500 MTU on all devices and those internet connections keep working.

Normally this is because the ISP's router or network performs TCP MSS clamping:

https://www.cloudflare.com/en-gb/learning/network-layer/what-is-mss/

So presumably in these environments no such clamping is being applied.
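
For reference, clamping can also be applied on the LXD host itself. A rough nftables sketch (arbitrary table/chain names; it assumes the host is forwarding the containers' traffic):

# clamp the MSS of forwarded TCP SYNs to the path MTU
nft add table inet mssclamp
nft 'add chain inet mssclamp forward { type filter hook forward priority -150; policy accept; }'
nft add rule inet mssclamp forward tcp flags syn tcp option maxseg size set rt mtu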

Additionally, it may also be that these provider networks (or your particular local firewall setup) are blocking PMTU discovery:

https://en.wikipedia.org/wiki/Path_MTU_Discovery

In principle your proposal sounds like a good idea; I will consider whether there are any downsides or possible regressions that making this change would introduce.

tomponline avatar Feb 02 '24 11:02 tomponline

@rajannpatel interestingly the GCP doc you linked to says:

For TCP SYN and SYN-ACK packets, Google Cloud performs MSS clamping if necessary, changing the MSS to ensure packets fit within the MTU.

So this makes me wonder why you are experiencing these issues?

Can you advise further on what the specific problem is? Is it UDP traffic (DNS, perhaps) that is causing the problem, given that MSS clamping only affects TCP?

tomponline avatar Feb 02 '24 11:02 tomponline

Is it possible to provide reproducer steps using just LXD without Landscape, i.e. lxc launch ... and then lxc exec with commands that demonstrate the problem?
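
For example, something along these lines (an untested sketch; the image alias and ping target are only examples):

# launch a test container on the default lxdbr0 bridge
lxc launch ubuntu:22.04 mtu-test
# a full-size 1500-byte packet (1472-byte payload + 28 bytes of headers) with fragmentation forbidden;
# it fits the container's 1500 MTU but exceeds a 1460-byte uplink, so it should not get through intact
lxc exec mtu-test -- ping -c 3 -M do -s 1472 1.1.1.1
# and a real-world check
lxc exec mtu-test -- apt-get update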

tomponline avatar Feb 02 '24 11:02 tomponline