Network leak: Persistent accumulation of ESTAB TCP connections on port 8443 after lxc copy --refresh between hosts with ZFS

Open bvasiliev opened this issue 1 month ago • 2 comments

Please confirm

  • [x] I have searched existing issues to check if an issue already exists for the bug I encountered.

Distribution

Ubuntu

Distribution version

24.04

Output of "snap list --all lxd core20 core22 core24 snapd"

vm-host-2:~$ snap list --all lxd core20 core22 core24 snapd
Name    Version      Rev    Tracking       Publisher   Notes
core22  20250923     2139   latest/stable  canonical✓  base,disabled
core22  20251009     2163   latest/stable  canonical✓  base
core24  20250829     1196   latest/stable  canonical✓  base,disabled
core24  20251001     1225   latest/stable  canonical✓  base
lxd     6.5-22da890  35616  latest/stable  canonical✓  disabled,in-cohort
lxd     6.5-ccdfb39  36020  latest/stable  canonical✓  in-cohort
snapd   2.71         25202  latest/stable  canonical✓  snapd,disabled
snapd   2.72         25577  latest/stable  canonical✓  snapd

Output of "lxc info" or system info if it fails

vm-host-2:~$ lxc info
config:
  cluster.https_address: 192.168.1.250:8443
  cluster.images_minimal_replica: "1"
  core.https_address: 0.0.0.0:8443
  images.auto_update_cached: "false"
  images.auto_update_interval: "0"
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- storage_api_remote_volume_snapshot_copy
- zfs_delegate
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- init_preseed_storage_volumes
- metrics_instances_count
- server_instance_type_info
- resources_disk_mounted
- server_version_lts
- oidc_groups_claim
- loki_config_instance
- storage_volatile_uuid
- import_instance_devices
- instances_uefi_vars
- instances_migration_stateful
- container_syscall_filtering_allow_deny_syntax
- access_management
- vm_disk_io_limits
- storage_volumes_all
- instances_files_modify_permissions
- image_restriction_nesting
- container_syscall_intercept_finit_module
- device_usb_serial
- network_allocate_external_ips
- explicit_trust_token
- shared_custom_block_volumes
- instance_import_conversion
- instance_create_start
- instance_protection_start
- devlxd_images_vm
- disk_io_bus_virtio_blk
- metrics_api_requests
- projects_limits_disk_pool
- ubuntu_pro_guest_attach
- metadata_configuration_entity_types
- access_management_tls
- network_allocations_ovn_uplink
- network_ovn_uplink_vlan
- state_logical_cpus
- vm_limits_cpu_pin_strategy
- gpu_cdi
- images_all_projects
- metadata_configuration_scope
- unix_device_hotplug_ownership_inherit
- unix_device_hotplug_subsystem_device_option
- storage_ceph_osd_pool_size
- network_get_target
- network_zones_all_projects
- vm_root_volume_attachment
- projects_limits_uplink_ips
- entities_with_entitlements
- profiles_all_projects
- storage_driver_powerflex
- storage_driver_pure
- cloud_init_ssh_keys
- oidc_scopes
- project_default_network_and_storage
- client_cert_presence
- clustering_groups_used_by
- container_bpf_delegation
- override_snapshot_profiles_on_copy
- resources_device_fs_uuid
- backup_metadata_version
- storage_buckets_all_projects
- network_acls_all_projects
- networks_all_projects
- clustering_restore_skip_mode
- disk_io_threads_virtiofsd
- oidc_client_secret
- pci_hotplug
- device_patch_removal
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
client_certificate: false
auth_user_name: <>
auth_user_method: unix
environment:
  addresses:
  - 192.168.1.250:8443
  - 172.16.16.250:8443
  architectures:
  - x86_64
  - i686
  backup_metadata_version_range:
  - 1
  - 2
  certificate: |
    -----BEGIN CERTIFICATE-----
    -----END CERTIFICATE-----
  certificate_fingerprint: <>
  driver: lxc | qemu
  driver_version: 6.0.4 | 8.2.2
  instance_types:
  - container
  - virtual-machine
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    bpf_token: "false"
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_binfmt: "true"
    unpriv_fscaps: "true"
  kernel_version: 6.8.0-87-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "24.04"
  project: default
  server: lxd
  server_clustered: true
  server_event_mode: full-mesh
  server_name: vm-host-2
  server_pid: 298456
  server_version: "6.5"
  server_lts: false
  storage: zfs
  storage_version: 2.2.2-0ubuntu9.4
  storage_supported_drivers:
  - name: btrfs
    version: 6.6.3
    remote: false
  - name: ceph
    version: 19.2.1
    remote: true
  - name: powerflex
    version: 2.8 (nvme-cli)
    remote: true
  - name: pure
    version: 2.1.9 (iscsiadm) / 2.8 (nvme-cli)
    remote: true
  - name: zfs
    version: 2.2.2-0ubuntu9.4
    remote: false
  - name: cephfs
    version: 19.2.1
    remote: true
  - name: cephobject
    version: 19.2.1
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.16(2) (2022-05-18) / 1.02.185 (2022-05-18) / 4.48.0
    remote: false

Issue description

In an LXD cluster using ZFS storage without shared block storage, running lxc copy ... --refresh for incremental container replication leaves extra established (ESTAB) TCP connections open on the service port 8443, and they accumulate over time. This resource leak causes a permanent, cumulative increase in background network traffic between cluster members, growing by roughly 10 KB/s or more after each copy operation. With prolonged uptime or frequent copying, the accumulation eventually triggers cluster synchronization warnings.

My cluster interconnect interface bandwidth: Image

Sockstat: Image

The drops in the graphs are caused by LXD reloads during troubleshooting. The graphs show low values, but previously, after several months of uptime, I have seen tens of megabits.

How it looks:

vm-host-2:~$ ss -tanp | grep 8443 | grep -c ESTAB
75
vm-host-2:~$ lxc copy log-primus log-secundus --verbose --stateless --target vm-host-1 --refresh
vm-host-2:~$ ss -tanp | grep 8443 | grep -c ESTAB
76
vm-host-2:~$ lxc copy log-primus log-secundus --verbose --stateless --target vm-host-1 --refresh
vm-host-2:~$ ss -tanp | grep 8443 | grep -c ESTAB
78
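
To see which cluster member the extra connections belong to, the ss output can be grouped by peer address. The following is only a sketch: the function name estab_by_peer is hypothetical, and it assumes the usual `ss -tan` column layout (state, queues, local address:port, peer address:port). Unlike a plain grep for 8443, it matches the port field exactly, so ports such as 18443 or PIDs containing 8443 are not counted.

```shell
#!/usr/bin/env bash
# estab_by_peer (hypothetical helper): reads `ss -tan` style output on stdin
# and prints "<count> <peer address>" for ESTAB connections whose local or
# peer port equals the given port (default 8443), busiest peer first.
estab_by_peer() {
    local port="${1:-8443}"
    awk -v port="$port" '
        $1 == "ESTAB" {
            # Field 4 is local address:port, field 5 is peer address:port.
            nl = split($4, l, ":"); np = split($5, p, ":")
            if (l[nl] == port || p[np] == port) {
                peer = $5
                sub(/:[^:]*$/, "", peer)   # strip the port, keep the address
                count[peer]++
            }
        }
        END { for (peer in count) print count[peer], peer }
    ' | sort -k1,1nr -k2,2
}
```

Usage on a cluster member: `ss -tan | estab_by_peer 8443`. Running it before and after a copy should show whether the new connections all go to the copy target.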

lxc monitor during operations: lxc.monitor.debug.log

Related configs: lxc storage show zpool.txt, lxc config show instance.txt

Steps to reproduce

  1. Create a ZFS storage pool in the cluster, backed by local storage on each cluster host.
  2. Create a container instance using this ZFS pool on a source node.
  3. Create a snapshot of the instance (manually or scheduled).
  4. Check the current count of established TCP connections on the source host: ss -tanp | grep 8443 | grep -c ESTAB
  5. Run the incremental copy command to a target node: lxc copy <source-container> <target-container> --verbose --stateless --target <target-node> --refresh
  6. Check the established connections count again to confirm the increase: ss -tanp | grep 8443 | grep -c ESTAB
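
Steps 4 to 6 can be wrapped into one helper, sketched below. The function names estab_count and estab_delta are hypothetical. The sketch assumes an iproute2 ss that supports -H (no header) and state/port filter expressions. The reported delta also includes any unrelated connections opened or closed while the command runs, so repeat it a few times before drawing conclusions.

```shell
#!/usr/bin/env bash
# Print the number of established TCP connections whose local or peer port is $1.
estab_count() {
    ss -Htan state established "( sport = :$1 or dport = :$1 )" | wc -l | tr -d ' '
}

# Run a command and report the connection count before and after it.
# Usage: estab_delta PORT COMMAND [ARGS...]
estab_delta() {
    local port="$1"; shift
    local before after
    before=$(estab_count "$port")
    "$@" || return $?          # stop here if the command itself fails
    after=$(estab_count "$port")
    echo "before=$before after=$after delta=$((after - before))"
}
```

Example, using the command from this report: `estab_delta 8443 lxc copy log-primus log-secundus --verbose --stateless --target vm-host-1 --refresh`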

Information to attach

  • [ ] Any relevant kernel output (dmesg)
  • [ ] Instance log (lxc info NAME --show-log)
  • [x] Instance configuration (lxc config show NAME --expanded)
  • [ ] Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)
  • [ ] Output of the client with --debug
  • [x] Output of the daemon with --debug (or use lxc monitor while reproducing the issue)

bvasiliev, Nov 18 '25 07:11

Would it be possible for you to test (on a non-production system) whether you're seeing this on the latest/edge channel, as that will become LXD 6.6 soon? Thanks!

tomponline, Nov 18 '25 10:11

The issue is reproducible on a fresh installation of both latest/stable and latest/edge.

vm-test-1:~# snap list --all lxd
Name  Version      Rev    Tracking     Publisher   Notes
lxd   6.5-ccdfb39  36020  latest/edge  canonical✓  disabled
lxd   git-7c2b109  36693  latest/edge  canonical✓  -

vm-test-1:~# ss -tanp | grep 8443 | grep -c ESTAB
11

vm-test-1:~# lxc copy test-primus test-secundus --verbose --stateless --target vm-test-2 --refresh
vm-test-1:~# ss -tanp | grep 8443 | grep -c ESTAB
13

vm-test-1:~# lxc copy test-primus test-secundus --verbose --stateless --target vm-test-2 --refresh
vm-test-1:~# ss -tanp | grep 8443 | grep -c ESTAB
15

# after 10 mins
vm-test-1:~# ss -tanp | grep 8443 | grep -c ESTAB
14

bvasiliev, Nov 19 '25 11:11