Container stop/restart hanging after refresh from LXD 5.0.3 to LXD 5.21.1

Open mszywala opened this issue 1 year ago • 13 comments

Required information

  • Distribution: Ubuntu
  • Distribution version: 22.04
  • The output of "snap list --all lxd core20 core22 core24 snapd":
core20  20230801        2015   latest/stable  canonical✓  base,disabled
core20  20240227        2264   latest/stable  canonical✓  base
core22  20240111        1122   latest/stable  canonical✓  base
lxd     5.0.3-babaaf8   27948  5.21/stable    canonical✓  disabled
lxd     5.21.1-3f3c2a9  28156  5.21/stable    canonical✓  -
snapd   2.60.2          19993  latest/stable  canonical✓  snapd,disabled
snapd   2.61.2          21184  latest/stable  canonical✓  snapd
  • The output of "lxc info" or if that fails:
    • Kernel version: 5.15.0-102-generic
    • LXC version: 5.0.3 / 5.21.1 LTS
    • LXD version: 5.0.3 / 5.21.1 LTS
    • Storage backend in use: dir

Output of "lxc info"

config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- storage_api_remote_volume_snapshot_copy
- zfs_delegate
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- init_preseed_storage_volumes
- metrics_instances_count
- server_instance_type_info
- resources_disk_mounted
- server_version_lts
- oidc_groups_claim
- loki_config_instance
- storage_volatile_uuid
- import_instance_devices
- instances_uefi_vars
- instances_migration_stateful
- container_syscall_filtering_allow_deny_syntax
- access_management
- vm_disk_io_limits
- storage_volumes_all
- instances_files_modify_permissions
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: root
auth_user_method: unix
environment:
 addresses: []
 architectures:
 - x86_64
 - i686
 certificate: |
   -----BEGIN CERTIFICATE-----
   CENSORED
   -----END CERTIFICATE-----
 certificate_fingerprint: CENSORED
 driver: lxc
 driver_version: 6.0.0
 instance_types:
 - container
 firewall: nftables
 kernel: Linux
 kernel_architecture: x86_64
 kernel_features:
   idmapped_mounts: "true"
   netnsid_getifaddrs: "true"
   seccomp_listener: "true"
   seccomp_listener_continue: "true"
   uevent_injection: "true"
   unpriv_fscaps: "true"
 kernel_version: 5.15.0-102-generic
 lxc_features:
   cgroup2: "true"
   core_scheduling: "true"
   devpts_fd: "true"
   idmapped_mounts_v2: "true"
   mount_injection_file: "true"
   network_gateway_device_route: "true"
   network_ipvlan: "true"
   network_l2proxy: "true"
   network_phys_macvlan_mtu: "true"
   network_veth_router: "true"
   pidfd: "true"
   seccomp_allow_deny_syntax: "true"
   seccomp_notify: "true"
   seccomp_proxy_send_notify_fd: "true"
 os_name: Ubuntu
 os_version: "22.04"
 project: default
 server: lxd
 server_clustered: false
 server_event_mode: full-mesh
 server_name: ubuntu-jammy
 server_pid: 6383
 server_version: 5.21.1
 server_lts: true
 storage: dir
 storage_version: "1"
 storage_supported_drivers:
 - name: dir
   version: "1"
   remote: false
 - name: lvm
   version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.45.0
   remote: false
 - name: powerflex
   version: 1.16 (nvme-cli)
   remote: true
 - name: zfs
   version: 2.1.5-1ubuntu6~22.04.3
   remote: false
 - name: btrfs
   version: 5.16.2
   remote: false
 - name: ceph
   version: 17.2.7
   remote: true
 - name: cephfs
   version: 17.2.7
   remote: true
 - name: cephobject
   version: 17.2.7
   remote: true

Issue description

We're in the process of upgrading our servers from LXD 5.0.3 to LXD 5.21.1 LTS. After the refresh, I ran into an issue where the container stop and restart operations hang. I only encountered the issue on the first stop/restart after `snap refresh lxd --channel=5.21/stable`. It makes no difference whether `-f` is used to stop or restart the instance.

After a while, the instance stops with the following error:

Error: Failed shutting down instance, status is "Running": context deadline exceeded
Try `lxc info --show-log u` for more info

Additionally, we are using a routed NIC for our container instances.

Because the instance does not stop properly, stale routes are left behind, the instance's veth device ends up in the "LOWERLAYERDOWN" state, and the instance refuses to start back up. When we do manage to shut down an instance, we have to take down its veth interface with `ip link set vethxxxx down` and clean up the remaining routes on the host system.
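
A minimal sketch of that manual cleanup, assuming the stale interface is named vethXXXXXXXX and that some routes still reference it (both are placeholders):

# Take down the stale host-side veth left behind by the hung instance
ip link set vethXXXXXXXX down
# List any routes still pointing at it, then delete each one
ip route show dev vethXXXXXXXX
ip route del <route> dev vethXXXXXXXX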

Expectation: `lxc stop` and `lxc restart` work properly after a refresh.

Steps to reproduce

  1. Ubuntu 22.04 with LXD 5.0.3
  2. Create a container
  3. Perform a refresh: `snap refresh lxd --channel=5.21/stable`
  4. Stop/restart the container: `lxc stop <containerName>` (see the consolidated sketch below)
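
A consolidated sketch of these steps, assuming a hypothetical container named c1:

# Ubuntu 22.04 host already running LXD 5.0.3
lxc launch ubuntu:22.04 c1
snap refresh lxd --channel=5.21/stable
lxc stop c1    # hangs, then eventually fails with "context deadline exceeded"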

Information to attach

  • [X] Any relevant kernel output (dmesg): nothing relevant here
  • [X] Container log (lxc info NAME --show-log)
Name: u
Status: RUNNING
Type: container
Architecture: x86_64
PID: -1
Created: 2024/04/15 09:15 UTC
Last Used: 2024/04/15 12:07 UTC

Resources:
 Processes: 0
 CPU usage:
   CPU usage (in seconds): 0

Log:

lxc u 20240415120709.358 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc u 20240415120709.358 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc u 20240415120709.359 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc u 20240415120709.359 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc u 20240415120831.600 ERROR    conf - ../src/src/lxc/conf.c:run_buffer:322 - Script exited with status 1
lxc u 20240415120831.601 ERROR    start - ../src/src/lxc/start.c:lxc_end:944 - Failed to run "lxc.hook.stop" hook
lxc u 20240415120832.126 ERROR    conf - ../src/src/lxc/conf.c:run_buffer:322 - Script exited with status 1
lxc u 20240415120832.126 ERROR    start - ../src/src/lxc/start.c:lxc_end:985 - Failed to run lxc.hook.post-stop for container "u"

  • [X] Container configuration (lxc config show NAME --expanded)
architecture: x86_64
config:
 image.architecture: amd64
 image.description: ubuntu 22.04 LTS amd64 (release) (20240319)
 image.label: release
 image.os: ubuntu
 image.release: jammy
 image.serial: "20240319"
 image.type: squashfs
 image.version: "22.04"
 volatile.base_image: ca08f2f359079cdb7fb2d4efa882df5974304d7ae9bb13cbac4904a1347626be
 volatile.cloud-init.instance-id: fd955459-ecfc-4d00-932e-896aa690f8f5
 volatile.idmap.base: "0"
 volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
 volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
 volatile.last_state.idmap: '[]'
 volatile.last_state.power: RUNNING
 volatile.uuid: 72c6dd51-b44c-42ef-ac72-4064385a644e
 volatile.uuid.generation: 72c6dd51-b44c-42ef-ac72-4064385a644e
devices:
 root:
   path: /
   pool: default
   type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
  • [X] Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log): nothing relevant in the daemon log
  • [X] Output of the client with --debug
  • [X] Output of the daemon with --debug (alternatively, output of `lxc monitor` while reproducing the issue): the output below was captured after refreshing to LXD 5.21.1 LTS while trying to stop the instance (restart reproduces the same behaviour)
location: none
metadata:
 context:
   id: c18351eb-9c2f-472f-9da1-956ccf8e8515
   local: /var/snap/lxd/common/lxd/unix.socket
   remote: '@'
 level: debug
 message: Event listener server handler started
timestamp: "2024-04-15T12:08:22.460119443Z"
type: logging


location: none
metadata:
 context:
   ip: '@'
   method: GET
   protocol: unix
   url: /1.0
   username: root
 level: debug
 message: Handling API request
timestamp: "2024-04-15T12:08:28.290919569Z"
type: logging


location: none
metadata:
 context:
   ip: '@'
   method: GET
   protocol: unix
   url: /1.0/events
   username: root
 level: debug
 message: Handling API request
timestamp: "2024-04-15T12:08:28.292459798Z"
type: logging


location: none
metadata:
 context:
   id: 7206ca2e-6647-493d-acb0-9f2b739ad80f
   local: /var/snap/lxd/common/lxd/unix.socket
   remote: '@'
 level: debug
 message: Event listener server handler started
timestamp: "2024-04-15T12:08:28.292742589Z"
type: logging


location: none
metadata:
 context:
   ip: '@'
   method: PUT
   protocol: unix
   url: /1.0/instances/u/state
   username: root
 level: debug
 message: Handling API request
timestamp: "2024-04-15T12:08:28.293482232Z"
type: logging


location: none
metadata:
 class: task
 created_at: "2024-04-15T12:08:28.29445839Z"
 description: Stopping instance
 err: ""
 id: f74b35a7-28f7-4da7-a3a2-945f79b9d29a
 location: none
 may_cancel: false
 metadata: null
 resources:
   instances:
   - /1.0/instances/u
 status: Pending
 status_code: 105
 updated_at: "2024-04-15T12:08:28.29445839Z"
project: default
timestamp: "2024-04-15T12:08:28.29604009Z"
type: operation


location: none
metadata:
 context:
   class: task
   description: Stopping instance
   operation: f74b35a7-28f7-4da7-a3a2-945f79b9d29a
   project: default
 level: debug
 message: New operation
timestamp: "2024-04-15T12:08:28.296017929Z"
type: logging


location: none
metadata:
 class: task
 created_at: "2024-04-15T12:08:28.29445839Z"
 description: Stopping instance
 err: ""
 id: f74b35a7-28f7-4da7-a3a2-945f79b9d29a
 location: none
 may_cancel: false
 metadata: null
 resources:
   instances:
   - /1.0/instances/u
 status: Running
 status_code: 103
 updated_at: "2024-04-15T12:08:28.29445839Z"
project: default
timestamp: "2024-04-15T12:08:28.296054612Z"
type: operation


location: none
metadata:
 context:
   class: task
   description: Stopping instance
   operation: f74b35a7-28f7-4da7-a3a2-945f79b9d29a
   project: default
 level: debug
 message: Started operation
timestamp: "2024-04-15T12:08:28.29604927Z"
type: logging


location: none
metadata:
 context:
   ip: '@'
   method: GET
   protocol: unix
   url: /1.0/operations/f74b35a7-28f7-4da7-a3a2-945f79b9d29a
   username: root
 level: debug
 message: Handling API request
timestamp: "2024-04-15T12:08:28.297029355Z"
type: logging


location: none
metadata:
 context:
   instance: u
   instanceType: container
   project: default
   timeout: 10m0s
 level: debug
 message: Shutdown started
timestamp: "2024-04-15T12:08:28.305543127Z"
type: logging


location: none
metadata:
 context:
   action: shutdown
   created: 2024-04-15 09:15:51.927293086 +0000 UTC
   ephemeral: "false"
   instance: u
   instanceType: container
   project: default
   timeout: 10m0s
   used: 2024-04-15 12:07:09.356193457 +0000 UTC
 level: info
 message: Shutting down instance
timestamp: "2024-04-15T12:08:28.305832124Z"
type: logging


location: none
metadata:
 context:
   action: stop
   instance: u
   project: default
   reusable: "true"
 level: debug
 message: Instance operation lock created
timestamp: "2024-04-15T12:08:28.305813353Z"
type: logging


location: none
metadata:
 context:
   instance: u
   instanceType: container
   project: default
 level: debug
 message: Shutdown request sent to instance
timestamp: "2024-04-15T12:08:28.306351356Z"
type: logging


location: none
metadata:
 context:
   ip: '@'
   method: GET
   protocol: unix
   url: /1.0
   username: root
 level: debug
 message: Handling API request
timestamp: "2024-04-15T12:10:38.268495437Z"
type: logging


location: none
metadata:
 context:
   ip: '@'
   method: GET
   protocol: unix
   url: /1.0/instances/u?recursion=1
   username: root
 level: debug
 message: Handling API request
timestamp: "2024-04-15T12:10:38.269666709Z"
type: logging


location: none
metadata:
 context:
   driver: dir
   instance: u
   pool: default
   project: default
 level: debug
 message: GetInstanceUsage started
timestamp: "2024-04-15T12:10:38.275039775Z"
type: logging


location: none
metadata:
 context:
   driver: dir
   instance: u
   pool: default
   project: default
 level: debug
 message: GetInstanceUsage finished
timestamp: "2024-04-15T12:10:38.275631324Z"
type: logging


location: none
metadata:
 context:
   ip: '@'
   method: GET
   protocol: unix
   url: /1.0/instances/u/logs/lxc.log
   username: root
 level: debug
 message: Handling API request
timestamp: "2024-04-15T12:10:38.294387288Z"
type: logging


location: none
metadata:
 action: instance-log-retrieved
 requestor:
   address: '@'
   protocol: unix
   username: root
 source: /1.0/instances/u/backups/lxc.log
project: default
timestamp: "2024-04-15T12:10:38.295641303Z"
type: lifecycle


location: none
metadata:
 class: task
 created_at: "2024-04-15T12:08:28.29445839Z"
 description: Stopping instance
 err: 'Failed shutting down instance, status is "Running": context deadline exceeded'
 id: f74b35a7-28f7-4da7-a3a2-945f79b9d29a
 location: none
 may_cancel: false
 metadata: null
 resources:
   instances:
   - /1.0/instances/u
 status: Failure
 status_code: 400
 updated_at: "2024-04-15T12:08:28.29445839Z"
project: default
timestamp: "2024-04-15T12:18:28.30729922Z"
type: operation


location: none
metadata:
 context:
   instance: u
   instanceType: container
   project: default
   timeout: 10m0s
 level: debug
 message: Shutdown finished
timestamp: "2024-04-15T12:18:28.307150063Z"
type: logging


location: none
metadata:
 context:
   class: task
   description: Stopping instance
   err: 'Failed shutting down instance, status is "Running": context deadline exceeded'
   operation: f74b35a7-28f7-4da7-a3a2-945f79b9d29a
   project: default
 level: debug
 message: Failure for operation
timestamp: "2024-04-15T12:18:28.307227229Z"
type: logging


location: none
metadata:
 context:
   listener: 7206ca2e-6647-493d-acb0-9f2b739ad80f
   local: /var/snap/lxd/common/lxd/unix.socket
   remote: '@'
 level: debug
 message: Event listener server handler stopped
timestamp: "2024-04-15T12:18:28.312781582Z"
type: logging

mszywala commented Apr 15 '24 12:04

Any updates on this topic?

mszywala commented Jun 24 '24 13:06

It seems that the Shutdown function is hanging here, probably because LXD is not receiving the appropriate hook notification to finish the operation (neither the stop hook nor the stop-namespace hook). It is still to be determined what the inconsistency in the hook is.

hamistao commented Jul 15 '24 14:07

Took a look at this and found the following in /var/snap/lxd/common/lxd/logs/c1/lxc.log after setting `sudo snap set lxd daemon.debug=true; sudo systemctl reload snap.lxd.daemon`:

lxc c1 20240719091404.652 DEBUG    conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "c1" stop produced output: /snap/lxd/current/bin/lxd: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /snap/lxd/current/bin/lxd)

lxc c1 20240719091404.652 DEBUG    conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "c1" stop produced output: /snap/lxd/current/bin/lxd: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /snap/lxd/current/bin/lxd)

lxc c1 20240719091404.652 DEBUG    conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "c1" stop produced output: /snap/lxd/current/bin/lxd: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /snap/lxd/current/bin/lxd)

lxc c1 20240719091404.652 DEBUG    conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "c1" stop produced output: /snap/lxd/current/bin/lxd: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /snap/lxd/current/bin/lxd)

lxc c1 20240719091404.653 ERROR    conf - ../src/src/lxc/conf.c:run_buffer:322 - Script exited with status 1
lxc c1 20240719091404.653 ERROR    start - ../src/src/lxc/start.c:lxc_end:985 - Failed to run lxc.hook.post-stop for container "c1"

Looks like it could be related to the base snap change from 5.0 to 5.21 (core20 to core22).

tomponline commented Jul 19 '24 09:07

@simondeziel @mihalicyn By the way, I've confirmed this is also an issue going from 5.0/stable to 5.0/edge, so it'll need to be sorted before we put out 5.0.4 with core22.

tomponline commented Jul 19 '24 09:07

@tomponline Any chance of a fix here? This effectively blocks users from updating LXD on production machines.

mariusklocke commented Sep 09 '24 14:09

Yep, a good chance. It's on my list as a high priority; we know what the issue is and have a plan to address it.

tomponline commented Sep 09 '24 15:09

Working on a fix here https://github.com/canonical/lxd/pull/14182

tomponline commented Sep 30 '24 12:09

The issue here is that, going from LXD 5.0.x to LXD 5.21.0 onwards, the base snap version has changed from core20 to core22.

This has involved changing the glibc version, and LXD's dependencies have changed version too.

LXD is reloaded when the snap is upgraded, and so it picks up the new dependencies of the core snap being used, because its LD_LIBRARY_PATH environment variable is updated to point to the new revision.

However the running containers are not restarted, and so their LXC monitor process continues to have the old LD_LIBRARY_PATH environment variable.

The problem is that when the running containers are eventually stopped or restarted, they need to notify the LXD daemon that they have stopped so that the host-side cleanup can occur (things like removing the host-side NIC interfaces, unmounting volumes and removing unix devices).

The way this works normally is that LXD receives a notification on its local unix socket from the LXC monitor process: when LXD starts the container, it instructs the monitor process to call `lxd callhook .. {stopns|stop}`, which in turn connects to the LXD unix socket to signal that the container is stopping.
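
For reference, the generated container config wires these hooks up as follows (copied verbatim from the config snippet quoted later in this thread):

lxc.hook.stop = /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "j" stopns
lxc.hook.post-stop = /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "j" stop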

LXD then takes that notification, creates/continues a stop operation, performs the host-side cleanup and when done completes the stop operation.

Unfortunately, due to the base snap change from core20 to core22, the `lxd callhook` command fails to run: it cannot satisfy its dynamic dependencies because the core22 libs are not compatible with the staged libs it finds via the old LD_LIBRARY_PATH environment variable it was started with.

The long-term fix for this is to update the LXD snap so that it doesn't start LXD (and the LXC monitor process) with an LD_LIBRARY_PATH containing the specific snap revision, but instead uses the /snap/lxd/current prefix so that we always use the current snap's libraries (both core and staged).
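
To illustrate the difference (the revision number and lib paths below are examples only, not the exact values the snap sets):

# Revision-pinned path: what a long-running LXC monitor keeps after a refresh
LD_LIBRARY_PATH=/snap/lxd/27948/lib:/snap/lxd/27948/lib/x86_64-linux-gnu
# "current"-prefixed path: always resolves to the active snap revision
LD_LIBRARY_PATH=/snap/lxd/current/lib:/snap/lxd/current/lib/x86_64-linux-gnu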

Unfortunately this cannot be changed for running containers, so a workaround was attempted: a small 2-line wrapper script called lxd-lxc-stop that simply forces LD_LIBRARY_PATH to the /snap/lxd/current prefix and then runs LXD from /snap/lxd/current/bin/lxd. This works, even with the old environment variables from the LXD 5.0 snap.
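
A sketch of what such a wrapper might look like, based only on the description above (the real lxd-lxc-stop script lives in the linked pull request):

#!/bin/sh
# Force the library path to the current snap revision, then run the current LXD binary
export LD_LIBRARY_PATH="/snap/lxd/current/lib:/snap/lxd/current/lib/x86_64-linux-gnu"
exec /snap/lxd/current/bin/lxd "$@"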

The intention was to then add a start time patch to LXD to update the stop hook commands for any running containers to use the new wrapper script.

Unfortunately it was discovered by @mihalicyn that liblxc does not allow updating the stop hook paths after the container has started, meaning that we cannot instruct the running containers to use the new wrapper script when they are eventually stopped.

We will need to think of another solution.

tomponline commented Oct 01 '24 10:10

This is a problem because we are also planning to change the base snap of LXD 5.0.x to core22, which would then introduce the same issue for a point-release upgrade.

tomponline commented Oct 01 '24 10:10

I think one possible solution is to overmount /snap/lxd/current/bin/lxd during the snap refresh with the shell script you posted in https://github.com/canonical/lxd/pull/14182. That's what we have in the generated LXC container config:

lxc.hook.pre-start = /proc/55526/exe callhook /var/snap/lxd/common/lxd "default" "j" start
lxc.hook.stop = /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "j" stopns
lxc.hook.post-stop = /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd "default" "j" stop

Overmount should work pretty well.
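
A rough sketch of the overmount idea (untested; the wrapper path here is hypothetical):

# Bind-mount a fixed-up wrapper over the binary path that the old stop hooks still reference
mount --bind /var/snap/lxd/common/lxd-callhook-wrapper /snap/lxd/current/bin/lxd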

Edit: or even simpler, we could put the LXD binary in, say, bin/_lxd and a shell script in bin/lxd with the following content (not tested):

#!/bin/sh

# Override LD_LIBRARY_PATH only for the callhook subcommand
# (ARCH is assumed to hold the architecture triplet, e.g. x86_64-linux-gnu)
if [ "$1" = "callhook" ]; then
    export LD_LIBRARY_PATH="/snap/lxd/current/lib:/snap/lxd/current/lib/${ARCH}"
fi

exec /snap/lxd/current/bin/_lxd "$@"

mihalicyn commented Oct 01 '24 10:10

Good idea, I'll try that. I might just remove the callhook conditional entirely and force the same set of LD_LIBRARY_PATHs for all invocations.

I prefer that idea to overmounting as I would like to avoid introducing additional mounts inside the snap mount namespace if I can avoid it!

tomponline commented Oct 01 '24 10:10

Good idea, I'll try that. I might just remove the callhook conditional entirely and force the same set of LD_LIBRARY_PATHs for all invocations.

Yeah, but then LD_LIBRARY_PATH has to be specified the same way we do it in https://github.com/canonical/lxd-pkg-snap/blob/74dd6f841241bc4de268b1e082c46aab4c40c40f/snapcraft/commands/daemon.start#L27 to cover all the cases: with ZFS/no ZFS, with Ceph, with the GPU snap, and so on.

I prefer that idea to overmounting as I would like to avoid introducing additional mounts inside the snap mount namespace if I can avoid it!

Sure, because if we do overmounting we have to do a proper cleanup during further refreshes. Additional complexity.

mihalicyn commented Oct 01 '24 11:10

The fix worked but introduced other issues; we need to revisit it.

tomponline commented Oct 14 '24 07:10

The fix for this is in LXD 6.2 currently.

tomponline commented Nov 28 '24 23:11

@tomponline Any chance for a fix in LTS versions? Due to this issue we had to postpone LXD upgrades for machines running LXD 5.0.3. You mentioned that the 5.0.x series should be updated to core22, too. Does upgrading from 5.0.3 to 5.0.4 and then to 5.21 help here?

mariusklocke commented Dec 03 '24 10:12

@tomponline Any chance for a fix in LTS versions? Due to this issue we had to postpone LXD upgrades for machines running LXD 5.0.3. You mentioned that the 5.0.x series should be updated to core22, too. Does upgrading from 5.0.3 to 5.0.4 and then to 5.21 help here?

Yep this fix will be backported, but first we need to ensure it has "baked" for a while in the 6/stable and latest/stable channels. We plan to push 6.2 out to latest/stable next week. I plan to include the fix in LXD 5.21.3 LTS if all goes well.

You mentioned that the 5.0.x series should be updated to core22, too. Does upgrading from 5.0.3 to 5.0.4 and then to 5.21 help here?

No it does not. It's the switch from core20 to core22 that is the problem, so we will need to backport the fix all the way back to the 5.0 series so that when we change the base of 5.0 to core22 we don't see the same problem.

tomponline commented Dec 03 '24 10:12

But upgrading to LXD 5.0.4 can be done now, as the base snap hasn't changed.

tomponline commented Dec 03 '24 10:12

The upgrade fix is now in LXD 5.21.3, which is in the 5.21/candidate channel for evaluation.

https://discourse.ubuntu.com/t/lxd-5-21-3-lts-has-been-released/53768

tomponline commented Jan 23 '25 15:01