
VM fails to start on Ubuntu 18.04 with LXD 5.0.3 (worked on 5.0.2)

Open verterok opened this issue 1 year ago • 9 comments

Required information

  • Distribution: Ubuntu
  • Distribution version: 18.04
  • The output of "lxc info" or if that fails:
 lxc info
config:
  core.proxy_ignore_hosts: 10.131.1.106,10.131.1.171,10.131.1.80,127.0.0.1,::1,localhost
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- vsock_api
- storage_volumes_all_projects
- projects_networks_restricted_access
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- cpu_hotplug
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- storage_pool_loop_resize
- migration_vm_live
- auth_user
- instances_state_total
- numa_cpu_placement
- network_allocations
- storage_api_remote_volume_snapshot_copy
- zfs_delegate
- operations_get_query_all_projects
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- operation_wait
- cluster_internal_custom_volume_copy
- instance_move_config
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: jenkins
auth_user_method: unix
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
...
    -----END CERTIFICATE-----
  certificate_fingerprint: 0ae730d159455239ac72e770bab8043ab1817cdd055497cbcebe6ae18af3ca89
  driver: lxc | qemu
  driver_version: 5.0.3 | 8.0.5
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "false"
    netnsid_getifaddrs: "false"
    seccomp_listener: "false"
    seccomp_listener_continue: "false"
    shiftfs: "false"
    uevent_injection: "false"
    unpriv_fscaps: "true"
  kernel_version: 4.15.0-221-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "18.04"
  project: default
  server: lxd
  server_clustered: false
  server_event_mode: full-mesh
  server_name: juju-545b99-jenkins-13
  server_pid: 14662
  server_version: 5.0.3
  storage: zfs
  storage_version: 0.7.5-1ubuntu16.12
  storage_supported_drivers:
  - name: btrfs
    version: 5.4.1
    remote: false
  - name: ceph
    version: 15.2.17
    remote: true
  - name: cephfs
    version: 15.2.17
    remote: true
  - name: cephobject
    version: 15.2.17
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.37.0
    remote: false
  - name: zfs
    version: 0.7.5-1ubuntu16.12
    remote: false
  • Kernel version: 4.15.0-221-generic
  • LXC version: 5.0.3
  • LXD version: 5.0.3
  • Storage backend in use: zfs

Issue description

LXD is unable to start a VM. The cause seems to be that the QEMU process is killed and shows up as [qemu-system-x86] <defunct> in the ps aux output.

Steps to reproduce

  1. lxc init --vm ubuntu:focal vm-test
  2. lxc start vm-test
  3. The command in step 2 never finishes and vm-test is not started (a quick check is shown below)
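
A quick way to confirm the hang (snap paths assumed) is to check the process list for a defunct QEMU and to look at the per-instance QEMU log:

sudo ps aux | grep [q]emu-system
sudo cat /var/snap/lxd/common/lxd/logs/vm-test/qemu.log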

Information to attach

$ sudo ps aux | grep qemu
root     11584  0.1  0.4 273928 34696 ?        Ssl  14:40   0:00 /snap/lxd/26881/bin/qemu-system-x86_64 -S -name vm-test -uuid 26db2276-06e3-48ab-b4d2-828e654606c6 -daemonize -cpu host -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/vm-test/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/vm-test/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/vm-test/qemu.pid -D /var/snap/lxd/common/lxd/logs/vm-test/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd
root     11591  0.0  0.0      0     0 ?        Zs   14:40   0:00 [qemu-system-x86] <defunct>
root     11593  0.0  0.5 1593764 41584 ?       Sl   14:40   0:00 /snap/lxd/26881/bin/qemu-system-x86_64 -S -name vm-test -uuid 26db2276-06e3-48ab-b4d2-828e654606c6 -daemonize -cpu host -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/vm-test/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/vm-test/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/vm-test/qemu.pid -D /var/snap/lxd/common/lxd/logs/vm-test/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd
jenkins  13112  0.0  0.0  14860  1064 ?        S    14:47   0:00 grep qemu

$ sudo ps aux | grep lxc
jenkins  10719  0.0  0.2 1910800 19144 ?       Sl   14:39   0:00 /snap/lxd/26881/bin/lxc monitor --type=logging --pretty
jenkins  13115  0.0  0.0  14860  1080 ?        S    14:47   0:00 grep lxc
root     14647  0.0  0.0 152140   272 ?        Sl   Feb08   0:00 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid

verterok avatar Feb 09 '24 15:02 verterok

Good news (or bad): I can reproduce the issue when running on Bionic (4.15.0-213.224). I get the same defunct QEMU, and there is nothing weird in dmesg.

simondeziel avatar Feb 16 '24 22:02 simondeziel

@verterok switching to the HWE kernel (5.4.0-150-generic) made it work.
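
For anyone hitting this on Bionic, a sketch of that workaround (the standard HWE meta-package is assumed):

sudo apt update
sudo apt install --install-recommends linux-generic-hwe-18.04
sudo reboot
uname -r    # should now report a 5.4 HWE kernel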

simondeziel avatar Feb 16 '24 22:02 simondeziel

If I go back to the 4.15 kernel and strace the hung QEMU, here's what I get:

root@hogplum:~# ps fauxZ | tail
unconfined                      ubuntu    2240  0.0  0.0  76632  7820 ?        Ss   22:36   0:00 /lib/systemd/systemd --user
unconfined                      ubuntu    2241  0.0  0.0 259232  2568 ?        S    22:36   0:00  \_ (sd-pam)
unconfined                      root      3745  0.6  0.0 1849236 32928 ?       Ssl  22:37   0:07 /usr/lib/snapd/snapd
unconfined                      root      6340  0.0  0.0   2620  1688 ?        Ss   22:56   0:00 /bin/sh /snap/lxd/27037/commands/daemon.start
unconfined                      root      6522  1.7  0.2 2653724 137448 ?      Sl   22:56   0:01  \_ lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
lxd_dnsmasq-lxdbr0_</var/snap/lxd/common/lxd> (enforce) lxd 6682 0.4  0.0 9148 4400 ? Ss 22:56   0:00      \_ dnsmasq --keep-in-foreground --strict-order --bind-interfaces --except-interface=lo --pid-file= --no-ping --interface=lxdbr0 --dhcp-rapid-commit --no-negcache --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.141.18.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.141.18.2,10.141.18.254,1h --listen-address=fd42:40ec:efe2:edac::1 --enable-ra --dhcp-range ::,constructor:lxdbr0,ra-stateless,ra-names -s lxd --interface-name _gateway.lxd,lxdbr0 -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd -g lxd
lxd-vm-test_</var/snap/lxd/common/lxd> (enforce) root 6800 0.0  0.0 273928 35656 ? Ssl 22:56   0:00      \_ /snap/lxd/27037/bin/qemu-system-x86_64 -S -name vm-test -uuid bfbb0567-525b-44e1-b99a-03680aae105e -daemonize -cpu host -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/vm-test/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/vm-test/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/vm-test/qemu.pid -D /var/snap/lxd/common/lxd/logs/vm-test/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd
lxd-vm-test_</var/snap/lxd/common/lxd> (enforce) root 6807 0.0  0.0 0 0 ?      Zs   22:56   0:00          \_ [qemu-system-x86] <defunct>
unconfined                      root      6510  0.0  0.0 152136  2064 ?        Sl   22:56   0:00 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid
lxd-vm-test_</var/snap/lxd/common/lxd> (enforce) root 6809 0.0  0.0 1593788 41688 ? Sl 22:56   0:00 /snap/lxd/27037/bin/qemu-system-x86_64 -S -name vm-test -uuid bfbb0567-525b-44e1-b99a-03680aae105e -daemonize -cpu host -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/vm-test/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/vm-test/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/vm-test/qemu.pid -D /var/snap/lxd/common/lxd/logs/vm-test/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd
root@hogplum:~# timeout 30 strace -p 6809
strace: Process 6809 attached
recvmsg(20, strace: Process 6809 detached
 <detached ...>

And fd 20 is:

root@hogplum:~# ll /proc/6809/fd
total 0
dr-x------ 2 root root  0 Feb 16 22:56 ./
dr-xr-xr-x 9 root root  0 Feb 16 22:56 ../
lr-x------ 1 root root 64 Feb 16 22:59 0 -> /dev/null
lrwx------ 1 root root 64 Feb 16 22:59 1 -> /var/snap/lxd/common/lxd/logs/vm-test/qemu.early.log
lrwx------ 1 root root 64 Feb 16 22:59 10 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Feb 16 22:59 11 -> 'socket:[44153]'
lrwx------ 1 root root 64 Feb 16 22:59 12 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Feb 16 22:59 13 -> 'anon_inode:[eventfd]'
lr-x------ 1 root root 64 Feb 16 22:59 14 -> /dev/urandom
lrwx------ 1 root root 64 Feb 16 22:59 15 -> 'socket:[41606]'
lrwx------ 1 root root 64 Feb 16 22:59 16 -> 'socket:[41607]'
lrwx------ 1 root root 64 Feb 16 22:59 17 -> 'socket:[41608]'
lrwx------ 1 root root 64 Feb 16 22:59 18 -> 'socket:[41609]'
lrwx------ 1 root root 64 Feb 16 22:59 19 -> 'socket:[41610]'
l-wx------ 1 root root 64 Feb 16 22:59 2 -> /var/snap/lxd/common/lxd/logs/vm-test/qemu.log
lrwx------ 1 root root 64 Feb 16 22:59 20 -> 'socket:[41611]'
lr-x------ 1 root root 64 Feb 16 22:59 21 -> /snap/lxd/27037/share/qemu/OVMF_CODE.fd
lrwx------ 1 root root 64 Feb 16 22:59 22 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Feb 16 22:59 23 -> /dev/kvm
lrwx------ 1 root root 64 Feb 16 22:59 24 -> anon_inode:kvm-vm
lrwx------ 1 root root 64 Feb 16 22:59 25 -> '/memfd:memory-backend-memfd (deleted)'
lrwx------ 1 root root 64 Feb 16 22:59 26 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Feb 16 22:59 27 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Feb 16 22:59 28 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Feb 16 22:59 29 -> anon_inode:kvm-vcpu
lrwx------ 1 root root 64 Feb 16 22:59 3 -> /dev/vhost-vsock
lrwx------ 1 root root 64 Feb 16 22:59 30 -> /var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/vm-test/OVMF_VARS.ms.fd
lrwx------ 1 root root 64 Feb 16 22:59 31 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Feb 16 22:59 32 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Feb 16 22:59 33 -> 'anon_inode:[eventfd]'
lr-x------ 1 root root 64 Feb 16 22:59 34 -> /var/snap/lxd/common/lxd/devices/vm-test/config.mount/
lr-x------ 1 root root 64 Feb 16 22:56 4 -> /var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/vm-test/OVMF_VARS.ms.fd
lrwx------ 1 root root 64 Feb 16 22:59 5 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Feb 16 22:59 6 -> 'anon_inode:[eventfd]'
l-wx------ 1 root root 64 Feb 16 22:59 7 -> /var/snap/lxd/common/lxd/logs/vm-test/qemu.pid
l-wx------ 1 root root 64 Feb 16 22:59 8 -> 'pipe:[47138]'
lrwx------ 1 root root 64 Feb 16 22:59 9 -> 'anon_inode:[signalfd]'
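
For reference, the unix socket behind fd 20 (inode 41611 in the listing above) can be mapped back to a bound path or peer with something like:

grep -w 41611 /proc/net/unix
ss -xp | grep -w 41611

With hindsight, that is presumably the virtio-fs config socket identified later in the thread.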

simondeziel avatar Feb 16 '24 22:02 simondeziel

@simondeziel thanks for looking into this! Indeed, using the HWE kernel is our current workaround, and we plan to move the instances to Focal.

cheers!

verterok avatar Feb 21 '24 12:02 verterok

5.21/stable and 5.21/edge also present the same problem. I tried changing the QEMU version in the snap: neither v8.0.5 (used in LXD 5.0.3) nor v8.0.3 works, but when reverting to v7.1.0 (the LXD 5.0.2 version), VM start works again. I also tried QEMU v9.0.1 and that also works; apparently a fix for whatever is happening here was introduced on their side after v8.2.1.

hamistao avatar Jun 14 '24 04:06 hamistao

@hamistao just for the record, what kernel version are you using (uname -a)?

tomponline avatar Jun 14 '24 10:06 tomponline

I am using 4.15.0-213-generic

hamistao avatar Jun 14 '24 10:06 hamistao

I am using 4.15.0-213-generic

This kernel is from 2023-06-28. Surely there is a newer one available in the ESM repos. I think we should try and see if the latest supported kernel from the ESM repos is also affected by the regression with QEMU 8.0.5 (from LXD 5.0.3).
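
For example, to see what the enabled repos currently offer:

apt-cache policy linux-image-generic
apt list --upgradable 2>/dev/null | grep linux-image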

simondeziel avatar Jun 14 '24 12:06 simondeziel

It's a long shot, but I think it could have to do with our use of Hyper-V features. Here's a workaround that makes it work with a 4.15 kernel and QEMU 8.0.5:

root@hardhat:~# uname -a
Linux hardhat 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
root@hardhat:~# snap list lxd
Name  Version        Rev    Tracking    Publisher   Notes
lxd   5.0.3-d921d2e  28373  5.0/stable  canonical✓  -
root@hardhat:~# lxc info | grep -A1 -w qemu
  driver: lxc | qemu
  driver_version: 5.0.3 | 8.0.5

root@hardhat:~# lxc launch ubuntu:focal vm-test --vm -d root,size.state=2GiB -c migration.stateful=true
root@hardhat:~# sleep 60
root@hardhat:~# IP=$(lxc list -c4 -f csv vm-test | cut -d\  -f1)
root@hardhat:~# echo $IP
10.156.110.81
root@hardhat:~# ping $IP
PING 10.156.110.81 (10.156.110.81) 56(84) bytes of data.
64 bytes from 10.156.110.81: icmp_seq=1 ttl=64 time=0.327 ms
64 bytes from 10.156.110.81: icmp_seq=2 ttl=64 time=0.255 ms
^C
--- 10.156.110.81 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1005ms
rtt min/avg/max/mdev = 0.255/0.291/0.327/0.036 ms

The migration.stateful=true setting was used as a proxy to turn off the Hyper-V feature. https://gitlab.com/qemu-project/qemu/-/commit/9b98ab7d3d44911552063cfa3863b67ab79ef783 is what tipped me off to try this. That specific commit landed in QEMU 9.0.1 (tested to be working by @hamistao) and also in 8.2.5.

Now, I really don't know if that commit has anything to do with the regression; it's just a hunch.
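
A quick (and hedged) way to check whether any Hyper-V enlightenments were actually enabled for the guest is to look at the generated command line and config (paths as in the transcripts above):

ps -ww -o args= -C qemu-system-x86_64 | tr ',' '\n' | grep -i hv
grep -i hv /var/snap/lxd/common/lxd/logs/vm-test/qemu.conf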

simondeziel avatar Jun 14 '24 18:06 simondeziel

5.21/stable and 5.21/edge also present the same problem. I tried changing the QEMU version in the snap: neither v8.0.5 (used in LXD 5.0.3) nor v8.0.3 works, but when reverting to v7.1.0 (the LXD 5.0.2 version), VM start works again. I also tried QEMU v9.0.1 and that also works; apparently a fix for whatever is happening here was introduced on their side after v8.2.1.

Please can you show how you confirmed that v7.1.0 and v9.0.1 worked?

I have rebuilt 5.0/edge (core22-based) using these two versions and I still get the same issue.

tomponline avatar Jul 19 '24 11:07 tomponline

I think the hyperv thing is a red herring (I did try the patch, to no avail, btw), as that feature is already gated by a kernel version check (>= 5.10.0), so on a 4.15 kernel it won't be in use.

I confirmed that setting migration.stateful=true allows the VM to start, so in fact it turns out to be the virtio-fs config drive that is the difference in the qemu.conf file:

< # Config drive (virtio-fs)
< [chardev "qemu_config"]
< backend = "socket"
< path = "/var/snap/lxd/common/lxd/logs/v1/virtio-fs.config.sock"
< 
< [device "dev-qemu_config-drive-virtio-fs"]
< driver = "vhost-user-fs-pci"
< bus = "qemu_pcie2"
< addr = "00.1"
< tag = "config"
< chardev = "qemu_config"
< 
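
To inspect the generated config directly (instance name v1 as above, snap paths assumed), the config drive section can be pulled out with:

grep -A12 'Config drive' /var/snap/lxd/common/lxd/logs/v1/qemu.conf

With migration.stateful=true that section is absent, which matches the diff above.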

tomponline avatar Jul 19 '24 11:07 tomponline

so in fact it turns out to be the virtio-fs config drive that is the difference in the qemu.conf file

Then that is probably why my tests worked: I primed virtiofsd from QEMU's bin/virtiofsd.

hamistao avatar Jul 19 '24 11:07 hamistao

Then that is probably why my tests worked: I primed virtiofsd from QEMU's bin/virtiofsd.

In which version? Please show your reproducer steps and file changes to help others continue this debugging.

Because 8.0 doesn't come with virtiofsd, I believe, hence the separate section to build it.

tomponline avatar Jul 19 '24 12:07 tomponline

Looks to me like virtiofsd isn't starting, or is starting and then stopping but leaving behind virtio-fs.config.sock, which is then used as an indicator that virtio-fs is available. The QEMU config file is then written to use it, and QEMU blocks on that socket because nothing is connected to it.
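
One way to check that theory (instance name v1, snap paths as above): the socket file exists, but nothing is listening on it and no virtiofsd process is around, so QEMU's vhost-user handshake blocks:

ls -l /var/snap/lxd/common/lxd/logs/v1/virtio-fs.config.sock
ss -xlp | grep -F virtio-fs.config.sock || echo "no listener on the socket"
pgrep -af virtiofsd || echo "no virtiofsd process running"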

tomponline avatar Jul 19 '24 12:07 tomponline

/var/lib/snapd/hostfs/snap/lxd/current/bin/virtiofsd -o source=/var/snap/lxd/common/lxd/devices/v1/config.mount --socket-path=/var/snap/lxd/common/lxd/logs/v1/virtio-fs.config.sock
[2024-07-19T12:30:19Z WARN  virtiofsd] Use of deprecated option format '-o': Please specify options without it (e.g., '--cache auto' instead of '-o cache=auto')
[2024-07-19T12:30:19Z ERROR virtiofsd] Error entering sandbox: Fork(Os { code: 38, kind: Unsupported, message: "Function not implemented" })

tomponline avatar Jul 19 '24 12:07 tomponline

/var/lib/snapd/hostfs/snap/lxd/current/bin/virtiofsd -o source=/var/snap/lxd/common/lxd/devices/v1/config.mount --socket-path=/var/snap/lxd/common/lxd/logs/v1/virtio-fs.config.sock --sandbox=namespace
[2024-07-19T12:31:42Z WARN  virtiofsd] Use of deprecated option format '-o': Please specify options without it (e.g., '--cache auto' instead of '-o cache=auto')
[2024-07-19T12:31:42Z ERROR virtiofsd] Error entering sandbox: Fork(Os { code: 38, kind: Unsupported, message: "Function not implemented" })

tomponline avatar Jul 19 '24 12:07 tomponline

/var/lib/snapd/hostfs/snap/lxd/current/bin/virtiofsd -o source=/var/snap/lxd/common/lxd/devices/v1/config.mount --socket-path=/var/snap/lxd/common/lxd/logs/v1/virtio-fs.config.sock --sandbox=chroot
[2024-07-19T12:31:59Z WARN  virtiofsd] Use of deprecated option format '-o': Please specify options without it (e.g., '--cache auto' instead of '-o cache=auto')
[2024-07-19T12:31:59Z INFO  virtiofsd] Waiting for vhost-user socket connection...

OK, so the issue is with running virtiofsd in namespace sandbox mode, which doesn't seem to be supported on 4.15 kernels.

tomponline avatar Jul 19 '24 12:07 tomponline

Yep that made it work

tomponline avatar Jul 19 '24 12:07 tomponline

@tomponline Are my steps and files still necessary?

hamistao avatar Jul 19 '24 12:07 hamistao

@tomponline Are my steps and files still necessary?

Well, I was wondering how you used the non-Rust virtiofsd from QEMU when it's no longer provided from 8.x (I forget the exact version) onwards?

tomponline avatar Jul 19 '24 12:07 tomponline

Likely, this is the place where it fails: https://gitlab.com/virtio-fs/virtiofsd/-/blob/main/src/util.rs#L64; pidfd_open is not available on 4.15 kernels.
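
Indeed, pidfd_open(2) only landed in kernel 5.3. A quick probe from the host (using syscall number 434, the number pidfd_open was assigned when it was added):

perl -e 'my $fd = syscall(434, $$, 0); print $fd >= 0 ? "pidfd_open available\n" : "pidfd_open failed: $!\n";'

On a 4.15 kernel this should fail with "Function not implemented", matching the virtiofsd error above.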

mihalicyn avatar Jul 19 '24 12:07 mihalicyn

Thanks for your help.

We decided to switch to using --sandbox=chroot for the 5.0 series to restore compatibility with the 4.15 kernel.

tomponline avatar Jul 19 '24 14:07 tomponline

Well, I was wondering how you used the non-Rust virtiofsd from QEMU when it's no longer provided from 8.x (I forget the exact version) onwards?

I didn't do anything complicated, just commented out the virtiofsd section because snapcraft kept complaining about there being two virtiofsd binaries. Here are the changes I made. It has been some time and my notes are a little scrambled, but what I sent is at least very close to what I used. If needed, I can take some time to redo my tests.

hamistao avatar Jul 20 '24 18:07 hamistao

Well, I was wondering how you used the non-Rust virtiofsd from QEMU when it's no longer provided from 8.x (I forget the exact version) onwards?

I didn't do anything complicated, just commented out the virtiofsd section because snapcraft kept complaining about there being two virtiofsd binaries. Here are the changes I made. It has been some time and my notes are a little scrambled, but what I sent is at least very close to what I used. If needed, I can take some time to redo my tests.

So it got removed in 8.0, see https://wiki.qemu.org/ChangeLog/8.0 (search for virtiofsd). I wonder if you initially tried to build 7.1.0 and hit the duplicate file issue because it did bundle virtiofsd back then, and that version didn't use pidfd and so worked. Then you left the virtiofsd section commented out for the 9.0.1 build, which resulted in no virtiofsd being built at all, so LXD would have detected this and fallen back to using 9p instead.
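
A way to tell which transport such a build ended up using (a hypothetical check; the LXD agent mounts the config drive inside the guest) is to look for the mount type in the guest and for virtiofsd on the host:

lxc exec vm-test -- findmnt -t 9p,virtiofs
pgrep -af virtiofsd || echo "no virtiofsd on the host, so only 9p is possible"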

tomponline avatar Jul 20 '24 21:07 tomponline

Just redid some tests and that seems to have been the case. I didn't consider 9p, so I thought it would fail if no virtiofsd was present. Sorry for the confusion.

hamistao avatar Jul 20 '24 21:07 hamistao

OK thanks.

So to summarise:

  1. The 4.15.0-213.224 kernel didn't work.
  2. The 5.4.0-150-generic kernel did work.
  3. Enabling migration.stateful=true on the 4.15.0-213.224 kernel did work. This led to the incorrect theory that it might be related to migration.stateful disabling the Hyper-V hv_passthrough CPU feature (because that feature is also gated by a kernel check for 5.10.0 or newer and so was never enabled on 4.15 kernels). It was not identified at this stage that migration.stateful=true also disables virtiofsd shares.
  4. Following this, the LXD snap was rebuilt using QEMU v7.1.0, which required removing the newer external Rust-based virtiofsd build because QEMU 7.1.0 came with its own older virtiofsd (whose namespace sandbox mode didn't require pidfd_open syscall support). This was found to work (but at this stage the old-virtiofsd difference had not been identified).
  5. Then the LXD snap was rebuilt using QEMU v9.0.1, but critically, the external virtiofsd was not re-enabled in the build. This caused virtiofsd to be missing entirely from the snap, so LXD fell back to 9p mode only, meaning that VMs then started, which created the incorrect perception that QEMU v9.0.1 fixed the issue.
  6. Then, after escalation, the gated hv_passthrough CPU feature and the fact that migration.stateful=true disables virtiofsd shares were identified. A new snap build was made that tried QEMU v9.0.1 with the external virtiofsd re-enabled, and the issue returned, ruling out that a fix appeared in that release. This turned the focus to virtiofsd, rather than QEMU, as the source of the issue.
  7. Entering the LXD snap's mount namespace and running virtiofsd directly identified the actual error as [ERROR virtiofsd] Error entering sandbox: Fork(Os { code: 38, kind: Unsupported, message: "Function not implemented" }).
  8. It was then identified that --sandbox=chroot works on 4.15 kernels, and that --sandbox=namespace (the default) requires the pidfd_open syscall, which was not added until kernel 5.3.
  9. It was decided to add a fix to LXD to use --sandbox=chroot to start virtiofsd on pre-5.3 kernels (a sketch of that gate follows below).
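
A minimal sketch of that gate (not LXD's actual implementation; the virtiofsd flags are copied from the commands earlier in the thread):

kernel="$(uname -r | cut -d- -f1)"
# pidfd_open (needed by the default --sandbox=namespace mode) appeared in kernel 5.3
if printf '5.3\n%s\n' "$kernel" | sort -V -C; then sandbox=namespace; else sandbox=chroot; fi
virtiofsd -o source=/var/snap/lxd/common/lxd/devices/v1/config.mount \
  --socket-path=/var/snap/lxd/common/lxd/logs/v1/virtio-fs.config.sock --sandbox="$sandbox"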

tomponline avatar Jul 22 '24 09:07 tomponline