
lxc failing with `Error: mkdir /var/snap/lxd/common/lxd/shmounts: file exists` when using snap in parallel mode

Open jameinel opened this issue 2 years ago • 9 comments

Required information

  • Distribution: Ubuntu
  • Distribution version: 22.04.2
  • The output of "lxc info" (I can't help but feel this is no longer the nice helpful summary of what is running):
$ lxc info
config:
  core.https_address: '[::]'
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- vsock_api
- storage_volumes_all_projects
- projects_networks_restricted_access
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- cpu_hotplug
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 172.29.20.18:8443
  - 10.25.164.1:8443
  architectures:
  - x86_64
  - i686
  certificate: |
<elided>
  certificate_fingerprint: 6c1bd7d6ac16fc3623b03a1a2a7f95f35ea204a471e2778d99c8e1c4b95b3fb5
  driver: lxc
  driver_version: 5.0.2
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.15.0-71-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "22.04"
  project: default
  server: lxd
  server_clustered: false
  server_event_mode: full-mesh
  server_name: jammy
  server_pid: 1264
  server_version: 5.0.2
  storage: dir | zfs | btrfs
  storage_version: 1 | 2.1.5-1ubuntu6~22.04.1 | 5.4.1
  storage_supported_drivers:
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.45.0
    remote: false
  - name: zfs
    version: 2.1.5-1ubuntu6~22.04.1
    remote: false
  - name: btrfs
    version: 5.4.1
    remote: false
  - name: ceph
    version: 15.2.17
    remote: true
  - name: cephfs
    version: 15.2.17
    remote: true
  - name: cephobject
    version: 15.2.17
    remote: true

Issue description

LXC is failing to start containers. I first noticed this while trying to run `juju_29 bootstrap lxd lxd` after doing a parallel install of the juju snap. However, the actual failure happens with only `lxc launch` in the mix:

Steps to reproduce

  1. Try to launch an LXD container:
    $ lxc launch juju/[email protected]/amd64
    Creating the instance
    Instance name is: proven-mule
    Starting proven-mule
    Error: mkdir /var/snap/lxd/common/lxd/shmounts: file exists
    Try `lxc info --show-log local:proven-mule` for more info
    
  2. Looking at the contents of the directory, that path does exist:
$ ll /var/snap/lxd/common/lxd/shmounts
lrwxrwxrwx 1 root root 39 May 10 11:15 /var/snap/lxd/common/lxd/shmounts -> /var/snap/lxd/common/shmounts/instances

However, what it points to does not:

$ sudo ls -al /var/snap/lxd/common/shmounts
total 8
drwx--x--x 2 root root 4096 Jan 20 16:00 .
drwxr-xr-x 9 root root 4096 May 10 11:15 ..
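
This is the classic dangling-symlink situation: the link itself exists (which is why `mkdir` fails with "file exists"), but the path no longer resolves. A quick confirmation sketch using standard `test` flags:

$ # -L tests the link itself; -e follows it to the (missing) target
$ test -L /var/snap/lxd/common/lxd/shmounts && echo "symlink present"
$ test -e /var/snap/lxd/common/lxd/shmounts || echo "target missing"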

I can manually delete the symlink, or manually create the instances directory, but I'm not sure what perms should be used. I don't know whether Juju is somehow using an older lxd client library version that set something up incorrectly (but the juju snap shouldn't have any rights to write into those directories anyway, so I'm pretty sure it is the LXD daemon that is setting those things up).
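
A hedged workaround sketch, assuming the missing target just needs to be recreated; the 0711 mode is copied from the parent directory listing above and is an assumption, not a confirmed fix:

$ # Recreate the symlink's missing target (mode 0711 matches the parent dir shown above; assumed)
$ sudo mkdir -p /var/snap/lxd/common/shmounts/instances
$ sudo chmod 711 /var/snap/lxd/common/shmounts/instances
$ sudo snap restart lxd    # restart so LXD re-runs its mount setup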

Information to attach

There are only 2 lines in /var/snap/lxd/common/lxd/logs/lxd.log:

time="2023-05-10T11:15:35-04:00" level=warning msg=" - Couldn't find the CGroup network priority controller, network priority will be ignored"
time="2023-05-10T11:15:35-04:00" level=warning msg="Instance type not operational" driver=qemu err="KVM support is missing (no /dev/kvm)" type=virtual-machine

jameinel avatar May 17 '23 22:05 jameinel

Does this occur on all (or fresh) systems, or just this particular machine?

tomponline avatar May 18 '23 05:05 tomponline

Moved over to the snap packaging repo

stgraber avatar May 18 '23 05:05 stgraber

It happened for 2 relatively fresh systems in my testing.

jameinel avatar May 18 '23 10:05 jameinel

It seems that the trigger is installing LXD, then enabling parallel installs (https://snapcraft.io/docs/parallel-installs), and then trying to launch a container. Vitaly should have a bit more information here.
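
For reference, enabling parallel installs per the linked doc looks roughly like this (the `juju_29` instance name mirrors the alias used above; the channel is illustrative):

$ sudo snap set system experimental.parallel-instances=true
$ sudo snap install juju_29 --channel=2.9/stable   # keyed instance alongside the plain "juju" snap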

jameinel avatar May 18 '23 14:05 jameinel

I've also run into this one.

For anyone else who runs into this: I couldn't get parallel-instances to work correctly for now, so disabling them was the only option I had. Then I had to restart LXD to get it working again.

$ sudo snap set system experimental.parallel-instances=false
$ sudo snap restart lxd

SimonRichardson avatar Jun 22 '23 10:06 SimonRichardson

I can't seem to reproduce any problem with launching the containers. I've set up VMs with 22.04 and 24.04 (LXD 5.0.3 and 5.21 respectively), enabled parallel instances, installed test-snapd-sh-core24 and test-snapd-sh-core24_foo in both VMs, and launched both so the proper mounts were set up. Then I launched a couple of containers, launched containers within the containers, and removed them, with no issues. There's a chance this may have been fixed by https://github.com/canonical/lxd-pkg-snap/pull/375 and https://github.com/canonical/lxd-pkg-snap/pull/379
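
For anyone wanting to retry this, a rough sketch of the steps above (the `.sh` app name on the test snaps is an assumption, based on the regular test-snapd-sh snap):

$ sudo snap set system experimental.parallel-instances=true
$ sudo snap install test-snapd-sh-core24 test-snapd-sh-core24_foo
$ test-snapd-sh-core24.sh -c 'true'        # run both instances once so their mounts get set up
$ test-snapd-sh-core24_foo.sh -c 'true'
$ lxc launch ubuntu:24.04 c1               # then exercise container launches as usual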

bboozzoo avatar Jun 11 '24 11:06 bboozzoo

@jameinel @bboozzoo happy to close this one?

tomponline avatar Jun 11 '24 11:06 tomponline

SGTM, if @jameinel agrees then let's close it. If the problem shows up again, feel free to file a bug for snapd to investigate and we can take it from there.

bboozzoo avatar Jun 11 '24 11:06 bboozzoo

For me, the same issue described in the original report is still happening after enabling parallel installs.

LXD 5.21.1 LTS, Ubuntu 24.04, snap 2.63+24.04

JoseFMP avatar Jul 18 '24 08:07 JoseFMP