stopping/deleting a freshly created container can fail
Required information
On Ubuntu 20.04.4 with LXD from snap (4.24 rev 22674)
$ lxc info
config:
core.https_address: 10.98.30.1
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
addresses:
- 10.98.30.1:8443
architectures:
- x86_64
- i686
certificate: |
-----BEGIN CERTIFICATE-----
-----END CERTIFICATE-----
certificate_fingerprint: 07f26135c114ba171f0204258b0b54df9e46461bf793b99ccb154bb40e8d0697
driver: lxc | qemu
driver_version: 4.0.12 | 6.1.1
firewall: xtables
kernel: Linux
kernel_architecture: x86_64
kernel_features:
idmapped_mounts: "false"
netnsid_getifaddrs: "true"
seccomp_listener: "true"
seccomp_listener_continue: "true"
shiftfs: "false"
uevent_injection: "true"
unpriv_fscaps: "true"
kernel_version: 5.13.0-35-generic
lxc_features:
cgroup2: "true"
core_scheduling: "true"
devpts_fd: "true"
idmapped_mounts_v2: "true"
mount_injection_file: "true"
network_gateway_device_route: "true"
network_ipvlan: "true"
network_l2proxy: "true"
network_phys_macvlan_mtu: "true"
network_veth_router: "true"
pidfd: "true"
seccomp_allow_deny_syntax: "true"
seccomp_notify: "true"
seccomp_proxy_send_notify_fd: "true"
os_name: Ubuntu
os_version: "20.04"
project: default
server: lxd
server_clustered: false
server_event_mode: full-mesh
server_name: sdeziel-lemur
server_pid: 192964
server_version: "4.24"
storage: zfs
storage_version: 2.0.6-1ubuntu2
storage_supported_drivers:
- name: cephfs
version: 15.2.14
remote: true
- name: dir
version: "1"
remote: false
- name: lvm
version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.45.0
remote: false
- name: zfs
version: 2.0.6-1ubuntu2
remote: false
- name: ceph
version: 15.2.14
remote: true
- name: btrfs
version: 5.4.1
remote: false
Issue description
Creating a container and immediately stopping or deleting it fails:
$ lxc launch focal t && lxc stop -f t
Creating t
Starting t
Error: Failed unmounting instance: In use
Try `lxc info --show-log t` for more info
$ lxc info --show-log t
Name: t
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2022/03/18 09:48 EDT
Last Used: 2022/03/18 09:48 EDT
Log:
lxc t 20220318134836.579 WARN conf - conf.c:lxc_map_ids:3592 - newuidmap binary is missing
lxc t 20220318134836.579 WARN conf - conf.c:lxc_map_ids:3598 - newgidmap binary is missing
lxc t 20220318134836.580 WARN conf - conf.c:lxc_map_ids:3592 - newuidmap binary is missing
lxc t 20220318134836.580 WARN conf - conf.c:lxc_map_ids:3598 - newgidmap binary is missing
lxc t 20220318134836.580 WARN cgfsng - cgroups/cgfsng.c:fchowmodat:1252 - No such file or directory - Failed to fchownat(42, memory.oom.group, 1000000000, 0, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW )
lxc t 20220318134837.695 WARN conf - conf.c:lxc_map_ids:3592 - newuidmap binary is missing
lxc t 20220318134837.695 WARN conf - conf.c:lxc_map_ids:3598 - newgidmap binary is missing
Similar issue when deleting:
$ lxc launch focal t2 && lxc delete -f t2
Creating t2
Starting t2
Error: Stopping the instance failed: Failed unmounting instance: In use
After such a failure, a subsequent lxc stop or lxc delete works.
What storage pool is this? I've not been able to recreate it on a loop-backed ZFS pool on NVMe.
If I run lxc stop -f t in a loop in one window and then launch a fresh container t in another, I see:
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: Instance is busy running a "start" operation
Try `lxc info --show-log t` for more info
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Which is what I would expect to see.
It's on a zpool backed by an NVMe partition.
$ lxc storage list
+---------+--------+---------+-------------+---------+---------+
| NAME | DRIVER | SOURCE | DESCRIPTION | USED BY | STATE |
+---------+--------+---------+-------------+---------+---------+
| default | zfs | default | | 13 | CREATED |
+---------+--------+---------+-------------+---------+---------+
$ lxc storage show default
config:
source: default
volatile.initial_source: /dev/nvme0n1p6
zfs.pool_name: default
description: ""
name: default
driver: zfs
used_by:
- /1.0/images/39c55b257dc40e812cc1431e04eb1c883b03c349b67fbffb3300d76e42c36176
- /1.0/images/45e0432a28176ae29a29869cd667f87f633c00ef819804ecfb721f82c5acc995?project=charmcraft
- /1.0/images/6abc292fe3f02ad2218127926cda5dba31c6f2f0f21fc929157a2c7051c16f17
- /1.0/images/ae2a19ee7bc7f3b21cf7e529c51c5138ae1b99cbd0186265726c4265c1827303
...
- /1.0/profiles/default
- /1.0/profiles/default?project=charmcraft
status: Created
locations:
- none
It must be extremely fast, as I cannot reproduce it on my NVMe ZFS partition either :(
"Extremely fast" ought to be relative, as the machine is from 2016 ;)
That's new ;)
I can theorise that there is a very small potential for a race here:
https://github.com/lxc/lxd/blob/42bd6dcf91aa3fb0ae4a2ece1ff9c0c1fadd6a98/lxd/instance/operationlock/operationlock.go#L119-L126
Called from Start():
https://github.com/lxc/lxd/blob/master/lxd/instance/drivers/driver_lxc.go#L2355-L2369
and Stop():
https://github.com/lxc/lxd/blob/master/lxd/instance/drivers/driver_lxc.go#L2614-L2624
Such that Start and Stop both think there is no existing operation and both call Create.
Although the lock in Create should still catch it:
https://github.com/lxc/lxd/blob/42bd6dcf91aa3fb0ae4a2ece1ff9c0c1fadd6a98/lxd/instance/operationlock/operationlock.go#L59-L74
Which is what I see in my tests:
Error: Instance is busy running a "start" operation
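To make the theory concrete, here is a minimal, self-contained Go sketch of that check-then-create pattern; the names (get, create, ops) are hypothetical stand-ins, not the real operationlock API. Both goroutines can pass the unlocked existence check, but the mutex inside create still rejects the second caller with the same "busy" error as above:

```go
// Minimal sketch of the check-then-create race described above. All names
// here are hypothetical stand-ins, not the actual lxd/instance/operationlock code.
package main

import (
	"fmt"
	"sync"
)

var (
	mu  sync.Mutex
	ops = map[string]string{} // instance name -> in-flight action
)

// get mimics the unlocked "is there an existing operation?" check.
func get(name string) (string, bool) {
	mu.Lock()
	defer mu.Unlock()
	action, ok := ops[name]
	return action, ok
}

// create mimics a Create() that checks and inserts under one mutex, so the
// second of two racing callers is rejected even if both saw "no existing
// operation" via get() a moment earlier.
func create(name, action string) error {
	mu.Lock()
	defer mu.Unlock()
	if existing, ok := ops[name]; ok {
		return fmt.Errorf("Instance is busy running a %q operation", existing)
	}
	ops[name] = action
	return nil
}

func main() {
	var wg sync.WaitGroup
	for _, action := range []string{"start", "stop"} {
		action := action
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, ok := get("t"); !ok { // both goroutines can pass this check...
				if err := create("t", action); err != nil {
					fmt.Println(err) // ...but only one create() succeeds
				}
			}
		}()
	}
	wg.Wait()
}
```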
I'm happy to run any debug binary you feed me or give you SSH access (to my museum piece) if that's more convenient ;)
Can you get me the debug log output when you see this happening, please?
Not sure it is related, as all the attention here goes toward storage. It sounds strange, but in my case, stopping or deleting an instance immediately after creation failed for as long as the instance was struggling to obtain an IP from the DHCP server.
I cannot reproduce after snap set lxd daemon.debug=true and systemctl restart snap.lxd.daemon :/
After unsetting daemon.debug and restarting LXD, I can no longer reproduce it.
Whatever it was, I can no longer reproduce it. I will reopen if it happens again. Thanks Tom.
I just got into a state where this happens consistently. It happened after I was creating => starting => destroying containers in an automated test suite. I'm not sure what gets me into this state, but restarting the LXD daemon "fixes" the issue.
$> lxc launch testing testing && lxc delete -f testing
Creating testing
Starting testing
Error: Stopping the instance failed: Failed unmounting instance: In use
It will show this error, but the instance will still be deleted.
Sometimes I can see this error in the logs; maybe it's of use:
Every 1.0s: lxc info testing --show-log balance: Thu Sep 8 23:03:02 2022
Name: testing
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2022/09/08 23:02 CEST
Last Used: 2022/09/08 23:02 CEST
Log:
lxc 20220908210259.562 ERROR af_unix - ../src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20220908210259.562 ERROR commands - ../src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command "get_state"
I'm running Arch with LXD community/lxd 5.5-1 with a dir storage pool.
I've also noticed that LXD no longer cleans up network interfaces when I get into this state.
Annoyingly, I still haven't found a reliable way to get into this state. It seems to be related to calls to the /1.0/instances/{name}/state endpoint, but I'm not sure. I will investigate further tomorrow.
I can now reproduce this; it seems to be caused by starting an instance while executing some other operation on it at the same time.
I'm testing this with a default profile that only contains a root disk.
This executes successfully:
~ lxc init images:alpine/edge c && lxc start c; lxc exec c hostname; lxc delete -f c
Creating c
The instance you are starting doesn't have any network attached to it.
To create a new network, use: lxc network create
To attach a network to an instance, use: lxc network attach
c
But if I execute the following (note the single &, which backgrounds the `lxc init ... && lxc start c` part so that the exec races the start):
~ lxc init images:alpine/edge c && lxc start c & lxc exec c hostname; lxc delete -f c
Creating c
The instance you are starting doesn't have any network attached to it.
To create a new network, use: lxc network create
To attach a network to an instance, use: lxc network attach
[1] 2384
Error: Instance is not running
~ Error: Failed to run: /usr/bin/lxd forkstart c /var/lib/lxd/containers /var/log/lxd/c/lxc.conf: exit status 1
Try `lxc info --show-log c` for more info
[1] + 2384 exit 1 lxc start c
Then, after this, when I execute the first command, I get the Failed unmounting instance: In use error:
~ lxc init images:alpine/edge c && lxc start c; lxc exec c hostname; lxc delete -f c
Creating c
The instance you are starting doesn't have any network attached to it.
To create a new network, use: lxc network create
To attach a network to an instance, use: lxc network attach
c
Error: Stopping the instance failed: Failed unmounting instance: In use
The c container logs:
Name: c
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2022/09/09 16:09 CEST
Last Used: 2022/09/09 16:09 CEST
Log:
lxc c 20220909140914.685 WARN attach - ../src/lxc/attach.c:get_attach_context:477 - No security context received
That's helpful, thanks, and it would explain why the mount is being left over, as it would be triggered by exec when the instance wasn't running. I'll try to reproduce this and fix it.
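For illustration, a hedged, self-contained Go sketch of that leak shape; every helper here (mountInstance, unmountInstance, isRunning, stopInstance) is hypothetical and only models the reference counting, not LXD's real exec or storage code. An exec that mounts the instance, then notices it isn't running and returns without unmounting, leaves a reference held, so the later stop/delete fails its unmount:

```go
// Hypothetical illustration of the mount leak shape described above; none of
// these helpers are real LXD functions.
package main

import (
	"errors"
	"fmt"
)

var mountRefs = map[string]int{} // instance name -> mount reference count

func mountInstance(name string) { mountRefs[name]++ }

func unmountInstance(name string) {
	if mountRefs[name] > 0 {
		mountRefs[name]--
	}
}

func isRunning(name string) bool { return false } // the repro execs a stopped instance

// execInstance mimics an exec handler that mounts the instance before
// checking whether it is actually running.
func execInstance(name string) error {
	mountInstance(name)
	if !isRunning(name) {
		// Bug shape: returning here without unmountInstance(name) leaves the
		// mount reference held.
		return errors.New("Instance is not running")
	}
	defer unmountInstance(name)
	// ... attach to the container and run the requested command ...
	return nil
}

// stopInstance mimics the stop/delete path, which fails its unmount if a
// reference is still held.
func stopInstance(name string) error {
	if mountRefs[name] > 0 {
		return errors.New("Failed unmounting instance: In use")
	}
	return nil
}

func main() {
	fmt.Println("exec:", execInstance("c")) // exec: Instance is not running
	fmt.Println("stop:", stopInstance("c")) // stop: Failed unmounting instance: In use
}
```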
Working on this now.
Is this reproducible on LXD 5.6 or LXD 5.0.1?
I'm running Arch with LXD community/lxd 5.5-1 with a dir storage pool.
Ah just seen.
OK I've reproduced this now.