Unable to delete operations after failed migration
Is there an existing issue for this?
- [x] There is no existing issue for this bug
Is this happening on an up to date version of Incus?
- [x] This is happening on a supported version of Incus
Incus system details
config:
core.https_address: 172.18.0.2:8443
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_dev_incus
- migration_pre_copy
- infiniband
- dev_incus_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- dev_incus_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- images_all_projects
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- zfs_delegate
- storage_api_remote_volume_snapshot_copy
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- image_restriction_privileged
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- certificate_description
- disk_io_bus_virtio_blk
- loki_config_instance
- instance_create_start
- clustering_evacuation_stop_options
- boot_host_shutdown_action
- agent_config_drive
- network_state_ovn_lr
- image_template_permissions
- storage_bucket_backup
- storage_lvm_cluster
- shared_custom_block_volumes
- auth_tls_jwt
- oidc_claim
- device_usb_serial
- numa_cpu_balanced
- image_restriction_nesting
- network_integrations
- instance_memory_swap_bytes
- network_bridge_external_create
- network_zones_all_projects
- storage_zfs_vdev
- container_migration_stateful
- profiles_all_projects
- instances_scriptlet_get_instances
- instances_scriptlet_get_cluster_members
- instances_scriptlet_get_project
- network_acl_stateless
- instance_state_started_at
- networks_all_projects
- network_acls_all_projects
- storage_buckets_all_projects
- resources_load
- instance_access
- project_access
- projects_force_delete
- resources_cpu_flags
- disk_io_bus_cache_filesystem
- instance_oci
- clustering_groups_config
- instances_lxcfs_per_instance
- clustering_groups_vm_cpu_definition
- disk_volume_subpath
- projects_limits_disk_pool
- network_ovn_isolated
- qemu_raw_qmp
- network_load_balancer_health_check
- oidc_scopes
- network_integrations_peer_name
- qemu_scriptlet
- instance_auto_restart
- storage_lvm_metadatasize
- ovn_nic_promiscuous
- ovn_nic_ip_address_none
- instances_state_os_info
- network_load_balancer_state
- instance_nic_macvlan_mode
- storage_lvm_cluster_create
- network_ovn_external_interfaces
- instances_scriptlet_get_instances_count
- cluster_rebalance
- custom_volume_refresh_exclude_older_snapshots
- storage_initial_owner
- storage_live_migration
- instance_console_screenshot
- image_import_alias
- authorization_scriptlet
- console_force
- network_ovn_state_addresses
- network_bridge_acl_devices
- instance_debug_memory
- init_preseed_storage_volumes
- init_preseed_profile_project
- instance_nic_routed_host_address
- instance_smbios11
- api_filtering_extended
- acme_dns01
- security_iommu
- network_ipv4_dhcp_routes
- network_state_ovn_ls
- network_dns_nameservers
- acme_http01_port
- network_ovn_ipv4_dhcp_expiry
- instance_state_cpu_time
- network_io_bus
- disk_io_bus_usb
- storage_driver_linstor
- instance_oci_entrypoint
- network_address_set
- server_logging
- network_forward_snat
- memory_hotplug
- instance_nic_routed_host_tables
- instance_publish_split
- init_preseed_certificates
- custom_volume_sftp
- network_ovn_external_nic_address
- network_physical_gateway_hwaddr
- backup_s3_upload
- snapshot_manual_expiry
- resources_cpu_address_sizes
- disk_attached
- limits_memory_hotplug
- disk_wwn
- server_logging_webhook
- storage_driver_truenas
- container_disk_tmpfs
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: root
auth_user_method: unix
environment:
addresses:
- 172.18.0.2:8443
architectures:
- x86_64
- i686
certificate: [REDACTED]
certificate_fingerprint: [REDACTED]
driver: lxc | qemu
driver_version: 6.0.5 | 9.0.4
firewall: nftables
kernel: Linux
kernel_architecture: x86_64
kernel_features:
idmapped_mounts: "true"
netnsid_getifaddrs: "true"
seccomp_listener: "true"
seccomp_listener_continue: "true"
uevent_injection: "true"
unpriv_binfmt: "true"
unpriv_fscaps: "true"
kernel_version: 6.12.43+deb13-amd64
lxc_features:
cgroup2: "true"
core_scheduling: "true"
devpts_fd: "true"
idmapped_mounts_v2: "true"
mount_injection_file: "true"
network_gateway_device_route: "true"
network_ipvlan: "true"
network_l2proxy: "true"
network_phys_macvlan_mtu: "true"
network_veth_router: "true"
pidfd: "true"
seccomp_allow_deny_syntax: "true"
seccomp_notify: "true"
seccomp_proxy_send_notify_fd: "true"
os_name: Debian GNU/Linux
os_version: "13"
project: default
server: incus
server_clustered: false
server_event_mode: full-mesh
server_name: [REDACTED]
server_pid: 2110
server_version: "6.16"
storage: zfs
storage_version: 2.3.2-2
storage_supported_drivers:
- name: lvm
  version: 2.03.31(2) (2025-02-27) / 1.02.205 (2025-02-27) / 4.48.0
  remote: false
- name: lvmcluster
  version: 2.03.31(2) (2025-02-27) / 1.02.205 (2025-02-27) / 4.48.0
  remote: true
- name: truenas
  version: 0.7.3
  remote: true
- name: dir
  version: "1"
  remote: false
- name: zfs
  version: 2.3.2-2
  remote: false
Instance details
No response
Instance log
No response
Current behavior
After a failed live migration (similar to #2241) I ended up with two VMs on two hosts: STOPPED on the source host and FROZEN on the target, plus a couple of stuck operations:
- source:
root@m2 ~# in operation list
+--------------------------------------+------+--------------------+---------+------------+----------------------+
| ID | TYPE | DESCRIPTION | STATE | CANCELABLE | CREATED |
+--------------------------------------+------+--------------------+---------+------------+----------------------+
| 0dc95f7b-1bdc-4460-9623-a734102896b7 | TASK | Migrating instance | RUNNING | NO | 2025/09/29 08:51 UTC |
+--------------------------------------+------+--------------------+---------+------------+----------------------+
- target:
root@m1 ~# in operation list
+--------------------------------------+-----------+-------------------+---------+------------+----------------------+
| ID | TYPE | DESCRIPTION | STATE | CANCELABLE | CREATED |
+--------------------------------------+-----------+-------------------+---------+------------+----------------------+
| b40f9c93-1de8-4a13-982f-cbf7ebb7eb9c | WEBSOCKET | Creating instance | RUNNING | NO | 2025/09/29 08:51 UTC |
+--------------------------------------+-----------+-------------------+---------+------------+----------------------+
These operations are stuck and never complete, so I decided to delete them. I have had bad luck deleting operations after a failed migration before (it dropped my VM on both hosts), but this time my VM is disposable, so I went ahead and tried incus operation delete and incus rm -f on both hosts. I managed to remove the source VM (the one in the STOPPED state), but everything else gave me errors:
incus rm -f $frozen_vm on the target:
Error: Stopping the instance failed: Instance is busy running a "create" operation
incus operation delete b40f9c93-1de8-4a13-982f-cbf7ebb7eb9c on the target:
Error: This operation can't be cancelled
incus operation delete 0dc95f7b-1bdc-4460-9623-a734102896b7 on the source:
Error: This operation can't be cancelled
So my VM and both operations are stuck. Previously I got rid of such objects with systemctl restart incus, but that's a last resort because it stops all containers and VMs (which is a problem in itself, but not related to this ticket).
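For reference, the stuck operations can also be inspected and (in theory) cancelled through the raw API. This is just a sketch using the UUIDs from the tables above; the DELETE request should be the API-level counterpart of incus operation delete and is expected to fail the same way for a non-cancelable operation:

# Show the full operation object, including the may_cancel flag
incus query /1.0/operations/0dc95f7b-1bdc-4460-9623-a734102896b7

# List all operations with full details
incus query "/1.0/operations?recursion=1"

# Try to cancel through the API directly
incus query -X DELETE /1.0/operations/0dc95f7b-1bdc-4460-9623-a734102896b7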
Expected behavior
There should be a way to cancel a failed migration without restarting the whole daemon (and, with it, all the containers/VMs).
Steps to reproduce
- Create two machines running incus
- Run a failing migration (in my case, according to qemu.log for the VM, it was caused by qemu-system-x86_64: Issue while setting TUNSETSTEERINGEBPF: Invalid argument with fd: 36, prog_fd: -1)
- Try to clean up operations/VMs
We'd need a reliable reproducer for the migration error itself, as FROZEN is a pretty unusual state for a VM to be in. Normally a migration failure causes an error on both sides, which causes the target to get deleted; clearly that didn't happen here, so we'd want a way to reproduce it to see what's going on in QEMU.
Sure, if I manage to reliably reproduce a failing migration, I'll post it. But it's not the migration issue I'm concerned about (there's already #2241), it's operations that cannot be cancelled. I thought about editing the sqlite DB directly, but that doesn't feel safe to me.
For example, the "Migrating instance" operation on the source host: why must it be uncancelable? I've already removed the source VM, so there's no chance the operation will ever complete. And even if I hadn't, there can be good reasons to cancel a migration. For example, a user starts the migration and then finds out that there's not enough free space (or RAM) on the target. Or they might make a typo and specify the wrong VM or the wrong target. In either case they should be able to ctrl+c and correct the command. There might be some point in time (probably at the very end of the migration) when it's not feasible to interrupt the process, but at least the whole period when the disk is being copied should be interruptible/cancelable.
They're not cancelable because there are functions (goroutines) running on the source or target Incus servers which cannot be cancelled. In this case, there may still be a socket connection stuck with QEMU, or a filesystem migration connection that's still established, ...
All operations go away on daemon restart since at that point any background code will also die.
An update: I restarted incus on the source host (I also updated it from the Debian repos, but that's likely unrelated) and rebooted the source host just in case: that removed the stuck operations on both hosts, as well as the FROZEN VM on the target. So I suspect the target cleaned itself up after the source was gone.
Just discovered that the machine that was FROZEN on the target, and that disappeared after restarting the source, had actually been migrated and started: it's not shown in the incus list output, but the qemu-system-x86_64 process was running, and the volume exists; I can attach it to the host system and mount its partitions.
Though I cannot remove this volume:
root@m1 ~ [1]# in storage volume rm default virtual-machine/jr-gate
Error: Storage volumes of type "virtual-machine" cannot be deleted with the storage API
root@m1 ~ [1]# incus rm jr-gate
Error: Failed checking instance exists "local:jr-gate": Failed to fetch instance "jr-gate" in project "default": Instance not found
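For completeness, this is roughly how the leftovers could be located on the target. It's only a sketch: the ZFS dataset layout and the SQL table name are assumptions about the usual Incus ZFS setup, so verify against what zfs list and the database actually show before destroying anything:

# Find the orphaned QEMU process for the instance
ps aux | grep '[q]emu-system-x86_64.*jr-gate'

# List ZFS datasets; for a VM the expected layout (assumption) is
# <pool dataset>/virtual-machines/jr-gate plus a .block zvol
zfs list -t all | grep jr-gate

# Check whether anything still references the volume in the Incus database
# (table name is an assumption about the schema)
incus admin sql global "SELECT * FROM storage_volumes WHERE name LIKE '%jr-gate%'"

# Last resort, only once nothing references the dataset anymore (assumed name):
# zfs destroy -r <pool>/virtual-machines/jr-gate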
The target probably was just running QEMU in live-migration receive mode and was waiting for the data stream from the source; something went wrong with QEMU which caused the hang/failure. Incus doesn't really see what's going on at the QEMU level in that situation, so it was just waiting for things to complete.
When you killed the source, that finally caused the target to try and clean things up, but since its QEMU didn't see the failure, you ended up with some leftovers...
So yeah, not exactly ideal... Is that something you can still reproduce easily?
Also, you say this is on two hosts and not in a cluster, so are both hosts running the exact same CPU? If not, you're likely to run into some trouble because of that. Clusters compute a baseline virtual CPU to allow for live migration across diverging systems, but that's not a thing with two standalone servers. Different models from the same vendor within a CPU generation may be okay, but crossing generations or vendors will almost certainly cause issues.
You'd also need to make sure that QEMU is the same version on both ends. Clusters again have a small edge there in that they can look at the full config because the instance already exists: we have a volatile key in there which tells us the QEMU machine profile that was used at the time the instance was started. That allows migrating to a newer QEMU without hitting issues. Moving to an older QEMU (especially across major versions) may fail.
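A quick way to compare the two hosts before the next attempt (just a sketch; run it on both machines and diff the output, nothing beyond the standard CLIs is assumed):

# Incus and QEMU versions should match on both ends
incus version
qemu-system-x86_64 --version

# CPU model and flags should be identical for live migration between standalone hosts
lscpu | grep -iE 'model name|flags'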
Anyway, assuming that you're running the same Incus, QEMU and CPU on source and target, further debugging would likely need:
- incus monitor --pretty running from before the migration attempt until the end of cleanup after you restart the source to unblock things
- Goroutine dump on source and target servers while they're stuck (incus config set core.debug_address=127.0.0.1:8444 followed by curl http://127.0.0.1:8444/debug/pprof/goroutine?debug=2)
- ps fauxww on both servers while things are stuck
- All log files in /var/log/incus/INSTANCE-NAME/ on both source and target (qemu.log and qemu.qmp.log should be the most useful)
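To make that easier to capture, the steps above can be scripted roughly like this on each server while things are stuck (a sketch; the debug address and output paths are arbitrary choices, and INSTANCE-NAME needs to be substituted):

# 1. Event stream: start this before the migration attempt and leave it running
incus monitor --pretty > /tmp/incus-monitor.log 2>&1 &

# 2. Goroutine dump while the operations are stuck
incus config set core.debug_address=127.0.0.1:8444
curl -s 'http://127.0.0.1:8444/debug/pprof/goroutine?debug=2' > /tmp/incus-goroutines.txt

# 3. Process tree while things are stuck
ps fauxww > /tmp/ps-fauxww.txt

# 4. Instance logs (qemu.log and qemu.qmp.log are the most useful)
cp -a /var/log/incus/INSTANCE-NAME /tmp/incus-instance-logs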
> Also, you say this is on two hosts and not in a cluster, so are both hosts running the exact same CPU?
The machines running incus are exactly the same in terms of CPU and memory, and I try to keep them as similar as possible software-wise (they run Debian 13, so incus itself is probably the most frequently updated package there).
> Is that something you can still reproduce easily?
I don't do a lot of migrations now, so not easily. When I migrate test machines (not under high load), it always works.
> Anyway, assuming that you're running the same Incus, QEMU and CPU on source and target, further debugging would likely need
Thanks for the instructions, I'll use them for the next migration.
Btw, I have disabled overcommit on my hypervisors since this ticket was created, so maybe the problem was related to available memory and I won't be able to reproduce it.