Adding an Nvidia GPU works sporadically
Required information
- Distribution: Arch Linux
- The output of "incus info":
config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_dev_incus
- migration_pre_copy
- infiniband
- dev_incus_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- dev_incus_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- images_all_projects
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- zfs_delegate
- storage_api_remote_volume_snapshot_copy
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- image_restriction_privileged
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- certificate_description
- disk_io_bus_virtio_blk
- loki_config_instance
- instance_create_start
- clustering_evacuation_stop_options
- boot_host_shutdown_action
- agent_config_drive
- network_state_ovn_lr
- image_template_permissions
- storage_bucket_backup
- storage_lvm_cluster
- shared_custom_block_volumes
- auth_tls_jwt
- oidc_claim
- device_usb_serial
- numa_cpu_balanced
- image_restriction_nesting
- network_integrations
- instance_memory_swap_bytes
- network_bridge_external_create
- network_zones_all_projects
- storage_zfs_vdev
- container_migration_stateful
- profiles_all_projects
- instances_scriptlet_get_instances
- instances_scriptlet_get_cluster_members
- instances_scriptlet_get_project
- network_acl_stateless
- instance_state_started_at
- networks_all_projects
- network_acls_all_projects
- storage_buckets_all_projects
- resources_load
- instance_access
- project_access
- projects_force_delete
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: c0rn3j
auth_user_method: unix
environment:
addresses: []
architectures:
- x86_64
- i686
certificate: |
-----BEGIN CERTIFICATE-----
MIICBzCCAY2gAwIBAgIRAJq+jJvvcUBYON1KPndOnUgwCgYIKoZIzj0EAwMwNTEc
MBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzEVMBMGA1UEAwwMcm9vdEBMdXh1
cmlhMB4XDTIxMDYyNzE0MjIyOVoXDTMxMDYyNTE0MjIyOVowNTEcMBoGA1UEChMT
bGludXhjb250YWluZXJzLm9yZzEVMBMGA1UEAwwMcm9vdEBMdXh1cmlhMHYwEAYH
KoZIzj0CAQYFK4EEACIDYgAElX7iyAw8q/fF9Qd1P5cu7r4UM6evd98hGZu1DAmN
8EJsdcjSDheOSJWMwxz8DIihpCn2GmT16QCtjNsPJi/W/n38V0wJU8133xMYz2j1
Ms7rdd3KypcJezCNCaGEFFnHo2EwXzAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0lBAww
CgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADAqBgNVHREEIzAhggdMdXh1cmlhhwR/
AAABhxAAAAAAAAAAAAAAAAAAAAABMAoGCCqGSM49BAMDA2gAMGUCMGe6Htwpu5ab
QZOEcB0H9sS7uMbdyY3NmNQco85vA7Rz8Sx3iGYuxpFNZ6U22iez3AIxAISoiLSX
KarWaTT503kaM2csVqIN+TF8RzT0TO2cQNl8hJ3/seVt7onMX1C7xB7Qjw==
-----END CERTIFICATE-----
certificate_fingerprint: 85a907693fb60e7f3f48f98a97b5a3bcb3cf90f35b5c7027b9c5f4568122f313
driver: lxc | qemu
driver_version: 6.0.0 | 9.0.1
firewall: nftables
kernel: Linux
kernel_architecture: x86_64
kernel_features:
idmapped_mounts: "true"
netnsid_getifaddrs: "true"
seccomp_listener: "true"
seccomp_listener_continue: "true"
uevent_injection: "true"
unpriv_binfmt: "true"
unpriv_fscaps: "true"
kernel_version: 6.9.4-arch1-1
lxc_features:
cgroup2: "true"
core_scheduling: "true"
devpts_fd: "true"
idmapped_mounts_v2: "true"
mount_injection_file: "true"
network_gateway_device_route: "true"
network_ipvlan: "true"
network_l2proxy: "true"
network_phys_macvlan_mtu: "true"
network_veth_router: "true"
pidfd: "true"
seccomp_allow_deny_syntax: "true"
seccomp_notify: "true"
seccomp_proxy_send_notify_fd: "true"
os_name: Arch Linux
os_version: ""
project: default
server: incus
server_clustered: false
server_event_mode: full-mesh
server_name: Luxuria
server_pid: 3876
server_version: "6.2"
storage: btrfs
storage_version: "6.9"
storage_supported_drivers:
- name: dir
version: "1"
remote: false
- name: lvm
version: 2.03.24(2) (2024-05-16) / 1.02.198 (2024-05-16) / 4.48.0
remote: false
- name: lvmcluster
version: 2.03.24(2) (2024-05-16) / 1.02.198 (2024-05-16) / 4.48.0
remote: true
- name: btrfs
version: "6.9"
remote: false
Issue description
c0rn3j@Luxuria : ~
[0] % incus config show ai
architecture: x86_64
config:
image.architecture: amd64
image.description: Archlinux current amd64 (20240425_04:43)
image.os: Archlinux
image.release: current
image.requirements.secureboot: "false"
image.serial: "20240425_04:43"
image.type: squashfs
image.variant: default
nvidia.runtime: "true"
volatile.base_image: 4f39fcabe30ee9c3a36da0f317ebd1d43a83d405edcad3c0d2be0ef868079e39
volatile.cloud-init.instance-id: a44a0ce2-118a-4e05-a2fe-8f7c1f45b8fe
volatile.eth0.host_name: veth190c1d08
volatile.eth0.hwaddr: 00:16:3e:06:2c:96
volatile.idmap.base: "0"
volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
volatile.last_state.power: RUNNING
volatile.uuid: bce2b402-db8a-4808-8aeb-27cb3457621c
volatile.uuid.generation: bce2b402-db8a-4808-8aeb-27cb3457621c
devices:
gpu:
type: gpu
ephemeral: false
profiles:
- default
stateful: false
description: ""
I have added a GPU to the container, but it only works sporadically. I think I notice this especially after a driver update and a host reboot - the GPU does not seem to come back properly until I restart the container.
Unsure yet how to actually reproduce.
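For reference, a bare gpu device with nvidia.runtime enabled like the one above can be set up with roughly the following (illustrative, not my exact shell history):

incus config set ai nvidia.runtime=true
incus config device add ai gpu gpu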
Here's a demo of the broken container spurring back to life after a reboot:
c0rn3j@Luxuria : ~
[0] % incus exec ai -- zsh -c 'ls -lah /dev/nvi*'
crw-rw-rw- 1 nobody nobody 195, 255 Jun 14 18:46 /dev/nvidiactl
c0rn3j@Luxuria : ~
[0] % incus restart ai
c0rn3j@Luxuria : ~
[0] % incus exec ai -- zsh -c 'ls -lah /dev/nvi*'
crw-rw-rw- 1 nobody nobody 236, 0 Jun 14 18:46 /dev/nvidia-uvm
crw-rw-rw- 1 nobody nobody 236, 1 Jun 14 18:46 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195, 0 Jun 18 12:45 /dev/nvidia0
crw-rw-rw- 1 nobody nobody 195, 255 Jun 14 18:46 /dev/nvidiactl
Information to attach
- [ ] Any relevant kernel output (dmesg)
- [x] Container log (incus info NAME --show-log)
- [ ] Container configuration (incus config show NAME --expanded)
- [ ] Main daemon log (at /var/log/incus/incusd.log)
- [ ] Output of the client with --debug
- [ ] Output of the daemon with --debug (alternatively output of incus monitor --pretty while reproducing the issue)
Could be some kind of race condition between the NVIDIA driver stuff loading and the container starting?
Can you maybe try boot.autostart=false on the container so it doesn't start when the system boots up and see if things then behave properly when you first incus start it?
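For reference, with the instance name used above that would be something like:

incus config set ai boot.autostart=false

and then a manual incus start ai once the host (and the NVIDIA driver) has fully come up.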
Interesting, I run into a similar issue with NVIDIA driver version 545.23.08.
After a reboot of the host, all containers that have a GPU added don't see it. After some investigation I figured out that the CUDA environment isn't loaded. Running a simple CUDA bandwidthTest before Incus starts solves the issue. I wrote a small systemd service to run the bandwidthTest before the incus service starts, and all containers then start fully operational (a sketch of such a unit is below).
During my investigation I came across a post or similar where someone mentioned it might be an issue with the latest NVIDIA driver. Hope this gets solved at some stage.
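A minimal sketch of the kind of unit described above, assuming the Incus daemon runs as incus.service - the ExecStart path is an assumption, so point it at your compiled CUDA bandwidthTest sample (or any other small CUDA program), and the unit name is just illustrative:

# /etc/systemd/system/cuda-warmup.service
[Unit]
Description=Run a small CUDA workload so the NVIDIA/CUDA stack is initialized before Incus
Before=incus.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Assumed path - adjust to wherever your CUDA samples / test binary live
ExecStart=/opt/cuda/extras/demo_suite/bandwidthTest

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable cuda-warmup.service. Note that Before=incus.service only orders the two units; it doesn't pull Incus in, which is fine as long as incus.service is enabled anyway.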
Is this still an active issue for anyone following this?
If so, can you provide some details on kernel version, NVIDIA version, GPU in use and in general what you've been doing?
I did see mention on the forum of something having changed on the NVIDIA front which now somehow requires some kind of CUDA initialization - that would line up with the bandwidth test working around it for you.
This shipped on Arch a couple days before I reported this.
I might not have rebooted, or my mirrors might not have been up to date to have the fix at the time of the report.
I have also changed my setup to load some CUDA stuff in Docker on the host.
One of those two things has fixed/worked around my issues.
Incus could carry such rules, I suppose? Just make sure to use the fixed-up version of that commit if so, so as not to re-trigger the issue - see the latest file: https://gitlab.archlinux.org/archlinux/packaging/packages/nvidia-utils/-/blob/main/nvidia.rules?ref_type=heads
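For anyone debugging this by hand: assuming those rules are there to make sure the /dev/nvidia* nodes get created right at boot rather than lazily on first use (which is what this whole thread points at), the manual equivalent would be roughly the following, where -c picks the device minor, -u loads nvidia-uvm and creates its nodes, and -m creates the modeset node:

nvidia-modprobe -c 0 -u -m

The authoritative rule content is the linked nvidia.rules file, so defer to that rather than this sketch.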