
Adding an Nvidia GPU works sporadically

C0rn3j opened this issue 1 year ago • 4 comments

Required information

  • Distribution: Arch Linux
  • The output of "incus info":
config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_dev_incus
- migration_pre_copy
- infiniband
- dev_incus_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- dev_incus_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- images_all_projects
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- zfs_delegate
- storage_api_remote_volume_snapshot_copy
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- image_restriction_privileged
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- certificate_description
- disk_io_bus_virtio_blk
- loki_config_instance
- instance_create_start
- clustering_evacuation_stop_options
- boot_host_shutdown_action
- agent_config_drive
- network_state_ovn_lr
- image_template_permissions
- storage_bucket_backup
- storage_lvm_cluster
- shared_custom_block_volumes
- auth_tls_jwt
- oidc_claim
- device_usb_serial
- numa_cpu_balanced
- image_restriction_nesting
- network_integrations
- instance_memory_swap_bytes
- network_bridge_external_create
- network_zones_all_projects
- storage_zfs_vdev
- container_migration_stateful
- profiles_all_projects
- instances_scriptlet_get_instances
- instances_scriptlet_get_cluster_members
- instances_scriptlet_get_project
- network_acl_stateless
- instance_state_started_at
- networks_all_projects
- network_acls_all_projects
- storage_buckets_all_projects
- resources_load
- instance_access
- project_access
- projects_force_delete
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: c0rn3j
auth_user_method: unix
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIICBzCCAY2gAwIBAgIRAJq+jJvvcUBYON1KPndOnUgwCgYIKoZIzj0EAwMwNTEc
    MBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzEVMBMGA1UEAwwMcm9vdEBMdXh1
    cmlhMB4XDTIxMDYyNzE0MjIyOVoXDTMxMDYyNTE0MjIyOVowNTEcMBoGA1UEChMT
    bGludXhjb250YWluZXJzLm9yZzEVMBMGA1UEAwwMcm9vdEBMdXh1cmlhMHYwEAYH
    KoZIzj0CAQYFK4EEACIDYgAElX7iyAw8q/fF9Qd1P5cu7r4UM6evd98hGZu1DAmN
    8EJsdcjSDheOSJWMwxz8DIihpCn2GmT16QCtjNsPJi/W/n38V0wJU8133xMYz2j1
    Ms7rdd3KypcJezCNCaGEFFnHo2EwXzAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0lBAww
    CgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADAqBgNVHREEIzAhggdMdXh1cmlhhwR/
    AAABhxAAAAAAAAAAAAAAAAAAAAABMAoGCCqGSM49BAMDA2gAMGUCMGe6Htwpu5ab
    QZOEcB0H9sS7uMbdyY3NmNQco85vA7Rz8Sx3iGYuxpFNZ6U22iez3AIxAISoiLSX
    KarWaTT503kaM2csVqIN+TF8RzT0TO2cQNl8hJ3/seVt7onMX1C7xB7Qjw==
    -----END CERTIFICATE-----
  certificate_fingerprint: 85a907693fb60e7f3f48f98a97b5a3bcb3cf90f35b5c7027b9c5f4568122f313
  driver: lxc | qemu
  driver_version: 6.0.0 | 9.0.1
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_binfmt: "true"
    unpriv_fscaps: "true"
  kernel_version: 6.9.4-arch1-1
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Arch Linux
  os_version: ""
  project: default
  server: incus
  server_clustered: false
  server_event_mode: full-mesh
  server_name: Luxuria
  server_pid: 3876
  server_version: "6.2"
  storage: btrfs
  storage_version: "6.9"
  storage_supported_drivers:
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.24(2) (2024-05-16) / 1.02.198 (2024-05-16) / 4.48.0
    remote: false
  - name: lvmcluster
    version: 2.03.24(2) (2024-05-16) / 1.02.198 (2024-05-16) / 4.48.0
    remote: true
  - name: btrfs
    version: "6.9"
    remote: false

Issue description

c0rn3j@Luxuria : ~
[0] % incus config show ai            
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Archlinux current amd64 (20240425_04:43)
  image.os: Archlinux
  image.release: current
  image.requirements.secureboot: "false"
  image.serial: "20240425_04:43"
  image.type: squashfs
  image.variant: default
  nvidia.runtime: "true"
  volatile.base_image: 4f39fcabe30ee9c3a36da0f317ebd1d43a83d405edcad3c0d2be0ef868079e39
  volatile.cloud-init.instance-id: a44a0ce2-118a-4e05-a2fe-8f7c1f45b8fe
  volatile.eth0.host_name: veth190c1d08
  volatile.eth0.hwaddr: 00:16:3e:06:2c:96
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: bce2b402-db8a-4808-8aeb-27cb3457621c
  volatile.uuid.generation: bce2b402-db8a-4808-8aeb-27cb3457621c
devices:
  gpu:
    type: gpu
ephemeral: false
profiles:
- default
stateful: false
description: ""

I have added a GPU to the container, but it only seems to work sporadically. I notice this especially after a driver update and a host reboot - the GPU does not seem to come back properly until I restart the container.
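For reference, a device and runtime setup like the config above is typically produced with commands along these lines (a sketch, not the exact commands that were run here):

# Attach the host GPU as a device named "gpu" to the "ai" container
incus config device add ai gpu gpu
# Have Incus map the host's NVIDIA userspace libraries into the container
incus config set ai nvidia.runtime=true
# Restart so the runtime hook takes effect
incus restart ai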

Unsure yet how to actually reproduce.

Here's a demo of the broken container springing back to life after a restart:

c0rn3j@Luxuria : ~
[0] % incus exec ai -- zsh -c 'ls -lah /dev/nvi*'   
crw-rw-rw- 1 nobody nobody 195, 255 Jun 14 18:46 /dev/nvidiactl

c0rn3j@Luxuria : ~
[0] % incus restart ai    

c0rn3j@Luxuria : ~
[0] % incus exec ai -- zsh -c 'ls -lah /dev/nvi*'
crw-rw-rw- 1 nobody nobody 236,   0 Jun 14 18:46 /dev/nvidia-uvm
crw-rw-rw- 1 nobody nobody 236,   1 Jun 14 18:46 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root   root   195,   0 Jun 18 12:45 /dev/nvidia0
crw-rw-rw- 1 nobody nobody 195, 255 Jun 14 18:46 /dev/nvidiactl

Information to attach

  • [ ] Any relevant kernel output (dmesg)
  • [x] Container log (incus info NAME --show-log)
  • [ ] Container configuration (incus config show NAME --expanded)
  • [ ] Main daemon log (at /var/log/incus/incusd.log)
  • [ ] Output of the client with --debug
  • [ ] Output of the daemon with --debug (alternatively output of incus monitor --pretty while reproducing the issue)

C0rn3j avatar Jun 18 '24 10:06 C0rn3j

Could this be some kind of race condition between the NVIDIA driver loading and the container starting?

Can you try setting boot.autostart=false on the container so it doesn't start when the system boots, and see if things then behave properly when you first incus start it?
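Something like this (a sketch, using the ai instance name from above):

# Keep the container from starting automatically at host boot
incus config set ai boot.autostart=false
# After the next host reboot, once the NVIDIA driver is fully up, start it manually and check the device nodes
incus start ai
incus exec ai -- zsh -c 'ls -lah /dev/nvi*'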

stgraber avatar Jun 18 '24 16:06 stgraber

Interesting, I ran into a similar issue with NVIDIA driver version 545.23.08.

After a reboot of the host, none of the containers that have a GPU added can see it. After some investigation I figured out that the CUDA environment isn't initialized. Running a simple CUDA bandwidthTest before Incus starts solves the issue. I wrote a small systemd service to run the bandwidthTest before the Incus service starts, and all containers now come up fully operational.
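Roughly like this (a sketch of such a unit; the bandwidthTest path depends on where your CUDA samples are installed, the one below is just an example, and the unit name is made up):

# /etc/systemd/system/cuda-warmup.service
[Unit]
Description=Initialize CUDA before Incus starts
Before=incus.service

[Service]
Type=oneshot
# Adjust to wherever the CUDA samples' bandwidthTest binary lives on your system
ExecStart=/usr/local/cuda/extras/demo_suite/bandwidthTest

[Install]
WantedBy=multi-user.target

Enabling it with systemctl enable cuda-warmup.service orders it before incus.service at boot.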

During my investigation I came across a post where someone mentioned it might be an issue with the latest NVIDIA driver. I hope this gets solved at some stage.

defect-track avatar Aug 08 '24 09:08 defect-track

Is this still an active issue for anyone following this?

If so, can you provide some details on kernel version, NVIDIA version, GPU in use and in general what you've been doing?
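For anyone still following along, commands roughly like these capture the details asked for above (a sketch, substitute your own instance name for NAME):

uname -r                                                    # kernel version
nvidia-smi --query-gpu=name,driver_version --format=csv    # GPU model and NVIDIA driver version
incus info NAME --show-log                                  # container log
incus config show NAME --expanded                           # effective container configuration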

I did see mention on the forum of something having changed on the NVIDIA front which now requires some kind of CUDA initialization; that would line up with what you're seeing with the bandwidth test workaround.

stgraber avatar Sep 25 '24 21:09 stgraber

This shipped on Arch a couple of days before I reported this.
I might not have rebooted, or the mirrors might not yet have been up to date with the fix at the time of the report.

I have also changed my setup to load some CUDA workloads in Docker on the host.

One of those two things has fixed/worked around my issues.

Incus could carry such rules itself, I suppose? If so, just make sure to use the fixed-up version of that commit so it doesn't re-trigger the issue; see the latest file - https://gitlab.archlinux.org/archlinux/packaging/packages/nvidia-utils/-/blob/main/nvidia.rules?ref_type=heads
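For illustration, the rules are of this general shape (a hypothetical sketch, not the exact contents of the linked Arch file):

# Create the NVIDIA device nodes (including nvidia-uvm) as soon as the driver binds,
# so containers don't depend on something on the host touching CUDA first.
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-modprobe -c0 -u"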

C0rn3j avatar Oct 04 '24 08:10 C0rn3j