coreos-nvidia-driver
Tried to install this container but seem to have a directory issue
Running CoreOS as a VM on an XCP-NG Xen server with full PCI device passthrough. The GPU is a Dell-branded Nvidia Quadro P400.
When I run docker run --name=nvidia-drivers -v /:/rootfs --privileged bugroger/coreos-nvidia-driver:2135.5.0-390.77-geforce, I get:
+ ROOT_MOUNT_DIR=/root
+ NVIDIA_DRIVER_VERSION=390.77
+ NVIDIA_DRIVER_COREOS_VERSION=2135.5.0
+ NVIDIA_PRODUCT_TYPE=geforce
+ [[ ! -f /root/etc/os-release ]]
+ error 'File /root/etc/os-release not found, /etc/os-release must be mounted into this container.'
/install.sh: line 20: error: command not found
So I changed the docker run command to use "-v /:/root" and that seemed to work, but when I test for nvidia-smi I get:
nvidia-smi
-bash: nvidia-smi: command not found
So I'm pretty sure something is not right.
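For anyone hitting the same mount error: the trace shows install.sh testing /root/etc/os-release (ROOT_MOUNT_DIR=/root), so the host root apparently has to be mounted at /root rather than /rootfs. Below is a minimal sketch of that check plus the adjusted run command. has_os_release is a hypothetical helper name of mine, not something shipped in the image, and I have not read the rest of install.sh.

```shell
# Sketch only. has_os_release is a hypothetical helper, not part of the image.
# It mirrors the test install.sh appears to run: does <root>/etc/os-release exist?
has_os_release() {
  [ -f "$1/etc/os-release" ]
}

# On the host, the root filesystem is /, so this should print the message:
#   has_os_release / && echo "os-release found on host"
#
# Mount the host root at /root so the container sees /root/etc/os-release:
#   docker run --name=nvidia-drivers -v /:/root --privileged \
#     bugroger/coreos-nvidia-driver:2135.5.0-390.77-geforce
```

The docker run line is left as a comment so the snippet can be sourced without starting a container.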
Update: OK, so I realised I was hitting the module loading issue reported here. After applying @rikatz's modprobe fix, my docker build gets past the module insertion step (although the fix does not persist across a reboot), but it then fails with:
Unable to determine the device handle for GPU 0000:00:05.0: Unknown Error
+ umount /lib/modules/4.19.50-coreos-r1/video
+ umount /usr/lib/x86_64-linux-gnu
+ umount /usr/bin
Is this related to my device? I can see:
[INFO 2019-07-22 12:20:24 UTC] Driver compatible! NVIDIA 390.77 (geforce) compiled for CoreOS 2135.5.0
further up in the log. Checking for nvidia:
lsmod | grep -i nvidia
nvidia_modeset 1110016 0
nvidia_drm 16384 0
nvidia_uvm 884736 0
nvidia 14393344 2 nvidia_uvm,nvidia_modeset
ipmi_msghandler 57344 2 ipmi_devintf,nvidia
i2c_core 61440 3 nvidia,psmouse,i2c_piix4
but nvidia-smi is still not found on the system.
In CoreOS, dmesg shows this:
[ 1076.714161] nvidia: module license 'NVIDIA' taints kernel.
[ 1076.718835] Disabling lock debugging due to kernel taint
[ 1076.727040] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1076.737696] nvidia: Unknown symbol ipmi_create_user (err -2)
[ 1076.744516] nvidia: Unknown symbol ipmi_destroy_user (err -2)
[ 1076.749470] nvidia: Unknown symbol ipmi_validate_addr (err -2)
[ 1076.754442] nvidia: Unknown symbol ipmi_free_recv_msg (err -2)
[ 1076.759226] nvidia: Unknown symbol ipmi_set_my_address (err -2)
[ 1076.764128] nvidia: Unknown symbol ipmi_request_settime (err -2)
[ 1076.769042] nvidia: Unknown symbol ipmi_set_gets_events (err -2)
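The ipmi_* symbols above are exported by the ipmi_msghandler module, which is consistent with the lsmod output listing nvidia as a user of ipmi_msghandler. A minimal sketch for pulling the unresolved symbols out of dmesg, with the load step shown as comments. missing_symbols is a hypothetical helper name, and I am assuming (not confirming) that loading ipmi_msghandler first is what the modprobe fix amounts to. The modules-load.d line is likewise an assumption about how to make that one module load at boot, and it would not cover the nvidia modules themselves.

```shell
# Sketch only. missing_symbols is a hypothetical helper.
# Reads dmesg text on stdin and prints each symbol the given module
# reported as unknown, one per line.
missing_symbols() {
  sed -n "s/.*$1: Unknown symbol \([A-Za-z0-9_]*\).*/\1/p"
}

# Usage on the host:
#   dmesg | missing_symbols nvidia
#
# Load the module that provides the ipmi_* symbols before nvidia is inserted:
#   sudo modprobe ipmi_msghandler
#
# Assumption: load it at every boot via systemd's modules-load.d:
#   echo ipmi_msghandler | sudo tee /etc/modules-load.d/ipmi_msghandler.conf
```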
and then this further down...
[ 1094.540821] xen: --> pirq=16 -> irq=36 (gsi=36)
[ 1094.541314] nvidia 0000:00:05.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 1094.549301] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 390.77 Tue Jul 10 18:28:52 PDT 2018 (using threaded interrupts)
[ 1095.011419] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 245
[ 1095.056452] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 390.77 Tue Jul 10 22:10:46 PDT 2018
[ 1095.107396] NVRM: RmInitAdapter failed! (0x23:0x56:470)
[ 1095.112302] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 1095.129106] NVRM: RmInitAdapter failed! (0x23:0x56:470)
[ 1095.133799] NVRM: rm_init_adapter failed for device bearing minor number 0
I also tested the stable-396.44-tesla drivers and get the same issue.
I'm not sure whether this is a driver issue or a XenServer issue. I've tried so many things, but nothing has worked yet.
...tumbleweed...