coreos-nvidia-driver

Tried to install this container but seem to have directory issue

Open Bodge-IT opened this issue 5 years ago • 2 comments

Running CoreOS as a VM on an XCP-ng Xen server with full PCI device passthrough. The GPU is a Dell-branded NVIDIA Quadro P400.

When I run docker run --name=nvidia-drivers -v /:/rootfs --privileged bugroger/coreos-nvidia-driver:2135.5.0-390.77-geforce, I get:

+ ROOT_MOUNT_DIR=/root
+ NVIDIA_DRIVER_VERSION=390.77
+ NVIDIA_DRIVER_COREOS_VERSION=2135.5.0
+ NVIDIA_PRODUCT_TYPE=geforce
+ [[ ! -f /root/etc/os-release ]]
+ error 'File /root/etc/os-release not found, /etc/os-release must be mounted into this container.'
/install.sh: line 20: error: command not found

So I changed the docker run command to use "-v /:/root" and that seemed to work, but when I run nvidia-smi on the host I get:

-bash: nvidia-smi: command not found
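The trace above suggests two separate problems: the script looks under /root (ROOT_MOUNT_DIR) while the host was mounted at /rootfs, and the script calls an `error` helper that does not appear to be defined. A minimal sketch of an equivalent pre-flight check follows; `check_root_mount` and this `error` function are hypothetical names for illustration, not code taken from the image's install.sh.

```shell
# Hypothetical helper: report a message on stderr.
# (The log shows install.sh calling `error` without it being defined.)
error() {
  printf 'ERROR: %s\n' "$*" >&2
}

# Hypothetical check: succeed only if the host's os-release is visible
# under the given mount directory (default /root, as in the trace).
check_root_mount() {
  root_mount_dir="${1:-/root}"
  if [ ! -f "${root_mount_dir}/etc/os-release" ]; then
    error "File ${root_mount_dir}/etc/os-release not found; mount the host / at ${root_mount_dir}."
    return 1
  fi
  return 0
}
```

With the host root mounted via "-v /:/root", as in the changed command, `check_root_mount /root` would succeed inside the container; with "-v /:/rootfs" it would fail in the same way as the trace.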

So I'm pretty sure something is not right.

Update: OK, so I realised I was hitting the module loading issue reported here. After applying @rikatz's modprobe fix, my docker build gets past the module insertion step (although the fix does not persist after a reboot), but it fails with:

Unable to determine the device handle for GPU 0000:00:05.0: Unknown Error
+ umount /lib/modules/4.19.50-coreos-r1/video
+ umount /usr/lib/x86_64-linux-gnu
+ umount /usr/bin

Is this related to my device? Further up in the log I can see:

[INFO 2019-07-22 12:20:24 UTC] Driver compatible! NVIDIA 390.77 (geforce) compiled for CoreOS 2135.5.0

Checking for nvidia:

lsmod | grep -i nvidia
nvidia_modeset       1110016  0
nvidia_drm             16384  0
nvidia_uvm            884736  0
nvidia              14393344  2 nvidia_uvm,nvidia_modeset
ipmi_msghandler        57344  2 ipmi_devintf,nvidia
i2c_core               61440  3 nvidia,psmouse,i2c_piix4

but nvidia-smi is still not found on the system.
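The lsmod check above can be made a little more systematic. The sketch below reads lsmod-style text on stdin and prints any expected module that is missing; `missing_modules` is a hypothetical helper name, and the list of modules to pass is an assumption based on the output shown.

```shell
# Hypothetical helper: read lsmod-style output on stdin and print each
# requested module name that does not appear in the first column.
missing_modules() {
  loaded="$(awk '{ print $1 }')"
  for wanted in "$@"; do
    # -x matches the whole line, -F treats the name as a fixed string.
    printf '%s\n' "$loaded" | grep -qxF "$wanted" || printf '%s\n' "$wanted"
  done
}
```

For example, `lsmod | missing_modules nvidia nvidia_uvm nvidia_modeset nvidia_drm` prints nothing when all four are loaded. Whether the nvidia-smi binary is on the host PATH is a separate question, which `command -v nvidia-smi` answers.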

I get this in CoreOS with dmesg:

[ 1076.714161] nvidia: module license 'NVIDIA' taints kernel.
[ 1076.718835] Disabling lock debugging due to kernel taint
[ 1076.727040] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1076.737696] nvidia: Unknown symbol ipmi_create_user (err -2)
[ 1076.744516] nvidia: Unknown symbol ipmi_destroy_user (err -2)
[ 1076.749470] nvidia: Unknown symbol ipmi_validate_addr (err -2)
[ 1076.754442] nvidia: Unknown symbol ipmi_free_recv_msg (err -2)
[ 1076.759226] nvidia: Unknown symbol ipmi_set_my_address (err -2)
[ 1076.764128] nvidia: Unknown symbol ipmi_request_settime (err -2)
[ 1076.769042] nvidia: Unknown symbol ipmi_set_gets_events (err -2)

and then this further down...

[ 1094.540821] xen: --> pirq=16 -> irq=36 (gsi=36)
[ 1094.541314] nvidia 0000:00:05.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 1094.549301] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  390.77  Tue Jul 10 18:28:52 PDT 2018 (using threaded interrupts)
[ 1095.011419] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 245
[ 1095.056452] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  390.77  Tue Jul 10 22:10:46 PDT 2018
[ 1095.107396] NVRM: RmInitAdapter failed! (0x23:0x56:470)
[ 1095.112302] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 1095.129106] NVRM: RmInitAdapter failed! (0x23:0x56:470)
[ 1095.133799] NVRM: rm_init_adapter failed for device bearing minor number 0
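The "Unknown symbol ipmi_*" lines in the dmesg output above name symbols exported by the kernel's ipmi_msghandler module, which is consistent with the lsmod listing showing ipmi_msghandler in use by nvidia. Assuming the modprobe fix mentioned earlier amounts to loading ipmi_msghandler before nvidia (the post does not spell it out), the sketch below lists the missing symbols from dmesg-style text and writes a systemd modules-load.d entry so the module is loaded at boot. Both function names are hypothetical, and this does not address the later RmInitAdapter failure.

```shell
# Hypothetical helper: print the unique "Unknown symbol" names that
# dmesg-style text on stdin reports for the given module.
unknown_symbols() {
  module="$1"
  sed -n "s/.*${module}: Unknown symbol \([A-Za-z0-9_]*\).*/\1/p" | sort -u
}

# Hypothetical helper: write <conf_dir>/<module>.conf containing the
# module name. systemd-modules-load reads /etc/modules-load.d at boot;
# the directory is a parameter so the function can be tried safely.
persist_module() {
  module="$1"
  conf_dir="${2:-/etc/modules-load.d}"
  mkdir -p "$conf_dir" && printf '%s\n' "$module" > "${conf_dir}/${module}.conf"
}
```

Usage might look like `dmesg | unknown_symbols nvidia` to confirm that only ipmi_* symbols are missing, followed by `sudo modprobe ipmi_msghandler` and, run as root, `persist_module ipmi_msghandler` so the module is loaded again after a reboot.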

I also tested the stable-396.44-tesla drivers and get the same issue.

Bodge-IT, Jul 22 '19 11:07

I'm not sure if this is a driver issue or a XenServer issue. I've tried so many things, but nothing has worked yet.

Bodge-IT, Jul 29 '19 11:07

...tumbleweed...

Bodge-IT, Aug 06 '19 06:08