coreos-nvidia
Final installation path instructions and tips
On CoreOS, /lib64/, /usr/lib64/, and co. all reside on a read-only filesystem. You might need to create a new directory elsewhere and list its location in a file under /etc/ld.so.conf.d/
- There is no /etc/ld.so.conf.d in 1298.5.0; only /etc/ld.so.conf, which is a symlink to /usr/lib/ld.so.conf on the read-only filesystem. (There was a similar problem with .bashrc; the solution is to remove the link and replace it with an actual file.)
I created /opt/bin and /opt/lib64, but ldconfig complained about libEGL.so.1, so I made it a symlink:
sudo ln -sfT libEGL.so.375.20 /opt/lib64/libEGL.so.1
- What about the kernel modules?
sudo mkdir -p /opt/lib64/modules/4.9.9-coreos-r1/kernel
But modprobe will not be able to find them there, because it is a non-standard path (insmod should work).
I wonder if it is simpler to re-mount the drive as rw and copy the kernel modules and all, rather than trying to install in /opt and work around issues.
> creating device nodes under /dev/

Do you mean after modprobe, or perhaps insmod?
From [here](https://groups.google.com/forum/#!topic/coreos-user/s9Sy_gBl94o):
> Kernel modules can be installed under `/opt/lib/modules/$(uname -r)/` as long as you use `--dirname /opt` when calling modprobe.

So now I only need to use udev; otherwise I will have to resort to [manually adding the nodes](https://gist.githubusercontent.com/tleyden/74f593a0beea300de08c/raw/95ed93c5751a989e58153db6f88c35515b7af120/nvidia_devices.sh) ([same](http://askubuntu.com/a/748905/103961)).
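For illustration, a sketch of that flow; it assumes the modules were copied under /opt/lib/modules/$(uname -r)/, and the mknod fallback mirrors the linked gist (195 is the NVIDIA character-device major):

# generate modules.dep for the /opt tree, then point modprobe at it
sudo depmod -b /opt "$(uname -r)"
sudo modprobe --dirname /opt nvidia
# hypothetical fallback if udev does not create the nodes (per the linked gist)
sudo mknod -m 666 /dev/nvidiactl c 195 255
sudo mknod -m 666 /dev/nvidia0 c 195 0    # one node per GPU: nvidia1, nvidia2, ...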
You can create /etc/ld.so.conf.d. It's not there by default and /etc is R/W.
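For example (a sketch, assuming the stock /etc/ld.so.conf already has an `include /etc/ld.so.conf.d/*.conf` line and the libraries were copied to /opt/lib64):

sudo mkdir -p /etc/ld.so.conf.d
echo /opt/lib64 | sudo tee /etc/ld.so.conf.d/nvidia.conf
sudo ldconfig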
I've never seen problems with libEGL.
The kernel modules could go in a variety of places: /etc/modules, /opt/lib/modules or even a R/W overlay over /lib/modules. I use insmod directly. And yes, device nodes should be created after the modules are loaded.
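A minimal sketch of the overlay variant (the /opt paths are assumptions; overlayfs needs an empty work directory on the same filesystem as the upper directory):

sudo mkdir -p /opt/modules /opt/modules.wd
sudo mount -t overlay overlay -o lowerdir=/lib/modules,upperdir=/opt/modules,workdir=/opt/modules.wd /lib/modules
# /lib/modules is now writable; copy the .ko files in and run depmod as usual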
I'll try to add more details. I didn't add them before, because different people have different needs and toolchains.
I tried to emulate Ubuntu in loading nvidia:
- created a udev rules file:
$ cat /etc/udev/rules.d/71-nvidia.rules
# Tag the device as master-of-seat so that logind is happy
# (see LP: #1365336)
SUBSYSTEM=="pci", ATTRS{vendor}=="0x10de", DRIVERS=="nvidia", TAG+="seat", TAG+="master-of-seat"
# Start and stop nvidia-persistenced on power on and power off
# respectively
ACTION=="add" DEVPATH=="/bus/acpi/drivers/NVIDIA ACPI Video Driver" SUBSYSTEM=="drivers" RUN+="/bin/systemctl start --no-block nvidia-persistenced.service"
ACTION=="remove" DEVPATH=="/bus/acpi/drivers/NVIDIA ACPI Video Driver" SUBSYSTEM=="drivers" RUN+="/bin/systemctl stop --no-block nvidia-persistenced"
# Start and stop nvidia-persistenced when loading and unloading
# the driver
ACTION=="add" DEVPATH=="/module/nvidia" SUBSYSTEM=="module" RUN+="/bin/systemctl start --no-block nvidia-persistenced.service"
ACTION=="remove" DEVPATH=="/module/nvidia" SUBSYSTEM=="module" RUN+="/bin/systemctl stop --no-block nvidia-persistenced"
# Load and unload nvidia-modeset module
ACTION=="add" DEVPATH=="/module/nvidia" SUBSYSTEM=="module" RUN+="/usr/sbin/insmod /opt/lib/modules/4.9.9-coreos-r1/video/nvidia-modeset.ko"
ACTION=="remove" DEVPATH=="/module/nvidia" SUBSYSTEM=="module" RUN+="/usr/sbin/rmmod -r nvidia-modeset"
# Load and unload nvidia-drm module
ACTION=="add" DEVPATH=="/module/nvidia" SUBSYSTEM=="module" RUN+="/usr/sbin/insmod /opt/lib/modules/4.9.9-coreos-r1/video/nvidia-drm.ko"
ACTION=="remove" DEVPATH=="/module/nvidia" SUBSYSTEM=="module" RUN+="/usr/sbin/rmmod nvidia-drm"
# Load and unload nvidia-uvm module
ACTION=="add" DEVPATH=="/module/nvidia" SUBSYSTEM=="module" RUN+="/usr/sbin/insmod /opt/lib/modules/4.9.9-coreos-r1/video/nvidia-uvm.ko"
ACTION=="remove" DEVPATH=="/module/nvidia" SUBSYSTEM=="module" RUN+="/usr/sbin/rmmod -r nvidia-uvm"
# This will create the device nvidia device nodes
ACTION=="add" DEVPATH=="/module/nvidia" SUBSYSTEM=="module" RUN+="/opt/bin/nvidia-smi"
# Create the device node for the nvidia-uvm module
ACTION=="add" DEVPATH=="/module/nvidia_uvm" SUBSYSTEM=="module" RUN+="/opt/bin/create-uvm-dev-node.sh"
- With this extra script in /opt/bin:
$ cat /opt/bin/create-uvm-dev-node.sh
#!/bin/sh
# Get the major device number for nvidia-uvm and create the node
major=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
if [ -n "$major" ]; then
mknod -m 666 /dev/nvidia-uvm c $major 0
fi
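The script has to be executable for the RUN+= rule to work:

sudo chmod +x /opt/bin/create-uvm-dev-node.sh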
- Created the user:
useradd --system --home '/' --shell '/sbin/nologin' -c 'NVIDIA Persistence Daemon' nvidia-persistenced
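The rules above also reference nvidia-persistenced.service, which is not shown here; a minimal sketch, assuming the daemon binary was copied to /opt/bin (paths and options are assumptions, adapt as needed):

[Unit]
Description=NVIDIA Persistence Daemon

[Service]
Type=forking
# --user drops privileges to the account created above
ExecStart=/opt/bin/nvidia-persistenced --user nvidia-persistenced
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced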
Now I can just insmod the nvidia.ko and it loads everything needed:
sudo insmod /opt/lib64/modules/4.9.9-coreos-r1/video/nvidia.ko
I am thinking of adding another rule to do that, and will update if I can get it working. For now I have another service that I run that does the required insmod:
$ cat nvidia-start.service
[Unit]
Description=Load NVIDIA module
[Service]
ExecStart=/usr/sbin/insmod /opt/lib/modules/4.9.9-coreos-r1/video/nvidia.ko
[Install]
WantedBy=multi-user.target
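Enabling the unit makes the module load at boot:

sudo systemctl enable nvidia-start.service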
I am using a CoreOS system and want to install nvidia-docker on it, which requires the NVIDIA drivers as a prerequisite. I have run build.sh and have the three archives. I am a total noob here; could you please help me with how to proceed to set it up?
I have followed the steps that @dashesy has mentioned in the last comment. So right now, my /opt/bin contains all the .so files from the libraries directory and files like nvidia-smi and nvidia-settings from the tools directory. The /opt/lib/modules/4.9.16-coreos-r1/video directory contains the .ko files.
But running the nvidia-smi command gives the following output:
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system. Please also try adding directory that contains libnvidia-ml.so to your system PATH.
Please help. Thanks in advance.
@DevipriyaSarkar did you try the PR #4 (and its nvidia_install and nvidia_docker_install)? Also, you need to make sure nvidia-uvm is not loaded when you call build.sh, or the install script bails out with unfinished kernel modules (disable nvidia-start, reboot, and call build.sh again if that is the case).
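A quick way to check for leftover modules before re-running build.sh (standard commands, not specific to this repo):

lsmod | grep nvidia          # should print nothing before running build.sh
sudo rmmod nvidia_uvm        # unload it if it is still loaded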
Also, if you already had nvidia-docker installed before running this, you may need to rebuild the volume to get the hardlinks in the volume (you can inspect what the volume contains in /var/lib/nvidia-docker/volumes).
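A sketch of rebuilding the driver volume, assuming nvidia-docker 1.x where the plugin manages a docker volume named after the driver version (the exact name will differ on your system):

docker volume ls | grep nvidia_driver    # find the driver volume
docker volume rm nvidia_driver_375.20    # hypothetical name; the plugin recreates it on next use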