Proxmox: installing the NVIDIA driver, passing the GPU through to LXC, nested Docker, and assorted troubleshooting
Basic steps
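# Blacklist nouveau: add the line "blacklist nouveau" to this file
vim /etc/modprobe.d/blacklist.conf
# Rebuild the initramfs so the change takes effect
update-initramfs -u
# Reboot
reboot
# Download the matching driver from the NVIDIA website and make it executable
chmod +x NVIDIA-Linux-x86_64-525.116.04.run
# Install
./NVIDIA-Linux-x86_64-525.116.04.run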
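Editing the blacklist file by hand works, but repeating the step can leave duplicate entries. The following is a minimal sketch of an idempotent alternative; the function name ensure_blacklisted is hypothetical and not part of any existing tool.

```shell
#!/bin/sh
# Append "blacklist MODULE" to a modprobe config file unless that exact
# line is already present. (Hypothetical helper, for illustration only.)
# Usage: ensure_blacklisted MODULE FILE
ensure_blacklisted() {
    module=$1
    file=$2
    # -q: quiet, -x: match the whole line, -F: treat the pattern literally
    if ! grep -qxF "blacklist $module" "$file" 2>/dev/null; then
        printf 'blacklist %s\n' "$module" >> "$file"
    fi
}

# On the Proxmox host (as root) the call would look like:
#   ensure_blacklisted nouveau /etc/modprobe.d/blacklist.conf
#   update-initramfs -u
#   reboot
```

Running the function twice leaves a single "blacklist nouveau" line in the file.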
Errors
- Error: the distribution-provided pre-install script failed.
- Error: Unable to find the development tool 'cc' in your path.
- Error: Unable to find the development tool 'make' in your path.
- Error: The kernel module failed to load. Secure boot is enabled on this system.
- The signed kernel module failed to load.
- Error: Unable to load the kernel module 'nvidia.ko'
- Error: An NVIDIA kernel module 'nvidia-drm' appears to already be loaded in your kernel.
- Error: An NVIDIA kernel module 'nvidia-modeset' appears to already be loaded in your kernel.
- WARNING: Unable to find a suitable destination to install 32-bit compatibility libraries.
- WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files.
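Fixes
- Error 1: just continue the installation. This message only asks you to confirm that you want to install this driver.
- Errors 2 and 3: gcc and make are not installed.
apt-get install gcc
apt-get install make
- Errors 4 and 5: Secure Boot has not been disabled in the BIOS.
- Error 6: the preparation steps were not completed. Recheck them (nouveau blacklisted, initramfs rebuilt, host rebooted).
- Errors 7 and 8: first make sure Secure Boot is disabled, then remove the NVIDIA driver that is already installed:
apt-get purge 'nvidia*'
apt-get autoremove
reboot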
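Most of the errors above can be caught before starting the installer. The following is a rough pre-flight sketch, assuming lsmod prints its usual header line followed by one module per line; the helper names missing_tools and loaded_modules are hypothetical.

```shell
#!/bin/sh
# Print every named tool that cannot be found in PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Read lsmod-style text on stdin and print the names of loaded
# modules whose name starts with the given prefix.
loaded_modules() {
    # NR > 1 skips the "Module Size Used by" header line
    awk -v prefix="$1" 'NR > 1 && index($1, prefix) == 1 { print $1 }'
}

# On the host:
#   missing_tools cc make            # anything printed still needs installing
#   lsmod | loaded_modules nvidia    # anything printed is an already-loaded NVIDIA module
#   lsmod | loaded_modules nouveau   # should print nothing after blacklisting
#   mokutil --sb-state               # reports the Secure Boot state, if mokutil is installed
```

If all three helper calls print nothing and Secure Boot is reported as disabled, the errors listed above should not be the ones that stop the installer.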
After installation
Add these two lines to /etc/modules-load.d/nvidia.conf so the modules are loaded at boot
nvidia
nvidia-uvm
Add udev rules
Create /etc/udev/rules.d/70-nvidia.rules with the following content
# /etc/udev/rules.d/70-nvidia.rules
# Create /dev/nvidia0, /dev/nvidia1 and /dev/nvidiactl when nvidia module is loaded
KERNEL=="nvidia", RUN+="/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia*'"
# Create the CUDA node when nvidia_uvm CUDA module is loaded
KERNEL=="nvidia_uvm", RUN+="/bin/bash -c '/usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*'"
Reboot
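After the reboot it is worth confirming that the device nodes really exist before touching the container. The following is a small sketch; the function name missing_nvidia_nodes is hypothetical, and the node list assumes a single GPU (nvidia0).

```shell
#!/bin/sh
# Print each expected NVIDIA device node that is absent under DEV_DIR.
# DEV_DIR defaults to /dev; the node list assumes one GPU.
missing_nvidia_nodes() {
    dev_dir=${1:-/dev}
    for node in nvidia0 nvidiactl nvidia-uvm; do
        [ -e "$dev_dir/$node" ] || printf '%s\n' "$dev_dir/$node"
    done
}

# On the host:
#   missing_nvidia_nodes          # no output means all three nodes exist
#   ls -l /dev/nvidia*            # shows permissions and major/minor numbers
```

If nvidia-uvm is reported as missing, the nvidia-uvm line in /etc/modules-load.d/nvidia.conf and the second udev rule are the first things to recheck.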
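LXC configuration
For reference, the following entries in the container's config file (on Proxmox, /etc/pve/lxc/<CTID>.conf) use cgroup2 to allow the corresponding devices. 10:200 is /dev/net/tun, 195 covers the NVIDIA devices and 226 covers /dev/dri. The nvidia-uvm major (507 here) is assigned dynamically, so check the value on your host with ls -l /dev/nvidia*.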
lxc.apparmor.profile: unconfined
lxc.cgroup.devices.allow: a
lxc.cap.drop:
lxc.cgroup2.devices.allow: c 10:200 rwm
lxc.mount.entry: /dev/net/tun dev/net/tun none bind,create=file
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.cgroup2.devices.allow: c 507:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
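The major numbers in the allow rules above are the ones from the author's host and may differ elsewhere, the nvidia-uvm one in particular. The following is a minimal sketch that derives the rules from ls -l output, assuming the usual GNU ls layout where the fifth field is the major number followed by a comma; the helper name lxc_allow_rules is hypothetical.

```shell
#!/bin/sh
# Read `ls -l` output for device nodes on stdin and print one cgroup2
# allow rule per distinct character-device major number.
lxc_allow_rules() {
    awk '$1 ~ /^c[rwx-]/ {
        split($5, parts, ",")        # field 5 looks like "195," (major, comma)
        major = parts[1]
        if (!(major in seen)) {      # emit each major only once
            seen[major] = 1
            print "lxc.cgroup2.devices.allow: c " major ":* rwm"
        }
    }'
}

# On the host:
#   ls -l /dev/nvidia* /dev/dri/* | lxc_allow_rules
```

The printed lines can be compared with, or pasted over, the 195, 226 and 507 entries above. Lines that do not describe a character device (directories, symlinks, "total" lines) are ignored.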
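Inside the LXC container
The NVIDIA driver also has to be installed inside the container. Because the container shares the host's kernel module, install only the userspace part by running the installer with --no-kernel-module, for example ./NVIDIA-Linux-x86_64-535.104.05.run --no-kernel-module. The installer version should match the driver version running on the host.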
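Since the container relies on the kernel module loaded on the host, the userspace libraries installed inside it generally need to be the same version as the host driver. The following is a small sketch for reading the version out of an installer file name so it can be compared with what the host reports; the function name installer_version is hypothetical.

```shell
#!/bin/sh
# Extract the driver version from an NVIDIA .run installer file name,
# e.g. NVIDIA-Linux-x86_64-535.104.05.run gives 535.104.05.
installer_version() {
    name=${1##*/}                 # strip any leading directory
    name=${name%.run}             # strip the .run suffix
    printf '%s\n' "${name##*-}"   # keep what follows the last '-'
}

# Compare with the driver version reported on the host:
#   installer_version NVIDIA-Linux-x86_64-535.104.05.run
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
```

If the two values differ, nvidia-smi inside the container will typically fail with a driver/library version mismatch, so it is safer to download the installer that matches the host.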
Docker inside LXC
Follow NVIDIA's official installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html