
Proxmox: installing the NVIDIA driver, passing the GPU through to LXC, nested Docker, and assorted troubleshooting

Open luckyyyyy opened this issue 2 years ago • 0 comments

Basic steps

# Blacklist nouveau: add the line "blacklist nouveau"
vim /etc/modprobe.d/blacklist.conf
# Apply the change
update-initramfs -u
# Reboot
reboot
# Download the matching driver from the official NVIDIA site and make it executable
chmod +x NVIDIA-Linux-x86_64-525.116.04.run
# Install
./NVIDIA-Linux-x86_64-525.116.04.run
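
The vim step above can also be scripted. The following is a minimal sketch, assuming the same config path as above; the function name `blacklist_nouveau` is made up for illustration and is not part of any tool.

```shell
# Append "blacklist nouveau" to a modprobe config file unless it is already there.
# The function name and the default path are illustrative assumptions.
blacklist_nouveau() {
    local conf="${1:-/etc/modprobe.d/blacklist.conf}"
    touch "$conf"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF 'blacklist nouveau' "$conf"; then
        echo 'blacklist nouveau' >> "$conf"
    fi
}
```

Run as root, `blacklist_nouveau && update-initramfs -u` followed by a reboot should have the same effect as the manual edit. Calling it twice does not duplicate the line.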

Errors

  1. Error: the distribution-provided pre-install script failed.
  2. Error: Unable to find the development tool 'cc' in your path.
  3. Error: Unable to find the development tool 'make' in your path.
  4. Error: The kernel module failed to load. Secure boot is enabled on this system.
  5. The signed kernel module failed to load.
  6. Error: Unable to load the kernel module 'nvidia.ko'
  7. Error: An NVIDIA kernel module 'nvidia-drm' appears to already be loaded in your kernel.
  8. Error: An NVIDIA kernel module 'nvidia-modeset' appears to already be loaded in your kernel.
  9. WARNING: Unable to find a suitable destination to install 32-bit compatibility libraries.
  10. WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files.

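Solutions

  • Error 1: just continue the installation. This message only asks you to confirm that you want to install this driver.
  • Errors 2 and 3 occur because gcc and make are not installed:
apt-get install gcc
apt-get install make
  • Errors 4 and 5 occur because Secure Boot has not been disabled in the BIOS.
  • Error 6 means the preparation steps under "Basic steps" (blacklisting nouveau, rebuilding the initramfs, rebooting) were not completed properly.
  • Errors 7 and 8: first make sure Secure Boot is disabled, then remove the already installed graphics driver:
apt-get purge 'nvidia*'
apt-get autoremove
reboot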
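
Most of these errors can be anticipated before launching the installer. Below is a rough preflight sketch; the function names `missing_tools` and `loaded_gpu_modules` are hypothetical helpers, not part of the NVIDIA installer.

```shell
# Preflight helpers for the NVIDIA .run installer. Function names are illustrative.

# Print every tool given as an argument that is not found on PATH
# (relevant to errors 2 and 3).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Print NVIDIA and nouveau kernel modules that are currently loaded
# (relevant to errors 6 to 8). Reads /proc/modules unless another file is given.
loaded_gpu_modules() {
    awk '$1 ~ /^(nvidia|nouveau)/ { print $1 }' "${1:-/proc/modules}"
}
```

For example, `missing_tools cc make` should print nothing once gcc and make are installed, and `loaded_gpu_modules` should print nothing on a host that is ready for a fresh install. If the `mokutil` package happens to be installed, `mokutil --sb-state` reports whether Secure Boot is enabled.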

After installation

Add the following two lines to /etc/modules-load.d/nvidia.conf

nvidia
nvidia-uvm

Add udev rules

Create /etc/udev/rules.d/70-nvidia.rules with the following content

# /etc/udev/rules.d/70-nvidia.rules
# Create /dev/nvidia0, /dev/nvidia1 and /dev/nvidiactl when nvidia module is loaded
KERNEL=="nvidia", RUN+="/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia*'"
# Create the CUDA node when nvidia_uvm CUDA module is loaded
KERNEL=="nvidia_uvm", RUN+="/bin/bash -c '/usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*'"

Reboot.
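
After the reboot it is worth confirming that the udev rules actually created the device nodes. This is a minimal sketch; `missing_nvidia_nodes` is a hypothetical helper, and the node list only covers the devices these two rules are meant to create.

```shell
# Print the expected NVIDIA device nodes that are missing from a /dev directory.
# The function name is illustrative; the directory defaults to /dev.
missing_nvidia_nodes() {
    local dev_dir="${1:-/dev}"
    for node in nvidia0 nvidiactl nvidia-uvm nvidia-uvm-tools; do
        [ -e "$dev_dir/$node" ] || echo "$dev_dir/$node"
    done
}
```

An empty result from `missing_nvidia_nodes` suggests the modules were loaded and the rules fired. Any path it prints is a node that the LXC bind mounts would not find on the host.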

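LXC configuration

Use the following as a reference; the devices are added through cgroup2. The major number of nvidia-uvm (507 here) is assigned dynamically and may differ on your host; check it with ls -l /dev/nvidia-uvm.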

lxc.apparmor.profile: unconfined
lxc.cgroup.devices.allow: a
lxc.cap.drop: 
lxc.cgroup2.devices.allow: c 10:200 rwm
lxc.mount.entry: /dev/net/tun dev/net/tun none bind,create=file
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.cgroup2.devices.allow: c 507:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
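
Since device major numbers can vary between hosts, the `lxc.cgroup2.devices.allow` lines can be derived from `/proc/devices` rather than copied. This is a sketch under that assumption; `nvidia_cgroup_lines` is a made-up helper name, and it only covers the NVIDIA and DRM entries, not the 10:200 tun device.

```shell
# Emit lxc.cgroup2.devices.allow lines for the NVIDIA and DRM character devices
# registered on the host. Reads /proc/devices unless another file is given.
# The function name is illustrative.
nvidia_cgroup_lines() {
    awk '
        /^Character devices:/ { in_char = 1; next }
        /^Block devices:/     { in_char = 0 }
        # $1 is the major number, $2 the driver name; print each major only once
        in_char && ($2 ~ /^nvidia/ || $2 == "drm") && !seen[$1]++ {
            print "lxc.cgroup2.devices.allow: c " $1 ":* rwm"
        }
    ' "${1:-/proc/devices}"
}
```

Running `nvidia_cgroup_lines` on the Proxmox host after the driver is loaded prints lines that can be compared against, or pasted into, the container config.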

LXC

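The GPU driver must also be installed inside the container. Because the container uses the host's kernel module, run the installer with --no-kernel-module: ./NVIDIA-Linux-x86_64-535.104.05.run --no-kernel-module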
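
The user-space libraries installed in the container need to match the version of the kernel module loaded on the host. Note that the host example earlier uses 525.116.04 while this one uses 535.104.05. The sketch below compares the two; both function names are hypothetical, and it assumes `/proc/driver/nvidia/version` is readable where the check runs.

```shell
# Extract the driver version from an installer file name such as
# NVIDIA-Linux-x86_64-535.104.05.run. Function names are illustrative.
installer_version() {
    local name="${1##*/}"   # strip any directory part
    name="${name%.run}"     # strip the .run suffix
    echo "${name##*-}"      # keep what follows the last dash
}

# Extract the kernel module version from /proc/driver/nvidia/version
# (or from another file passed as the first argument).
kernel_module_version() {
    grep 'NVRM version' "${1:-/proc/driver/nvidia/version}" \
        | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}
```

If `installer_version ./NVIDIA-Linux-x86_64-535.104.05.run` and `kernel_module_version` print different values, `nvidia-smi` inside the container typically fails with "Failed to initialize NVML: Driver/library version mismatch".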


LXC Docker

Follow NVIDIA's official installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
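
Once the toolkit is installed and Docker has been configured for it as the guide describes, a quick smoke test is to run `nvidia-smi` in a throwaway container. The wrapper below is only a sketch; `docker_gpu_smoke_test` and the `DRY_RUN` variable are assumptions introduced here, and the `ubuntu` image is just a convenient choice.

```shell
# Run nvidia-smi in a throwaway container to check that Docker can see the GPU.
# With DRY_RUN=1 the command is printed instead of executed.
# The function name and the DRY_RUN variable are illustrative.
docker_gpu_smoke_test() {
    set -- docker run --rm --gpus all ubuntu nvidia-smi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

If the call prints the usual `nvidia-smi` table, the whole chain (host driver, LXC device mounts, container driver, container toolkit) is working.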

luckyyyyy, May 26 '23 22:05