NVIDIA GPU drivers fail to install on arm64 (g5g.xlarge on AWS)
Description
Unable to use the NVIDIA driver auto-install on an arm64 machine (g5g.xlarge) on AWS. It appears to download and extract the x86_64 installer instead of an arm64 one.
A quick look shows the ebuild is tagged arm64, so I am not sure where the wires are getting crossed. The docs don't say the feature is x86-only.
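To illustrate the suspected mismatch: NVIDIA's Linux installers carry the architecture in the file name (`x86_64` or `aarch64`), so the name would need to follow the machine architecture. Below is a minimal sketch of that mapping. The function name `nvidia_installer_name` is hypothetical and is not taken from the actual `setup-nvidia` script, whose logic I have not verified.

```shell
#!/bin/sh
# Hypothetical helper (not part of setup-nvidia): map a machine architecture
# to the installer file name that matches it.
nvidia_installer_name() {
  machine="$1"   # e.g. the output of `uname -m`
  version="$2"   # e.g. NVIDIA_DRIVER_VERSION from /usr/share/flatcar/nvidia-metadata
  case "$machine" in
    x86_64)        arch="x86_64" ;;
    aarch64|arm64) arch="aarch64" ;;
    *) echo "unsupported architecture: $machine" >&2; return 1 ;;
  esac
  echo "NVIDIA-Linux-${arch}-${version}.run"
}

nvidia_installer_name aarch64 535.104.05
# prints: NVIDIA-Linux-aarch64-535.104.05.run
```

On the affected instance, passing `"$(uname -m)"` as the first argument should select the `aarch64` name rather than the `x86_64` one seen in the log.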
Impact
nvidia.service fails to start, so the GPU drivers are not installed.
Environment and steps to reproduce
- Set-up:
g5g.xlarge on AWS.
$ cat /usr/share/flatcar/nvidia-metadata
NVIDIA_DRIVER_VERSION=535.104.05
NVIDIA_PRODUCT_TYPE=tesla
- Error:
$ journalctl -u nvidia
Feb 13 14:39:35 localhost systemd[1]: Starting nvidia.service - NVIDIA Configure Service...
Feb 13 14:39:35 localhost setup-nvidia[2556]: Downloading Flatcar Container Linux Developer Container for version: 3941.1.0
Feb 13 14:39:35 ip-172-40-20-6 setup-nvidia[2748]: % Total % Received % Xferd Average Speed Time Time Time Current
Feb 13 14:39:35 ip-172-40-20-6 setup-nvidia[2748]: Dload Upload Total Spent Left Speed
Feb 13 14:39:54 ip-172-40-20-6 setup-nvidia[2748]: [1.6K blob data]
Feb 13 14:40:50 ip-172-40-20-6 setup-nvidia[2556]: Downloading NVIDIA 535.104.05 Driver
Feb 13 14:40:50 ip-172-40-20-6 setup-nvidia[3034]: % Total % Received % Xferd Average Speed Time Time Time Current
Feb 13 14:40:50 ip-172-40-20-6 setup-nvidia[3034]: Dload Upload Total Spent Left Speed
Feb 13 14:40:52 ip-172-40-20-6 setup-nvidia[3034]: [395B blob data]
Feb 13 14:40:52 ip-172-40-20-6 setup-nvidia[2556]: Extract the NVIDIA Driver Installer 535.104.05
Feb 13 14:40:52 ip-172-40-20-6 setup-nvidia[2556]: /opt/nvidia/workdir/nvidia-workdir /
Feb 13 14:40:52 ip-172-40-20-6 setup-nvidia[3037]: Creating directory NVIDIA-Linux-x86_64-535.104.05
Feb 13 14:40:52 ip-172-40-20-6 setup-nvidia[3037]: Verifying archive integrity... OK
Feb 13 14:40:53 ip-172-40-20-6 setup-nvidia[3037]: Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.104.05
Feb 13 14:40:58 ip-172-40-20-6 setup-nvidia[3067]: ...................................................................................................>
Feb 13 14:40:58 ip-172-40-20-6 setup-nvidia[2556]: /
Feb 13 14:40:58 ip-172-40-20-6 setup-nvidia[2556]: Spawn system-nspawn container to install the NVIDIA drivers
Feb 13 14:40:58 ip-172-40-20-6 sudo[3084]: root : PWD=/ ; USER=root ; COMMAND=/usr/bin/systemd-nspawn --read-only --volatile=overlay --image=/opt/>
Feb 13 14:40:58 ip-172-40-20-6 sudo[3084]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Feb 13 14:41:14 ip-172-40-20-6 setup-nvidia[3143]: cp: cannot stat '/opt/nvidia/workdir/nvidia-workdir/NVIDIA-Linux-x86_64-535.104.05/install-mod/*.ko>
Feb 13 14:41:14 ip-172-40-20-6 systemd[1]: nvidia.service: Main process exited, code=exited, status=1/FAILURE
Feb 13 14:41:14 ip-172-40-20-6 systemd[1]: nvidia.service: Failed with result 'exit-code'.
Feb 13 14:41:14 ip-172-40-20-6 systemd[1]: Failed to start nvidia.service - NVIDIA Configure Service.
Feb 13 14:41:14 ip-172-40-20-6 systemd[1]: nvidia.service: Consumed 1min 12.881s CPU time.
$ nvidia-smi
-bash: /opt/bin/nvidia-smi: cannot execute binary file: Exec format error
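The `Exec format error` suggests the installed `nvidia-smi` was built for a different architecture. One way to confirm this without extra tooling is to read the `e_machine` field of the ELF header. This is a minimal sketch that assumes the file is a valid little-endian ELF binary; the helper name `elf_machine` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: report the target architecture of an ELF binary.
# e_machine is a 2-byte field at offset 18 of the ELF header; reading only
# its low byte is enough to tell x86_64 (62) from aarch64 (183) on a
# little-endian file. The ELF magic is not checked here.
elf_machine() {
  code="$(od -An -tu1 -j18 -N1 "$1" | tr -d ' \n')"
  case "$code" in
    62)  echo "x86_64" ;;
    183) echo "aarch64" ;;
    *)   echo "unknown ($code)" ;;
  esac
}
```

Running `elf_machine /opt/bin/nvidia-smi` on the instance and comparing the result with `uname -m` would show whether the binary matches the host.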
Expected behavior
GPU drivers are installed
Same result with the latest stable 4152.2.0:
Feb 13 18:39:22 ip-172-40-34-224 setup-nvidia[2596]: cp: cannot stat '/opt/nvidia/workdir/nvidia-workdir/NVIDIA-Linux-x86_64-535.216.01/install-mod/*.ko': No such file or directory
Hi @robszumski, sorry for the delay!
NVIDIA ARM64 support for AWS is now available on all channels (Alpha, Beta and Stable). Feel free to give it a try and provide feedback.
Sorry again for the delay.
This is done and available on all main channels, so I will go ahead and close this issue.