GPU driver installation fails on 2.2.52-debian12
I receive the following error:
nvidia-smi not installed
/etc/google-dataproc/startup-scripts/dataproc-initialization-script-0: line 293: is_cudnn8: command not found
Files removed: 308 (2317.2 MB)
Writing to /root/.config/pip/pip.conf
gpg: keybox '/usr/share/keyrings/adoptium.gpg' created
gpg: directory '/root/.gnupg' created
gpg: /root/.gnupg/trustdb.gpg: trustdb created
gpg: key 843C48A565F8F04B: public key "Adoptium GPG Key (DEB/RPM Signing Key) <[email protected]>" imported
gpg: Total number processed: 1
gpg: imported: 1
gpg: key C0BA5CE6DC6315A3: public key "Artifact Registry Repository Signer <[email protected]>" imported
gpg: Total number processed: 1
gpg: imported: 1
gpg: keybox '/usr/share/keyrings/docker-keyring.gpg' created
gpg: key 8D81803C0EBFCD88: public key "Docker Release (CE deb) <[email protected]>" imported
gpg: Total number processed: 1
gpg: imported: 1
/etc/apt/sources.list.d/google-cloud.list
gpg: keybox '/usr/share/keyrings/cloud.google.gpg' created
gpg: key C0BA5CE6DC6315A3: public key "Artifact Registry Repository Signer <[email protected]>" imported
gpg: Total number processed: 1
gpg: imported: 1
Reading package lists...
Building dependency tree...
Reading state information...
0 upgraded, 0 newly installed, 0 to remove and 31 not upgraded.
Canceled hold on systemd.
Canceled hold on libsystemd0.
real 0m8.038s
user 0m4.309s
sys 0m1.458s
nvidia-smi not installed
acl:
- entity: project-owners-****************
projectTeam:
projectNumber: '****************'
team: owners
role: OWNER
- entity: project-editors-****************
projectTeam:
projectNumber: '****************'
team: editors
role: OWNER
- entity: project-viewers-****************
projectTeam:
projectNumber: '****************'
team: viewers
role: READER
- email: ****************[email protected]
entity: user-****************[email protected]
role: OWNER
bucket: dataproc-temp-europe-west1-****************-jupf0b8s
component_count: 7
content_type: application/octet-stream
crc32c_hash: pOhoiw==
creation_time: 2025-04-22T07:38:19+0000
etag: CPzuiYyR64wDEAE=
generation: '1745307499132796'
metageneration: 1
name: dpgce-packages/nvidia/NVIDIA-Linux-x86_64-550.142.run
size: 307296728
storage_class: STANDARD
storage_class_update_time: 2025-04-22T07:38:19+0000
storage_url: gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/NVIDIA-Linux-x86_64-550.142.run#1745307499132796
update_time: 2025-04-22T07:38:19+0000
Copying gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/NVIDIA-Linux-x86_64-550.142.run to file:///mnt/shm/userspace.run
....
Average throughput: 691.9MiB/s
real 0m2.103s
user 0m2.386s
sys 0m1.030s
real 0m20.568s
user 0m5.847s
sys 0m5.105s
/opt/install-dpgce /
acl:
- entity: project-owners-****************
projectTeam:
projectNumber: '****************'
team: owners
role: OWNER
- entity: project-editors-****************
projectTeam:
projectNumber: '****************'
team: editors
role: OWNER
- entity: project-viewers-****************
projectTeam:
projectNumber: '****************'
team: viewers
role: READER
- email: ****************[email protected]
entity: user-****************[email protected]
role: OWNER
bucket: dataproc-temp-europe-west1-****************-jupf0b8s
content_type: application/x-tar
crc32c_hash: hUkg3A==
creation_time: 2025-04-22T07:40:18+0000
etag: CKHpisWR64wDEAE=
generation: '1745307618686113'
md5_hash: u5lHOXdDD2qH/CYP1wGVxw==
metageneration: 1
name: dpgce-packages/nvidia/kmod/debian12/6.1.0-32-cloud-amd64/unsigned/kmod_debian12_550.142.tar.gz
size: 25508565
storage_class: STANDARD
storage_class_update_time: 2025-04-22T07:40:18+0000
storage_url: gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/kmod/debian12/6.1.0-32-cloud-amd64/unsigned/kmod_debian12_550.142.tar.gz#1745307618686113
update_time: 2025-04-22T07:40:18+0000
cache hit
opt/install-dpgce/open-gpu-kernel-modules/kernel-open/build.log
opt/install-dpgce/open-gpu-kernel-modules/kernel-open/build_error.log
lib/modules/6.1.0-32-cloud-amd64/kernel/drivers/video/nvidia-uvm.ko
lib/modules/6.1.0-32-cloud-amd64/kernel/drivers/video/nvidia-drm.ko
lib/modules/6.1.0-32-cloud-amd64/kernel/drivers/video/nvidia-peermem.ko
lib/modules/6.1.0-32-cloud-amd64/kernel/drivers/video/nvidia-modeset.ko
lib/modules/6.1.0-32-cloud-amd64/kernel/drivers/video/nvidia.ko
/
NVIDIA GPU driver provided by NVIDIA was installed successfully
acl:
- entity: project-owners-****************
projectTeam:
projectNumber: '****************'
team: owners
role: OWNER
- entity: project-editors-****************
projectTeam:
projectNumber: '****************'
team: editors
role: OWNER
- entity: project-viewers-****************
projectTeam:
projectNumber: '****************'
team: viewers
role: READER
- email: ****************[email protected]
entity: user-****************[email protected]
role: OWNER
bucket: dataproc-temp-europe-west1-****************-jupf0b8s
component_count: 32
content_type: application/octet-stream
crc32c_hash: ROiILQ==
creation_time: 2025-04-22T07:54:44+0000
etag: CL+w9OGU64wDEAE=
generation: '1745308484442175'
metageneration: 1
name: dpgce-packages/nvidia/cuda_12.6.3_560.35.05_linux.run
size: 4446722669
storage_class: STANDARD
storage_class_update_time: 2025-04-22T07:54:44+0000
storage_url: gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/cuda_12.6.3_560.35.05_linux.run#1745308484442175
update_time: 2025-04-22T07:54:44+0000
Copying gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/cuda_12.6.3_560.35.05_linux.run to file:///mnt/shm/cuda.run
.................
Average throughput: 1.3GiB/s
real 0m4.643s
user 0m16.077s
sys 0m20.338s
real 2m39.479s
user 2m19.921s
sys 0m48.780s
Selecting previously unselected package cuda-keyring.
(Reading database ... 166259 files and directories currently installed.)
Preparing to unpack /mnt/shm/cuda-keyring.deb ...
Unpacking cuda-keyring (1.1-1) ...
Setting up cuda-keyring (1.1-1) ...
unable to rmmod nvidia_uvm
unable to rmmod nvidia_drm
unable to rmmod nvidia_modeset
unable to rmmod nvidia
/opt/install-dpgce /
ERROR: (gcloud.storage.objects.describe) gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages%2Fnvidia%2Fnccl%2Fdebian12%2Fnccl-build_debian12_2.23.4-1%2Bcuda12.6.tar.gz not found: 404.
Copying file:///opt/install-dpgce/nccl-build_debian12_2.23.4-1+cuda12.6.tar.gz.building to gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/nccl/debian12/nccl-build_debian12_2.23.4-1+cuda12.6.tar.gz.building
/opt/install-dpgce/nccl /opt/install-dpgce /
real 0m57.433s
user 0m14.465s
sys 0m5.950s
It cannot find the file gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages%2Fnvidia%2Fnccl%2Fdebian12%2Fnccl-build_debian12_2.23.4-1%2Bcuda12.6.tar.gz, even though that file exists in the bucket.
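The 404 message prints the object name percent-encoded (`%2F` for `/`, `%2B` for `+`), which can make it look like a different path from the one shown in the bucket browser. Below is a minimal sketch for decoding the name and listing what is actually stored under that prefix. `BUCKET` is a placeholder for the masked bucket name, and the helper names `urldecode` and `list_nccl_cache` are hypothetical, not part of the init action. The log above also shows a `.building` marker being uploaded, so the listing may show that marker rather than the finished tarball.

```shell
#!/usr/bin/env bash
# Decode a percent-encoded string (e.g. the object name printed in the 404).
urldecode() {
  # Turn each %HH into \xHH, then let printf %b expand the escapes.
  printf '%b\n' "${1//%/\\x}"
}

# List every object whose name starts with the decoded name, so that both
# the final tarball and any ".building" marker show up.
# BUCKET is a placeholder; set it to your own dataproc-temp bucket.
list_nccl_cache() {
  local object
  object="$(urldecode "$1")"
  gcloud storage ls "gs://${BUCKET}/${object}*"
}

encoded='dpgce-packages%2Fnvidia%2Fnccl%2Fdebian12%2Fnccl-build_debian12_2.23.4-1%2Bcuda12.6.tar.gz'
urldecode "${encoded}"
# To query the bucket: BUCKET=<your-temp-bucket> list_nccl_cache "${encoded}"
```

The decoded name is `dpgce-packages/nvidia/nccl/debian12/nccl-build_debian12_2.23.4-1+cuda12.6.tar.gz`, which can be compared character by character with the object in the bucket.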
Strange. I've built from scratch on 2.2 and 2.3 very recently.
If you can increase the size of your VM to 32 cores, use a single-instance cluster, and create a new cluster using the init action, it should be able to regenerate the NCCL build and write it to the cache location in ~15 minutes.
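As a rough sketch of that suggestion, the command below creates a single-node cluster with a 32-vCPU machine and the init action attached. The cluster name, the init action path, the `n1-standard-32` machine type, the T4 accelerator, and the 45-minute timeout are all assumptions chosen for illustration; the region is inferred from the bucket name in the log. The timeout is raised because the default initialization action timeout of 10 minutes would be shorter than the estimated build time.

```shell
#!/usr/bin/env bash
# All values below are illustrative placeholders, not taken from the thread.
CLUSTER_NAME="${CLUSTER_NAME:-gpu-cache-warmup}"
REGION="${REGION:-europe-west1}"
INIT_ACTION="${INIT_ACTION:-gs://your-bucket/gpu/install_gpu_driver.sh}"

# Single-node cluster with 32 vCPUs so the NCCL build can run in parallel.
CREATE_CMD=(
  gcloud dataproc clusters create "${CLUSTER_NAME}"
  --region "${REGION}"
  --image-version 2.2-debian12
  --single-node
  --master-machine-type n1-standard-32
  --master-accelerator type=nvidia-tesla-t4,count=1
  --initialization-actions "${INIT_ACTION}"
  --initialization-action-timeout 45m
)

# Print the command for review; set RUN=1 to actually execute it.
printf '%s\n' "${CREATE_CMD[*]}"
if [ "${RUN:-0}" = "1" ]; then
  "${CREATE_CMD[@]}"
fi
```

Once the build has been written to the cache location, later clusters that use the same bucket should get a cache hit, as the driver kmod step does in the log above.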
I'm sorry that I'm not able to reproduce your issue.
I have a new version of the script coming out shortly. I'll create a pull request so that you can try that version:
https://github.com/GoogleCloudDataproc/initialization-actions/pull/1320
@zafercavdar - have you been able to confirm that the script from #1320 resolves your issue?
Here is a direct link to that file:
https://github.com/GoogleCloudDataproc/initialization-actions/raw/6d00e017e6c46a91d646cb3ea32c78925a3f7474/gpu/install_gpu_driver.sh
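To try that revision, the script has to be reachable as a `gs://` path that the cluster can use as an initialization action. The sketch below downloads the pinned file and copies it into a bucket. The function name `stage_init_action` and the destination path are hypothetical; only the URL comes from the thread.

```shell
#!/usr/bin/env bash
# URL of the pinned script revision linked above.
SCRIPT_URL='https://github.com/GoogleCloudDataproc/initialization-actions/raw/6d00e017e6c46a91d646cb3ea32c78925a3f7474/gpu/install_gpu_driver.sh'

# Download the pinned script and copy it to a bucket path of your choosing.
# $1: destination, e.g. gs://your-bucket/gpu/install_gpu_driver.sh (placeholder)
stage_init_action() {
  local dest="$1"
  local tmp status
  tmp="$(mktemp)" || return 1
  # -f: fail on HTTP errors; -L: follow the redirect to the raw file content.
  curl -fsSL -o "${tmp}" "${SCRIPT_URL}" || { rm -f "${tmp}"; return 1; }
  gcloud storage cp "${tmp}" "${dest}"
  status=$?
  rm -f "${tmp}"
  return "${status}"
}

# Example: stage_init_action gs://your-bucket/gpu/install_gpu_driver.sh
```

The staged path can then be passed to `--initialization-actions` when creating the test cluster.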