GPU driver installation fails on 2.2.52-debian12

Open zafercavdar opened this issue 10 months ago • 2 comments

I receive the following error:

nvidia-smi not installed
/etc/google-dataproc/startup-scripts/dataproc-initialization-script-0: line 293: is_cudnn8: command not found
Files removed: 308 (2317.2 MB)
Writing to /root/.config/pip/pip.conf
gpg: keybox '/usr/share/keyrings/adoptium.gpg' created
gpg: directory '/root/.gnupg' created
gpg: /root/.gnupg/trustdb.gpg: trustdb created
gpg: key 843C48A565F8F04B: public key "Adoptium GPG Key (DEB/RPM Signing Key) <[email protected]>" imported
gpg: Total number processed: 1
gpg:               imported: 1
gpg: key C0BA5CE6DC6315A3: public key "Artifact Registry Repository Signer <[email protected]>" imported
gpg: Total number processed: 1
gpg:               imported: 1
gpg: keybox '/usr/share/keyrings/docker-keyring.gpg' created
gpg: key 8D81803C0EBFCD88: public key "Docker Release (CE deb) <[email protected]>" imported
gpg: Total number processed: 1
gpg:               imported: 1
/etc/apt/sources.list.d/google-cloud.list
gpg: keybox '/usr/share/keyrings/cloud.google.gpg' created
gpg: key C0BA5CE6DC6315A3: public key "Artifact Registry Repository Signer <[email protected]>" imported
gpg: Total number processed: 1
gpg:               imported: 1
Reading package lists...
Building dependency tree...
Reading state information...
0 upgraded, 0 newly installed, 0 to remove and 31 not upgraded.
Canceled hold on systemd.
Canceled hold on libsystemd0.

real	0m8.038s
user	0m4.309s
sys	0m1.458s
nvidia-smi not installed
acl:
- entity: project-owners-****************
  projectTeam:
    projectNumber: '****************'
    team: owners
  role: OWNER
- entity: project-editors-****************
  projectTeam:
    projectNumber: '****************'
    team: editors
  role: OWNER
- entity: project-viewers-****************
  projectTeam:
    projectNumber: '****************'
    team: viewers
  role: READER
- email: ****************[email protected]
  entity: user-****************[email protected]
  role: OWNER
bucket: dataproc-temp-europe-west1-****************-jupf0b8s
component_count: 7
content_type: application/octet-stream
crc32c_hash: pOhoiw==
creation_time: 2025-04-22T07:38:19+0000
etag: CPzuiYyR64wDEAE=
generation: '1745307499132796'
metageneration: 1
name: dpgce-packages/nvidia/NVIDIA-Linux-x86_64-550.142.run
size: 307296728
storage_class: STANDARD
storage_class_update_time: 2025-04-22T07:38:19+0000
storage_url: gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/NVIDIA-Linux-x86_64-550.142.run#1745307499132796
update_time: 2025-04-22T07:38:19+0000
Copying gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/NVIDIA-Linux-x86_64-550.142.run to file:///mnt/shm/userspace.run
  
....

Average throughput: 691.9MiB/s

real	0m2.103s
user	0m2.386s
sys	0m1.030s

real	0m20.568s
user	0m5.847s
sys	0m5.105s
/opt/install-dpgce /
acl:
- entity: project-owners-****************
  projectTeam:
    projectNumber: '****************'
    team: owners
  role: OWNER
- entity: project-editors-****************
  projectTeam:
    projectNumber: '****************'
    team: editors
  role: OWNER
- entity: project-viewers-****************
  projectTeam:
    projectNumber: '****************'
    team: viewers
  role: READER
- email: ****************[email protected]
  entity: user-****************[email protected]
  role: OWNER
bucket: dataproc-temp-europe-west1-****************-jupf0b8s
content_type: application/x-tar
crc32c_hash: hUkg3A==
creation_time: 2025-04-22T07:40:18+0000
etag: CKHpisWR64wDEAE=
generation: '1745307618686113'
md5_hash: u5lHOXdDD2qH/CYP1wGVxw==
metageneration: 1
name: dpgce-packages/nvidia/kmod/debian12/6.1.0-32-cloud-amd64/unsigned/kmod_debian12_550.142.tar.gz
size: 25508565
storage_class: STANDARD
storage_class_update_time: 2025-04-22T07:40:18+0000
storage_url: gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/kmod/debian12/6.1.0-32-cloud-amd64/unsigned/kmod_debian12_550.142.tar.gz#1745307618686113
update_time: 2025-04-22T07:40:18+0000
cache hit
opt/install-dpgce/open-gpu-kernel-modules/kernel-open/build.log
opt/install-dpgce/open-gpu-kernel-modules/kernel-open/build_error.log
lib/modules/6.1.0-32-cloud-amd64/kernel/drivers/video/nvidia-uvm.ko
lib/modules/6.1.0-32-cloud-amd64/kernel/drivers/video/nvidia-drm.ko
lib/modules/6.1.0-32-cloud-amd64/kernel/drivers/video/nvidia-peermem.ko
lib/modules/6.1.0-32-cloud-amd64/kernel/drivers/video/nvidia-modeset.ko
lib/modules/6.1.0-32-cloud-amd64/kernel/drivers/video/nvidia.ko
/
NVIDIA GPU driver provided by NVIDIA was installed successfully
acl:
- entity: project-owners-****************
  projectTeam:
    projectNumber: '****************'
    team: owners
  role: OWNER
- entity: project-editors-****************
  projectTeam:
    projectNumber: '****************'
    team: editors
  role: OWNER
- entity: project-viewers-****************
  projectTeam:
    projectNumber: '****************'
    team: viewers
  role: READER
- email: ****************[email protected]
  entity: user-****************[email protected]
  role: OWNER
bucket: dataproc-temp-europe-west1-****************-jupf0b8s
component_count: 32
content_type: application/octet-stream
crc32c_hash: ROiILQ==
creation_time: 2025-04-22T07:54:44+0000
etag: CL+w9OGU64wDEAE=
generation: '1745308484442175'
metageneration: 1
name: dpgce-packages/nvidia/cuda_12.6.3_560.35.05_linux.run
size: 4446722669
storage_class: STANDARD
storage_class_update_time: 2025-04-22T07:54:44+0000
storage_url: gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/cuda_12.6.3_560.35.05_linux.run#1745308484442175
update_time: 2025-04-22T07:54:44+0000
Copying gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/cuda_12.6.3_560.35.05_linux.run to file:///mnt/shm/cuda.run
  
.................

Average throughput: 1.3GiB/s

real	0m4.643s
user	0m16.077s
sys	0m20.338s

real	2m39.479s
user	2m19.921s
sys	0m48.780s
Selecting previously unselected package cuda-keyring.
(Reading database ... 166259 files and directories currently installed.)
Preparing to unpack /mnt/shm/cuda-keyring.deb ...
Unpacking cuda-keyring (1.1-1) ...
Setting up cuda-keyring (1.1-1) ...
unable to rmmod nvidia_uvm
unable to rmmod nvidia_drm
unable to rmmod nvidia_modeset
unable to rmmod nvidia
/opt/install-dpgce /
ERROR: (gcloud.storage.objects.describe) gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages%2Fnvidia%2Fnccl%2Fdebian12%2Fnccl-build_debian12_2.23.4-1%2Bcuda12.6.tar.gz not found: 404.
Copying file:///opt/install-dpgce/nccl-build_debian12_2.23.4-1+cuda12.6.tar.gz.building to gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/nccl/debian12/nccl-build_debian12_2.23.4-1+cuda12.6.tar.gz.building
  

/opt/install-dpgce/nccl /opt/install-dpgce /

real	0m57.433s
user	0m14.465s
sys	0m5.950s

The script reports that it cannot find gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages%2Fnvidia%2Fnccl%2Fdebian12%2Fnccl-build_debian12_2.23.4-1%2Bcuda12.6.tar.gz, but that file exists in the bucket.
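The object name in the 404 message is percent-encoded (`%2F` for `/`, `%2B` for `+`), which makes it hard to compare by eye with what is in the bucket. A minimal bash sketch for decoding it follows. The helper names `urldecode` and `object_exists` are made up for illustration and are not part of the init action; `object_exists` assumes an installed, authenticated gcloud CLI and is only defined here, not called.

```shell
#!/usr/bin/env bash
# Sketch: decode the percent-encoded object name from the 404 message.
# urldecode and object_exists are hypothetical helpers, not part of the init action.

# Turn each %HH into \xHH and let printf's %b expand the escapes (bash only).
urldecode() {
  printf '%b\n' "${1//%/\\x}"
}

# Returns 0 if the given gs:// URL matches an existing object.
# Assumes an authenticated gcloud CLI; defined for reference, not called below.
object_exists() {
  gcloud storage ls "$1" >/dev/null 2>&1
}

# Object name exactly as it appears in the error message.
encoded='dpgce-packages%2Fnvidia%2Fnccl%2Fdebian12%2Fnccl-build_debian12_2.23.4-1%2Bcuda12.6.tar.gz'

urldecode "$encoded"
```

The decoded name is the same path that appears in the following "Copying" log line, minus the `.building` suffix, so it is worth checking whether the object in the bucket is the final `.tar.gz` or only the `.building` one.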

zafercavdar, Apr 22 '25 14:04

Strange. I built from scratch on 2.2 and 2.3 very recently.

If you can increase the size of your VM to 32 cores, use a single-instance cluster, and create a new cluster using the init action, it should be able to regenerate the NCCL build and write it to the cache location in ~15 minutes.
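As a rough sketch of that suggestion, the script below only prints a `gcloud dataproc clusters create` command for a single-node, 32-vCPU cluster; it does not run it. Everything not stated in the thread is an assumption: the cluster name, the `n1-standard-32` machine type, the T4 accelerator, the 30-minute init-action timeout (chosen to leave headroom over the ~15 minute estimate), and the `gs://YOUR_BUCKET/...` path where the init action is presumed to be staged.

```shell
#!/usr/bin/env bash
# Sketch only: prints a cluster-create command; it does not call gcloud.
# All defaults below are illustrative placeholders, not values from the thread,
# except the region and image version, which come from the issue itself.

CLUSTER="${CLUSTER:-gpu-nccl-rebuild}"                      # hypothetical name
REGION="${REGION:-europe-west1}"                            # region seen in the log
IMAGE_VERSION="${IMAGE_VERSION:-2.2.52-debian12}"           # from the issue title
MACHINE_TYPE="${MACHINE_TYPE:-n1-standard-32}"              # assumed 32-vCPU type
ACCELERATOR="${ACCELERATOR:-type=nvidia-tesla-t4,count=1}"  # assumed GPU
INIT_ACTION="${INIT_ACTION:-gs://YOUR_BUCKET/gpu/install_gpu_driver.sh}"  # placeholder

build_create_cmd() {
  local cmd=(
    gcloud dataproc clusters create "$CLUSTER"
    --region="$REGION"
    --single-node
    --image-version="$IMAGE_VERSION"
    --master-machine-type="$MACHINE_TYPE"
    --master-accelerator="$ACCELERATOR"
    --initialization-actions="$INIT_ACTION"
    --initialization-action-timeout=30m
  )
  echo "${cmd[*]}"
}

# Print the command for review; copy and run it once the placeholders are filled in.
build_create_cmd
```

Override any default through the environment, for example `INIT_ACTION=gs://my-bucket/gpu/install_gpu_driver.sh ./print-create-cmd.sh`, where the script file name is also just an example.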

I'm sorry that I'm not able to reproduce your issue.

I have a new version of the script coming out shortly. I'll create a pull request so that you can try that version:

https://github.com/GoogleCloudDataproc/initialization-actions/pull/1320

cjac, May 04 '25 05:05

@zafercavdar - have you been able to confirm that the script from #1320 resolves your issue?

Here is a direct link to that file:

https://github.com/GoogleCloudDataproc/initialization-actions/raw/6d00e017e6c46a91d646cb3ea32c78925a3f7474/gpu/install_gpu_driver.sh

cjac, May 09 '25 06:05