GPU driver installer fails on the 2.1 image with CUDA 11.5
export ACCELERATOR_TYPE="nvidia-tesla-t4"
export CUDA_VERSION=11.5
export MACHINE_TYPE=n1-standard-1
export IMAGE_VERSION=2.1
date
time gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--zone ${ZONE} \
--subnet ${SUBNET} \
--no-address \
--service-account=${GSA} \
--master-machine-type ${MACHINE_TYPE} \
--worker-machine-type ${MACHINE_TYPE} \
--master-boot-disk-type pd-standard \
--master-boot-disk-size 1024 \
--image-version ${IMAGE_VERSION} \
--tags=${TAGS} \
--bucket ${BUCKET} \
--initialization-action-timeout=15m \
--max-idle=${IDLE_TIMEOUT} \
--enable-component-gateway \
--metadata include-gpus=true \
--worker-accelerator type=${ACCELERATOR_TYPE} \
--master-accelerator type=${ACCELERATOR_TYPE} \
--metadata gpu-driver-provider=NVIDIA \
--initialization-actions ${INIT_ACTIONS_ROOT}/gpu/install_gpu_driver.sh \
--metadata init-actions-repo=${INIT_ACTIONS_ROOT} \
--metadata install-gpu-agent=true \
--metadata cuda-version=${CUDA_VERSION} \
--scopes 'https://www.googleapis.com/auth/cloud-platform'
date
Cluster creation fails with:
Waiting for cluster creation operation...done.
ERROR: (gcloud.dataproc.clusters.create) Operation [projects/cjac-2021-00/regions/us-central1/operations/bc05b7c9-4ab6-32f9-afb9-d39e2415fe52] failed: Multiple Errors:
- Initialization action failed. Failed action 'gs://cjac-docker-on-yarn/dataproc-initialization-actions/gpu/install_gpu_driver.sh', see output in: gs://cjac-docker-on-yarn/google-cloud-dataproc-metainfo/08184442-2a2e-4898-b1d5-b1b942e50879/cluster-1668020639-m/dataproc-initialization-script-1_output
- Initialization action failed. Failed action 'gs://cjac-docker-on-yarn/dataproc-initialization-actions/gpu/install_gpu_driver.sh', see output in: gs://cjac-docker-on-yarn/google-cloud-dataproc-metainfo/08184442-2a2e-4898-b1d5-b1b942e50879/cluster-1668020639-w-0/dataproc-initialization-script-1_output
- Initialization action failed. Failed action 'gs://cjac-docker-on-yarn/dataproc-initialization-actions/gpu/install_gpu_driver.sh', see output in: gs://cjac-docker-on-yarn/google-cloud-dataproc-metainfo/08184442-2a2e-4898-b1d5-b1b942e50879/cluster-1668020639-w-1/dataproc-initialization-script-1_output.
real 3m55.995s
user 0m1.388s
sys 0m0.117s
+ date
Wed May 24 04:10:14 PM PDT 2023
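Each failure line points at a per-node output file in GCS. The following is a minimal sketch for pulling those paths out of a saved copy of the error text so each log can be read in turn. The helper name `init_action_logs` and the file name `create.err` are hypothetical, the pattern assumes the "see output in: gs://..." wording shown above, and the commented usage assumes `gsutil` is installed and authorized for the bucket.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read gcloud's error text on stdin and print the
# unique init-action output URIs, one per line.
# Assumes the "see output in: gs://..." wording from the error above.
init_action_logs() {
  grep -o 'see output in: gs://[^ ]*' \
    | sed -e 's/^see output in: //' -e 's/\.$//' \
    | sort -u
}

# Usage (assumes gcloud's stderr was saved to create.err, a hypothetical file):
#   init_action_logs < create.err | while read -r uri; do
#     echo "== ${uri}"
#     gsutil cat "${uri}" | tail -n 50
#   done
```

The second `sed` expression strips the trailing period that gcloud appends to the last error line, which would otherwise become part of the object name.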
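The driver installation fails because the kernel module cannot be loaded. Secure Boot is enabled and the module does not appear to be signed with a key the kernel trusts: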
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 495.29.05...
ERROR: The kernel module failed to load. Secure boot is enabled on this system, so this is likely because it was not signed by a key that is trusted by the kernel. Please try installing the driver again, and sign the kernel module when prompted to do so.
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
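To confirm on a node that the failure is the Secure Boot case rather than one of the other causes listed in the second error, a rough check along these lines may help. The function name `secure_boot_load_failure` is hypothetical, and it assumes the message is written to the installer log in the same words as the console output above (whitespace is squeezed first in case the log wraps lines). The commented `mokutil --sb-state` call assumes `mokutil` is installed on the node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if the given log contains the
# Secure Boot load failure quoted above, fail otherwise.
# Defaults to the installer log path named in the error output.
secure_boot_load_failure() {
  local log="${1:-/var/log/nvidia-installer.log}"
  # Collapse newlines and runs of whitespace so a wrapped message still matches.
  tr -s '[:space:]' ' ' < "${log}" | grep -qi 'secure boot is enabled'
}

# Usage on an affected node:
#   secure_boot_load_failure && echo "kernel module rejected under Secure Boot"
#   mokutil --sb-state   # assumes mokutil is installed; reports the Secure Boot state
```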