GPU driver installer fails on 2.1 image with CUDA 11.5

Status: Open. cjac opened this issue 1 year ago • 4 comments

export ACCELERATOR_TYPE="nvidia-tesla-t4"
export CUDA_VERSION=11.5
export MACHINE_TYPE=n1-standard-1
export IMAGE_VERSION=2.1


  date
  time gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --zone ${ZONE} \
    --subnet ${SUBNET} \
    --no-address \
    --service-account=${GSA} \
    --master-machine-type ${MACHINE_TYPE} \
    --worker-machine-type ${MACHINE_TYPE} \
    --master-boot-disk-type pd-standard \
    --master-boot-disk-size 1024 \
    --image-version ${IMAGE_VERSION} \
    --tags=${TAGS} \
    --bucket ${BUCKET} \
    --initialization-action-timeout=15m \
    --max-idle=${IDLE_TIMEOUT} \
    --enable-component-gateway \
    --metadata include-gpus=true \
    --worker-accelerator type=${ACCELERATOR_TYPE} \
    --master-accelerator type=${ACCELERATOR_TYPE} \
    --metadata gpu-driver-provider=NVIDIA \
    --initialization-actions ${INIT_ACTIONS_ROOT}/gpu/install_gpu_driver.sh \
    --metadata init-actions-repo=${INIT_ACTIONS_ROOT} \
    --metadata install-gpu-agent=true \
    --metadata cuda-version=${CUDA_VERSION} \
    --scopes 'https://www.googleapis.com/auth/cloud-platform'
  date
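The command above depends on several shell variables that are not defined in this report (CLUSTER_NAME, REGION, ZONE and others). When reproducing, a preflight check along these lines may help rule out an empty variable as the cause. This is only a sketch: `require_vars` is a hypothetical helper, not part of the initialization-actions repository.

```shell
# Hypothetical preflight helper (not part of the original report): prints the
# name of every listed variable that is unset or empty, and returns non-zero
# if any are missing.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Variables referenced by the gcloud command above.
if ! require_vars CLUSTER_NAME REGION ZONE SUBNET GSA TAGS BUCKET \
    IDLE_TIMEOUT INIT_ACTIONS_ROOT ACCELERATOR_TYPE CUDA_VERSION \
    MACHINE_TYPE IMAGE_VERSION; then
  echo "set the variables listed above before creating the cluster" >&2
fi
```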

Cluster creation fails with the following error:

Waiting for cluster creation operation...done.
ERROR: (gcloud.dataproc.clusters.create) Operation [projects/cjac-2021-00/regions/us-central1/operations/bc05b7c9-4ab6-32f9-afb9-d39e2415fe52] failed: Multiple Errors:
 - Initialization action failed. Failed action 'gs://cjac-docker-on-yarn/dataproc-initialization-actions/gpu/install_gpu_driver.sh', see output in: gs://cjac-docker-on-yarn/google-cloud-dataproc-metainfo/08184442-2a2e-4898-b1d5-b1b942e50879/cluster-1668020639-m/dataproc-initialization-script-1_output
 - Initialization action failed. Failed action 'gs://cjac-docker-on-yarn/dataproc-initialization-actions/gpu/install_gpu_driver.sh', see output in: gs://cjac-docker-on-yarn/google-cloud-dataproc-metainfo/08184442-2a2e-4898-b1d5-b1b942e50879/cluster-1668020639-w-0/dataproc-initialization-script-1_output
 - Initialization action failed. Failed action 'gs://cjac-docker-on-yarn/dataproc-initialization-actions/gpu/install_gpu_driver.sh', see output in: gs://cjac-docker-on-yarn/google-cloud-dataproc-metainfo/08184442-2a2e-4898-b1d5-b1b942e50879/cluster-1668020639-w-1/dataproc-initialization-script-1_output.

real    3m55.995s
user    0m1.388s
sys     0m0.117s
+ date
Wed May 24 04:10:14 PM PDT 2023
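Each failure line above names a per-node output object after "see output in:". A small filter like the following could pull those paths out of saved gcloud output so they can be read with `gsutil cat`. This is a sketch under assumptions: `extract_output_uris` and the file name `create.log` are hypothetical, and reading the objects requires access to the bucket.

```shell
# Hypothetical helper: read gcloud error text on stdin and print one
# gs:// URI per line for every "see output in:" reference. The pattern
# stops at "_output", so the trailing period on the last line is dropped.
extract_output_uris() {
  sed -n 's|.*see output in: \(gs://[^ ]*_output\).*|\1|p'
}

# Example usage (assumes the gcloud stderr was saved to create.log):
#   extract_output_uris < create.log | while read -r uri; do
#     echo "== $uri"
#     gsutil cat "$uri" | tail -n 50
#   done
```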

The kernel driver installation fails because the kernel module is unsigned and Secure Boot rejects it:

Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 495.29.05...

ERROR: The kernel module failed to load. Secure boot is enabled on this system, so this is likely because it was not signed by a key that is trusted by the kernel. Please try installing the driver again, and sign the kernel module when prompted to do so.


ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.


ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
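To confirm the Secure Boot diagnosis on an affected node, something like the following could be run over SSH. It is a sketch, assuming `mokutil` is installed on the image; `sb_state` is a hypothetical helper, and the log section name is taken from the installer message above.

```shell
# Hypothetical helper: map the text printed by `mokutil --sb-state`
# to a one-word state.
sb_state() {
  case "$1" in
    *"SecureBoot enabled"*) echo enabled ;;
    *"SecureBoot disabled"*) echo disabled ;;
    *) echo unknown ;;
  esac
}

# Report the Secure Boot state if mokutil is available on this node.
if command -v mokutil >/dev/null 2>&1; then
  echo "Secure Boot: $(sb_state "$(mokutil --sb-state 2>&1)")"
fi

# Show the section of the installer log that the error message points to.
if [ -r /var/log/nvidia-installer.log ]; then
  grep -n -A 20 'Kernel module load error' /var/log/nvidia-installer.log || true
fi
```

If the state is reported as enabled, that matches the installer's message that the module was not signed by a key the kernel trusts.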

cjac, May 24 '23 23:05