
[GPU] Driver installation not working and Dataproc 2.2 cluster creation is failing

santhoshvly opened this issue on Sep 25 '24 · 29 comments

Hi,

I am trying to attach GPUs to a Dataproc 2.2 cluster, but it is breaking and cluster creation is failing. Secure Boot is disabled and I am using the latest install_gpu_driver.sh from this repository. I am now getting the following error during cluster initialization:

++ tr '[:upper:]' '[:lower:]' ++ lsb_release -is

  • OS_NAME=debian ++ . /etc/os-release +++ PRETTY_NAME='Debian GNU/Linux 12 (bookworm)' +++ NAME='Debian GNU/Linux' +++ VERSION_ID=12 +++ VERSION='12 (bookworm)' +++ VERSION_CODENAME=bookworm +++ ID=debian +++ HOME_URL=https://www.debian.org/ +++ SUPPORT_URL=https://www.debian.org/support +++ BUG_REPORT_URL=https://bugs.debian.org/ ++ echo debian12
  • distribution=debian12
  • readonly OS_NAME ++ get_metadata_attribute dataproc-role ++ local -r attribute_name=dataproc-role ++ local -r default_value= ++ /usr/share/google/get_metadata_value attributes/dataproc-role
  • ROLE=Worker
  • readonly ROLE
  • DRIVER_FOR_CUDA=(['11.8']='525.147.05' ['12.1']='530.30.02' ['12.4']='550.54.14' ['12.5']='555.42.06' ['12.6']='560.28.03')
  • readonly -A DRIVER_FOR_CUDA
  • CUDNN_FOR_CUDA=(['11.8']='8.6.0.163' ['12.1']='8.9.0' ['12.4']='9.1.0.70' ['12.5']='9.2.1.18')
  • readonly -A CUDNN_FOR_CUDA
  • NCCL_FOR_CUDA=(['11.8']='2.15.5' ['12.1']='2.17.1' ['12.4']='2.21.5' ['12.5']='2.22.3')
  • readonly -A NCCL_FOR_CUDA
  • CUDA_SUBVER=(['11.8']='11.8.0' ['12.1']='12.1.0' ['12.4']='12.4.1' ['12.5']='12.5.1')
  • readonly -A CUDA_SUBVER ++ get_metadata_attribute rapids-runtime SPARK ++ local -r attribute_name=rapids-runtime ++ local -r default_value=SPARK ++ /usr/share/google/get_metadata_value attributes/rapids-runtime ++ echo -n SPARK
  • RAPIDS_RUNTIME=SPARK
  • readonly DEFAULT_CUDA_VERSION=12.4
  • DEFAULT_CUDA_VERSION=12.4 ++ get_metadata_attribute cuda-version 12.4 ++ local -r attribute_name=cuda-version ++ local -r default_value=12.4 ++ /usr/share/google/get_metadata_value attributes/cuda-version ++ echo -n 12.4
  • readonly CUDA_VERSION=12.4
  • CUDA_VERSION=12.4
  • readonly CUDA_FULL_VERSION=12.4.1
  • CUDA_FULL_VERSION=12.4.1
  • readonly DEFAULT_DRIVER=550.54.14
  • DEFAULT_DRIVER=550.54.14 ++ get_metadata_attribute gpu-driver-version 550.54.14 ++ local -r attribute_name=gpu-driver-version ++ local -r default_value=550.54.14 ++ /usr/share/google/get_metadata_value attributes/gpu-driver-version ++ echo -n 550.54.14
  • DRIVER_VERSION=550.54.14
  • is_debian11
  • is_debian ++ os_id ++ xargs ++ cut -d= -f2 ++ grep '^ID=' /etc/os-release
  • [[ debian == \d\e\b\i\a\n ]] ++ os_version ++ xargs ++ cut -d= -f2 ++ grep '^VERSION_ID=' /etc/os-release
  • [[ 12 == \1\1* ]]
  • is_ubuntu22
  • is_ubuntu ++ os_id ++ xargs ++ cut -d= -f2 ++ grep '^ID=' /etc/os-release
  • [[ debian == \u\b\u\n\t\u ]]
  • is_ubuntu20
  • is_ubuntu ++ os_id ++ xargs ++ cut -d= -f2 ++ grep '^ID=' /etc/os-release
  • [[ debian == \u\b\u\n\t\u ]]
  • is_ubuntu20
  • is_ubuntu ++ os_id ++ xargs ++ cut -d= -f2 ++ grep '^ID=' /etc/os-release
  • [[ debian == \u\b\u\n\t\u ]]
  • readonly DRIVER_VERSION
  • readonly DRIVER=550
  • DRIVER=550
  • readonly DEFAULT_CUDNN_VERSION=9.1.0.70
  • DEFAULT_CUDNN_VERSION=9.1.0.70 ++ get_metadata_attribute cudnn-version 9.1.0.70 ++ local -r attribute_name=cudnn-version ++ local -r default_value=9.1.0.70 ++ /usr/share/google/get_metadata_value attributes/cudnn-version ++ echo -n 9.1.0.70
  • CUDNN_VERSION=9.1.0.70
  • is_rocky ++ os_id ++ grep '^ID=' /etc/os-release ++ xargs ++ cut -d= -f2
  • [[ debian == \r\o\c\k\y ]]
  • is_ubuntu20
  • is_ubuntu ++ os_id ++ grep '^ID=' /etc/os-release ++ cut -d= -f2 ++ xargs
  • [[ debian == \u\b\u\n\t\u ]]
  • is_ubuntu22
  • is_ubuntu ++ os_id ++ grep '^ID=' /etc/os-release ++ xargs ++ cut -d= -f2
  • [[ debian == \u\b\u\n\t\u ]]
  • is_debian12
  • is_debian ++ os_id ++ grep '^ID=' /etc/os-release ++ xargs ++ cut -d= -f2
  • [[ debian == \d\e\b\i\a\n ]] ++ os_version ++ grep '^VERSION_ID=' /etc/os-release ++ xargs ++ cut -d= -f2
  • [[ 12 == \1\2* ]]
  • is_cudnn8
  • [[ 9 == \8 ]]
  • is_ubuntu18
  • is_ubuntu ++ os_id ++ cut -d= -f2 ++ grep '^ID=' /etc/os-release ++ xargs
  • [[ debian == \u\b\u\n\t\u ]]
  • is_debian10
  • is_debian ++ os_id ++ cut -d= -f2 ++ grep '^ID=' /etc/os-release ++ xargs
  • [[ debian == \d\e\b\i\a\n ]] ++ os_version ++ cut -d= -f2 ++ grep '^VERSION_ID=' /etc/os-release ++ xargs
  • [[ 12 == \1\0* ]]
  • is_debian11
  • is_debian ++ os_id ++ cut -d= -f2 ++ grep '^ID=' /etc/os-release ++ xargs
  • [[ debian == \d\e\b\i\a\n ]] ++ os_version ++ grep '^VERSION_ID=' /etc/os-release ++ cut -d= -f2 ++ xargs
  • [[ 12 == \1\1* ]]
  • readonly CUDNN_VERSION
  • readonly DEFAULT_NCCL_VERSION=2.21.5
  • DEFAULT_NCCL_VERSION=2.21.5 ++ get_metadata_attribute nccl-version 2.21.5 ++ local -r attribute_name=nccl-version ++ local -r default_value=2.21.5 ++ /usr/share/google/get_metadata_value attributes/nccl-version ++ echo -n 2.21.5
  • readonly NCCL_VERSION=2.21.5
  • NCCL_VERSION=2.21.5
  • readonly DEFAULT_USERSPACE_URL=https://download.nvidia.com/XFree86/Linux-x86_64/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run
  • DEFAULT_USERSPACE_URL=https://download.nvidia.com/XFree86/Linux-x86_64/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run ++ get_metadata_attribute gpu-driver-url https://download.nvidia.com/XFree86/Linux-x86_64/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run ++ local -r attribute_name=gpu-driver-url ++ local -r default_value=https://download.nvidia.com/XFree86/Linux-x86_64/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run ++ /usr/share/google/get_metadata_value attributes/gpu-driver-url ++ echo -n https://download.nvidia.com/XFree86/Linux-x86_64/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run
  • readonly USERSPACE_URL=https://download.nvidia.com/XFree86/Linux-x86_64/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run
  • USERSPACE_URL=https://download.nvidia.com/XFree86/Linux-x86_64/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run
  • is_ubuntu22
  • is_ubuntu ++ os_id ++ grep '^ID=' /etc/os-release ++ xargs ++ cut -d= -f2
  • [[ debian == \u\b\u\n\t\u ]]
  • is_rocky9
  • is_rocky ++ os_id ++ grep '^ID=' /etc/os-release ++ xargs ++ cut -d= -f2
  • [[ debian == \r\o\c\k\y ]]
  • is_rocky ++ os_id ++ grep '^ID=' /etc/os-release ++ xargs ++ cut -d= -f2
  • [[ debian == \r\o\c\k\y ]] ++ os_id ++ grep '^ID=' /etc/os-release ++ xargs ++ cut -d= -f2 ++ os_vercat ++ is_ubuntu +++ os_id +++ grep '^ID=' /etc/os-release +++ xargs +++ cut -d= -f2 ++ [[ debian == \u\b\u\n\t\u ]] ++ is_rocky +++ os_id +++ xargs +++ cut -d= -f2 +++ grep '^ID=' /etc/os-release ++ [[ debian == \r\o\c\k\y ]] ++ os_version ++ xargs ++ cut -d= -f2 ++ grep '^VERSION_ID=' /etc/os-release
  • shortname=debian12
  • nccl_shortname=debian12
  • readonly NVIDIA_BASE_DL_URL=https://developer.download.nvidia.com/compute
  • NVIDIA_BASE_DL_URL=https://developer.download.nvidia.com/compute
  • readonly NVIDIA_REPO_URL=https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64
  • NVIDIA_REPO_URL=https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64
  • readonly DEFAULT_NCCL_REPO_URL=https://developer.download.nvidia.com/compute/machine-learning/repos/debian12/x86_64/nvidia-machine-learning-repo-debian12_1.0.0-1_amd64.deb
  • DEFAULT_NCCL_REPO_URL=https://developer.download.nvidia.com/compute/machine-learning/repos/debian12/x86_64/nvidia-machine-learning-repo-debian12_1.0.0-1_amd64.deb ++ get_metadata_attribute nccl-repo-url https://developer.download.nvidia.com/compute/machine-learning/repos/debian12/x86_64/nvidia-machine-learning-repo-debian12_1.0.0-1_amd64.deb ++ local -r attribute_name=nccl-repo-url ++ local -r default_value=https://developer.download.nvidia.com/compute/machine-learning/repos/debian12/x86_64/nvidia-machine-learning-repo-debian12_1.0.0-1_amd64.deb ++ /usr/share/google/get_metadata_value attributes/nccl-repo-url ++ echo -n https://developer.download.nvidia.com/compute/machine-learning/repos/debian12/x86_64/nvidia-machine-learning-repo-debian12_1.0.0-1_amd64.deb
  • NCCL_REPO_URL=https://developer.download.nvidia.com/compute/machine-learning/repos/debian12/x86_64/nvidia-machine-learning-repo-debian12_1.0.0-1_amd64.deb
  • readonly NCCL_REPO_URL
  • readonly NCCL_REPO_KEY=https://developer.download.nvidia.com/compute/machine-learning/repos/debian12/x86_64/7fa2af80.pub
  • NCCL_REPO_KEY=https://developer.download.nvidia.com/compute/machine-learning/repos/debian12/x86_64/7fa2af80.pub
  • DEFAULT_NVIDIA_CUDA_URLS=(['11.8']='https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run' ['12.1']='https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run' ['12.4']='https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run')
  • readonly -A DEFAULT_NVIDIA_CUDA_URLS
  • readonly DEFAULT_NVIDIA_CUDA_URL=https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
  • DEFAULT_NVIDIA_CUDA_URL=https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run ++ get_metadata_attribute cuda-url https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run ++ local -r attribute_name=cuda-url ++ local -r default_value=https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run ++ /usr/share/google/get_metadata_value attributes/cuda-url ++ echo -n https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
  • NVIDIA_CUDA_URL=https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
  • readonly NVIDIA_CUDA_URL
  • readonly NVIDIA_ROCKY_REPO_URL=https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-debian12.repo
  • NVIDIA_ROCKY_REPO_URL=https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-debian12.repo
  • CUDNN_TARBALL=cudnn-12.4-linux-x64-v9.1.0.70.tgz
  • CUDNN_TARBALL_URL=https://developer.download.nvidia.com/compute/redist/cudnn/v9.1.0/cudnn-12.4-linux-x64-v9.1.0.70.tgz
  • compare_versions_lte 8.3.1.22 9.1.0.70 ++ echo -e '8.3.1.22\n9.1.0.70' ++ head -n1 ++ sort -V
  • '[' 8.3.1.22 = 8.3.1.22 ']'
  • CUDNN_TARBALL=cudnn-linux-x86_64-9.1.0.70_cuda12-archive.tar.xz
  • compare_versions_lte 9.1.0.70 8.4.1.50 ++ echo -e '9.1.0.70\n8.4.1.50' ++ head -n1 ++ sort -V
  • '[' 9.1.0.70 = 8.4.1.50 ']'
  • CUDNN_TARBALL_URL=https://developer.download.nvidia.com/compute/redist/cudnn/v9.1.0/local_installers/12.4/cudnn-linux-x86_64-9.1.0.70_cuda12-archive.tar.xz
  • compare_versions_lte 12.0 12.4 ++ echo -e '12.0\n12.4' ++ head -n1 ++ sort -V
  • '[' 12.0 = 12.0 ']'
  • CUDNN_TARBALL_URL=https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.2.0.82_cuda12-archive.tar.xz
  • readonly CUDNN_TARBALL
  • readonly CUDNN_TARBALL_URL ++ get_metadata_attribute gpu-driver-provider NVIDIA ++ local -r attribute_name=gpu-driver-provider ++ local -r default_value=NVIDIA ++ /usr/share/google/get_metadata_value attributes/gpu-driver-provider ++ echo -n NVIDIA
  • GPU_DRIVER_PROVIDER=NVIDIA
  • readonly GPU_DRIVER_PROVIDER
  • readonly GPU_AGENT_REPO_URL=https://raw.githubusercontent.com/GoogleCloudPlatform/ml-on-gcp/master/dlvm/gcp-gpu-utilization-metrics
  • GPU_AGENT_REPO_URL=https://raw.githubusercontent.com/GoogleCloudPlatform/ml-on-gcp/master/dlvm/gcp-gpu-utilization-metrics ++ get_metadata_attribute install-gpu-agent false ++ local -r attribute_name=install-gpu-agent ++ local -r default_value=false ++ /usr/share/google/get_metadata_value attributes/install-gpu-agent ++ echo -n false
  • INSTALL_GPU_AGENT=false
  • readonly INSTALL_GPU_AGENT
  • readonly HADOOP_CONF_DIR=/etc/hadoop/conf
  • HADOOP_CONF_DIR=/etc/hadoop/conf
  • readonly HIVE_CONF_DIR=/etc/hive/conf
  • HIVE_CONF_DIR=/etc/hive/conf
  • readonly SPARK_CONF_DIR=/etc/spark/conf
  • SPARK_CONF_DIR=/etc/spark/conf
  • NVIDIA_SMI_PATH=/usr/bin
  • MIG_MAJOR_CAPS=0
  • IS_MIG_ENABLED=0
  • CUDA_KEYRING_PKG_INSTALLED=0
  • CUDA_LOCAL_REPO_INSTALLED=0
  • CUDNN_LOCAL_REPO_INSTALLED=0
  • CUDNN_PKG_NAME=
  • CUDNN8_LOCAL_REPO_INSTALLED=0
  • CUDNN8_PKG_NAME= ++ mktemp -u -d -p /run/tmp -t ca_dir-XXXX
  • CA_TMPDIR=/run/tmp/ca_dir-iIJ4 ++ get_metadata_attribute private_secret_name ++ local -r attribute_name=private_secret_name ++ local -r default_value= ++ /usr/share/google/get_metadata_value attributes/private_secret_name ++ echo -n ''
  • PSN=
  • readonly PSN ++ uname -r
  • readonly uname_r=6.1.0-25-cloud-amd64
  • uname_r=6.1.0-25-cloud-amd64
  • readonly bdcfg=/usr/local/bin/bdconfig
  • bdcfg=/usr/local/bin/bdconfig
  • nvsmi_works=0
  • main
  • is_debian ++ os_id ++ grep '^ID=' /etc/os-release ++ xargs ++ cut -d= -f2
  • [[ debian == \d\e\b\i\a\n ]]
  • remove_old_backports
  • is_debian12
  • is_debian ++ os_id ++ grep '^ID=' /etc/os-release ++ xargs ++ cut -d= -f2
  • [[ debian == \d\e\b\i\a\n ]] ++ os_version ++ grep '^VERSION_ID=' /etc/os-release ++ xargs ++ cut -d= -f2
  • [[ 12 == \1\2* ]]
  • return
  • is_debian ++ os_id ++ grep '^ID=' /etc/os-release ++ xargs ++ cut -d= -f2
  • [[ debian == \d\e\b\i\a\n ]]
  • export DEBIAN_FRONTEND=noninteractive
  • DEBIAN_FRONTEND=noninteractive
  • execute_with_retries 'apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64'
  • local -r 'cmd=apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64'
  • (( i = 0 ))
  • (( i < 3 ))
  • eval 'apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64' ++ apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64 E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
  • sleep 5
  • (( i++ ))
  • (( i < 3 ))
  • eval 'apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64' ++ apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64 E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
  • sleep 5
  • (( i++ ))
  • (( i < 3 ))
  • eval 'apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64' ++ apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64 E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
  • sleep 5
  • (( i++ ))
  • (( i < 3 ))
  • return 1
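The compare_versions_lte calls visible in the trace (e.g. 8.3.1.22 vs 9.1.0.70, used to pick the cuDNN tarball URL scheme) boil down to a `sort -V` comparison. A minimal sketch of that helper, reconstructed from the trace rather than copied from the script:

```shell
# Reconstructed sketch (not the script's exact code): compare_versions_lte A B
# succeeds when A <= B under version ordering, by asking sort -V which of the
# two strings sorts first.
compare_versions_lte() {
  [ "$1" = "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" ]
}

compare_versions_lte 8.3.1.22 9.1.0.70 && echo "8.3.1.22 <= 9.1.0.70"
compare_versions_lte 12.0 12.4       && echo "12.0 <= 12.4"
compare_versions_lte 9.1.0.70 8.4.1.50 || echo "9.1.0.70 > 8.4.1.50"
```

`sort -V` handles four-component versions like 9.1.0.70 correctly, which plain lexical comparison would not.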

Please let me know if I am missing anything, or whether there is any workaround to proceed further.

santhoshvly avatar Sep 25 '24 21:09 santhoshvly

This is due to enforcement of the apt-key add deprecation. The trust databases need to be separated into their own files and referenced by path in the sources.list entry for each repo. I have an implementation complete in my rapids branch; I could integrate it into master.

https://github.com/cjac/initialization-actions/blob/rapids-20240806/gpu/install_gpu_driver.sh#L1077
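A minimal sketch of that migration (the mysql repo name and key path are illustrative examples, not the branch's exact code): each repository's key moves out of the deprecated /etc/apt/trusted.gpg database into its own file under /usr/share/keyrings, and the repo line gains a signed-by option.

```shell
# Illustrative only: repo/key names are examples. On a real node the vendor
# key would first be dearmored into a dedicated keyring, e.g.:
#   curl -fsSL https://repo.example.com/KEY.gpg | gpg --dearmor \
#     -o /usr/share/keyrings/mysql.gpg
#
# Then the sources.list entry is rewritten so apt trusts that keyring alone:
line='deb https://repo.mysql.com/apt/debian/ bookworm mysql-8.0'
echo "$line" | sed -e 's:deb https:deb [signed-by=/usr/share/keyrings/mysql.gpg] https:g'
# -> deb [signed-by=/usr/share/keyrings/mysql.gpg] https://repo.mysql.com/apt/debian/ bookworm mysql-8.0
# followed by: apt-get update
```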

Santosh, did you say you've tried this workaround and that it's unblocked you?

cjac avatar Sep 26 '24 17:09 cjac

Please review and test #1240

cjac avatar Sep 26 '24 17:09 cjac

@cjac Yes, I tried the workaround script you mentioned, but it is still breaking with a similar error on Dataproc 2.2:

-----END PGP PUBLIC KEY BLOCK-----'

sed -i -e 's:deb https:deb [signed-by=/usr/share/keyrings/mysql.gpg] https:g' /etc/apt/sources.list.d/mysql.list
rm -rf /etc/apt/trusted.gpg
main
is_debian ++ os_id ++ cut -d= -f2 ++ grep '^ID=' /etc/os-release ++ xargs
[[ debian == \d\e\b\i\a\n ]]
remove_old_backports
is_debian12
is_debian ++ os_id ++ xargs ++ cut -d= -f2 ++ grep '^ID=' /etc/os-release
[[ debian == \d\e\b\i\a\n ]] ++ os_version ++ xargs ++ cut -d= -f2 ++ grep '^VERSION_ID=' /etc/os-release
[[ 12 == \1\2* ]]
return
is_debian ++ os_id ++ xargs ++ cut -d= -f2 ++ grep '^ID=' /etc/os-release
[[ debian == \d\e\b\i\a\n ]]
export DEBIAN_FRONTEND=noninteractive
DEBIAN_FRONTEND=noninteractive
execute_with_retries 'apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64'
local -r 'cmd=apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64'
(( i = 0 ))
(( i < 3 ))
eval 'apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64'
++ apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
sleep 5
(( i++ ))
(( i < 3 ))
eval 'apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64'
++ apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
sleep 5
(( i++ ))
(( i < 3 ))
eval 'apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64'
++ apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
sleep 5
(( i++ ))
(( i < 3 ))
return 1

santhoshvly avatar Sep 26 '24 17:09 santhoshvly

@cjac I have disabled Secure Boot in Dataproc. Is that okay, or should we enable it for this workaround?

santhoshvly avatar Sep 26 '24 17:09 santhoshvly

To use Secure Boot, you'll need to build a custom image. Instructions are here:

https://github.com/GoogleCloudDataproc/custom-images/tree/master/examples/secure-boot

You do not need Secure Boot enabled for the workaround to function. I think you may just be missing an apt-get update after the sources.list files are cleaned up and the trust keys are written to /usr/share/keyrings.

cjac avatar Sep 26 '24 18:09 cjac

The package cache update command is included in #1240 as commit 234515d674b73ce8f191184c950535975fc5acaf.

cjac avatar Sep 26 '24 18:09 cjac

@cjac I tried with that, but it is still breaking with the same error.

santhoshvly avatar Sep 26 '24 20:09 santhoshvly

I forgot that I'm pinned to 2.2.20-debian12.

I'll try to make it work with the latest from the 2.2 line.

cjac avatar Sep 26 '24 21:09 cjac

Okay, thank you. I am getting the error on 2.2.32-debian12.

santhoshvly avatar Sep 26 '24 21:09 santhoshvly

This might do it:

if is_debian ; then
  clean_up_sources_lists
  apt-get update
  export DEBIAN_FRONTEND="noninteractive"
  echo "Begin full upgrade"
  date
  apt-get --yes -qq -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" full-upgrade

  date
  echo "End full upgrade"
  pkgs="$(apt-get -y full-upgrade 2>&1 | grep -A9 'The following packages have been kept back:' | grep '^ ')"
  apt-get install -y --allow-change-held-packages -qq ${pkgs}
fi

cjac avatar Sep 26 '24 22:09 cjac

Yes, that last iteration does seem to get the installer working for me on the latest 2.2.

cjac avatar Sep 26 '24 22:09 cjac

@cjac Thank you. I tried the above changes but cluster creation still failed. It didn't give the previous package installation error, and the init script logs look good; the last few lines of the install_gpu_driver.sh output are below:

update-alternatives: using /usr/lib/mesa-diverted to provide /usr/lib/glx (glx) in auto mode
Processing triggers for initramfs-tools (0.142+deb12u1) ...
update-initramfs: Generating /boot/initrd.img-6.1.0-25-cloud-amd64
Processing triggers for libc-bin (2.36-9+deb12u8) ...
Processing triggers for man-db (2.11.2-2) ...
Processing triggers for glx-alternative-mesa (1.2.2) ...
update-alternatives: updating alternative /usr/lib/mesa-diverted because link group glx has changed slave links
Setting up glx-alternative-nvidia (1.2.2) ...
Processing triggers for glx-alternative-nvidia (1.2.2) ...
Setting up nvidia-alternative (550.54.14-1) ...
Processing triggers for nvidia-alternative (550.54.14-1) ...
update-alternatives: using /usr/lib/nvidia/current to provide /usr/lib/nvidia/nvidia (nvidia) in auto mode
Setting up nvidia-kernel-support (550.54.14-1) ...
Setting up libnvidia-ml1:amd64 (550.54.14-1) ...
Setting up nvidia-smi (550.54.14-1) ...
Processing triggers for nvidia-alternative (550.54.14-1) ...
update-alternatives: updating alternative /usr/lib/nvidia/current because link group nvidia has changed slave links
Setting up nvidia-kernel-open-dkms (550.54.14-1) ...
Loading new nvidia-current-550.54.14 DKMS files...
Building for 6.1.0-25-cloud-amd64
Building initial module for 6.1.0-25-cloud-amd64

I am seeing the following error in the Dataproc logs:

2024-09-27T02:58:49 Setting up xserver-xorg-video-nvidia (560.35.03-1) ...
2024-09-27T02:58:49 Redundant argument in sprintf at /usr/share/perl5/Debconf/Element/Noninteractive/Error.pm line 54, <GEN0> line 9.

Configuring xserver-xorg-video-nvidia
-------------------------------------

Mismatching nvidia kernel module loaded

The NVIDIA driver that is being installed (version 560.35.03) does not
match the nvidia kernel module currently loaded (version for).

The X server, OpenGL, and GPGPU applications may not work properly.

The easiest way to fix this is to reboot the machine once the installation
has finished. You can also stop the X server (usually by stopping the login
manager, e.g. gdm3, sddm, or xdm), manually unload the module ("modprobe -r
nvidia"), and restart the X server.

I think this error caused the cluster creation failure.
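A hypothetical helper (names invented here for illustration; not part of the init action) makes the failure mode concrete: the debconf warning fires when the userspace driver being configured does not match the nvidia kernel module already loaded.

```shell
# Sketch only: check_driver_match is an invented name, not from the script.
check_driver_match() {
  loaded="$1"; installed="$2"
  if [ "${loaded}" != "${installed}" ]; then
    # A reboot or 'modprobe -r nvidia' is needed before nvidia-smi will work.
    echo "mismatch: loaded=${loaded} installed=${installed}"
    return 1
  fi
  echo "ok: ${loaded}"
}

# Values from the log above: 550.54.14 was loaded while 560.35.03 was set up.
check_driver_match 550.54.14 560.35.03 || true
# On a live node the two inputs would come from:
#   /proc/driver/nvidia/version   (loaded kernel module)
#   modinfo -F version nvidia     (installed module on disk)
```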

santhoshvly avatar Sep 27 '24 14:09 santhoshvly

@cjac We have been unable to create a Dataproc GPU cluster since the Dataproc 2.1/2.2 upgrade. Please let me know if there is any workaround to proceed with cluster creation.

santhoshvly avatar Sep 27 '24 20:09 santhoshvly

I did publish another version since we last spoke. Can you please review the code at https://github.com/GoogleCloudDataproc/initialization-actions/pull/1240/files? The tests passed on the last commit but took 2 hours and one minute to complete. This latest update should reduce the runtime significantly.

cjac avatar Sep 27 '24 23:09 cjac

I received those messages as well, but they should just be warnings. Does the new change get things working?

cjac avatar Sep 28 '24 17:09 cjac

@cjac I tried the latest script, but the Dataproc initialization action is breaking with a timeout error and the cluster is not starting:

name: "gs://syn-development-kub/syn-cluster-config/install_gpu_driver.sh"
type: INIT_ACTION
state: FAILED
start_time { seconds: 1727708007 nanos: 938000000 }
end_time { seconds: 1727708408 nanos: 209000000 }
error_detail: "Initialization action timed out. Failed action 'gs://syn-development-kub/syn-cluster-config/install_gpu_driver.sh', see output in: gs://syn-development-kub/google-cloud-dataproc-metainfo/20d0767a-6c0a-4eea-a0de-6ba1cc16207a/dataproc-22-gpu-test-691fd61a-a3ec9b72-w-0/dataproc-initialization-script-0_output"
error_code: TASK_FAILED

I couldn't find any error details in the init script output; I am attaching it for your reference. google-cloud-dataproc-metainfo_20d0767a-6c0a-4eea-a0de-6ba1cc16207a_dataproc-22-gpu-test-691fd61a-a3ec9b72-w-0_dataproc-initialization-script-0_output.txt

santhoshvly avatar Sep 30 '24 15:09 santhoshvly

Can you increase your timeout by 5-10 minutes? I do have a fix that's in the works for the base image, and once it gets published, we should be able to skip the full upgrade in the init action.
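For reference, the knob in question is the --initialization-action-timeout flag on gcloud dataproc clusters create; a minimal sketch (cluster name, region, and bucket path below are placeholders, not from this thread):

```shell
# Sketch only: resource names are placeholders. Raising the init-action
# timeout gives install_gpu_driver.sh room to finish the driver build.
gcloud dataproc clusters create my-gpu-cluster \
  --region=us-central1 \
  --image-version=2.2 \
  --initialization-actions=gs://my-bucket/install_gpu_driver.sh \
  --initialization-action-timeout=30m
```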

cjac avatar Sep 30 '24 16:09 cjac

Here is a recent cluster build I did in my repro lab. It took 14m47.946s:

Fri Sep 27 04:49:21 PM PDT 2024
+ gcloud dataproc clusters create cluster-1718310842 --master-boot-disk-type pd-ssd --worker-boot-disk-type pd-ssd --secondary-worker-boot-disk-type pd-ssd --num-masters=1 --num-workers=2 --master-boot-disk-size 100 --worker-boot-disk-size 100 --secondary-worker-boot-disk-size 50 --master-machine-type n1-standard-16 --worker-machine-type n1-standard-16 --master-accelerator type=nvidia-tesla-t4 --worker-accelerator type=nvidia-tesla-t4 --region us-west4 --zone us-west4-a --subnet subnet-cluster-1718310842 --no-address --service-account=sa-cluster-1718310842@cjac-2021-00.iam.gserviceaccount.com --tags=tag-cluster-1718310842 --bucket cjac-dataproc-repro-1718310842 --enable-component-gateway --metadata install-gpu-agent=true --metadata gpu-driver-provider=NVIDIA --metadata public_secret_name=efi-db-pub-key-042 --metadata private_secret_name=efi-db-priv-key-042 --metadata secret_project=cjac-2021-00 --metadata secret_version=1 --metadata modulus_md5sum=d41d8cd98f00b204e9800998ecf8427e --metadata dask-runtime=yarn --metadata bigtable-instance=cjac-bigtable0 --metadata rapids-runtime=SPARK --initialization-actions gs://cjac-dataproc-repro-1718310842/dataproc-initialization-actions/gpu/install_gpu_driver.sh,gs://cjac-dataproc-repro-1718310842/dataproc-initialization-actions/dask/dask.sh,gs://cjac-dataproc-repro-1718310842/dataproc-initialization-actions/rapids/rapids.sh --initialization-action-timeout=90m --metadata bigtable-instance=cjac-bigtable0 --no-shielded-secure-boot --image-version 2.2 --max-idle=8h --scopes https://www.googleapis.com/auth/cloud-platform,sql-admin
Waiting on operation [projects/cjac-2021-00/regions/us-west4/operations/094ca004-2e9f-32f6-94e1-53c8f6799624].
Waiting for cluster creation operation...⠛
WARNING: Consider using Auto Zone rather than selecting a zone manually. See https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/auto-zone
Waiting for cluster creation operation...done.
Created [https://dataproc.googleapis.com/v1/projects/cjac-2021-00/regions/us-west4/clusters/cluster-1718310842] Cluster placed in zone [us-west4-a].

real    14m47.946s
user    0m4.854s
sys     0m0.426s
+ date
Fri Sep 27 05:04:09 PM PDT 2024

cjac avatar Sep 30 '24 18:09 cjac

I see that I hard-coded a regional bucket path into the code. This will slow things down when running outside of us-west4; I'll fix that next.

cjac avatar Sep 30 '24 19:09 cjac

@cjac Adding the timeout fixed the error and the cluster was created. We are able to run GPU workloads on the cluster. Thank you so much for the support!

santhoshvly avatar Oct 01 '24 03:10 santhoshvly

Glad I could help!

I'll work on getting these changes integrated into the base image.

cjac avatar Oct 01 '24 04:10 cjac

@cjac I have created GPU clusters using nvidia-tesla-t4 multiple times and it worked fine. But cluster creation takes too long and fails with the following error when we try to use the nvidia-tesla-p4 GPU type. Do you know whether Dataproc has an issue with this GPU type?

  • is_src_nvidia
  • [[ NVIDIA == \N\V\I\D\I\A ]]
  • prefix=nvidia-current
  • local suffix
  • for suffix in uvm peermem modeset drm
  • echo 'alias nvidia-uvm nvidia-current-uvm'
  • for suffix in uvm peermem modeset drm
  • echo 'alias nvidia-peermem nvidia-current-peermem'
  • for suffix in uvm peermem modeset drm
  • echo 'alias nvidia-modeset nvidia-current-modeset'
  • for suffix in uvm peermem modeset drm
  • echo 'alias nvidia-drm nvidia-current-drm'
  • echo 'alias nvidia nvidia-current'
  • depmod -a
  • modprobe nvidia
    modprobe: ERROR: could not insert 'nvidia_current': No such device

santhoshvly avatar Oct 17 '24 21:10 santhoshvly

thanks for writing, @santhoshvly

That is correct: the P4[1] GPUs are no longer supported[4], since the kernel requires GPL licensing and the older drivers were not released under that license. I wish I could help.

Please try the T4[2] or L4[3] for similar cost for performance. I run my tests on n1-standard-4 with a single T4 for each master and worker node, and burst up to H100 for some tests.

[1] https://github.com/NVIDIA/open-gpu-kernel-modules/issues/19 [2] https://cloud.google.com/compute/docs/gpus#t4-gpus [3] https://cloud.google.com/compute/docs/gpus#l4-gpus [4] https://forums.developer.nvidia.com/t/390-154-driver-no-longer-works-with-kernel-6-0/230959

cjac avatar Oct 17 '24 23:10 cjac

It may be possible to build kernel drivers from an image released before 2023, but that's not a great long-term solution, and I have not confirmed that it would work. Can you move off of the P4 hardware?

cjac avatar Oct 18 '24 01:10 cjac

Can you increase your timeout by 5-10 minutes? I do have a fix that's in the works for the base image, and once it gets published, we should be able to skip the full upgrade in the init action.

I removed the full upgrade from the init action. We need to unhold the systemd-related packages for the installation of pciutils to succeed. I included the patch to move off of /etc/apt/trusted.gpg to files under /usr/share/keyrings/, referenced directly inline from the files in /etc/apt/sources.list.d/*.list.

cjac avatar Oct 18 '24 01:10 cjac

The code to clean up the apt gpg trust databases and unhold systemd went into #1240 and #1242.

I spoke with engineering, and they do not feel comfortable unholding the package while their builder is executing. They recommended that we unhold in any init action that would fail with the hold in place. I have placed the hold again after the package installation in the init script.
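The unhold-install-rehold sequence described above might look roughly like this (a sketch under assumptions: the exact set of held packages comes from the image, and apt-mark is my assumed mechanism, not a quote of the PR):

```shell
# Sketch only, to be run as root on a node: release the image's package holds
# just long enough for apt to resolve dependencies for pciutils and the
# kernel headers, then restore them so the builder's expectations still hold.
held="$(apt-mark showhold)"                 # record what the image holds
if [ -n "${held}" ]; then
  apt-mark unhold ${held}                   # release the holds temporarily
fi
apt-get install -y -qq pciutils "linux-headers-$(uname -r)"
if [ -n "${held}" ]; then
  apt-mark hold ${held}                     # re-apply the holds afterwards
fi
```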

cjac avatar Oct 18 '24 02:10 cjac

@cjac Thank you so much for sharing the details. Yes, we can move off of the P4 hardware, but the documentation has still not been updated and is confusing users: https://cloud.google.com/compute/docs/gpus#p4-gpus

santhoshvly avatar Oct 18 '24 15:10 santhoshvly

Thanks for the information. There may be other GCE use cases where drivers are pre-built; in those cases, P4 may still be supported. I've opened an internal issue to track an update to the documentation.

cjac avatar Oct 18 '24 16:10 cjac

Hello @santhoshvly, is your use case working with the latest changes?

cjac avatar Dec 02 '24 18:12 cjac