[gpu] Install NVIDIA Container Toolkit #1025
Continuation of #1025
TODO:
- Ensure that the DOCKER optional component configures everything so that it can be used with YARN containers launched via nvidia-docker from the container toolkit.
- Develop a working example of creating a cluster with the DOCKER optional component that can launch PySpark jobs which run to successful completion.

Create the cluster:
```bash
time gcloud dataproc clusters create ${CLUSTER_NAME} \
  --optional-components DOCKER \
  --scopes 'https://www.googleapis.com/auth/cloud-platform'
```
Launch the job:

```bash
gsutil cp test.py gs://${BUCKET}/
gcloud dataproc jobs submit pyspark \
  --properties="spark.executor.resource.gpu.amount=1,spark.task.resource.gpu.amount=1,spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
  --cluster=${CLUSTER_NAME} \
  --region=${REGION} gs://${BUCKET}/test.py
```
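When a submitted job fails to see the GPU, one early thing worth verifying on the workers is that nvidia-container-toolkit actually registered its runtime with Docker. A minimal check sketch (it degrades to a warning when Docker itself is absent):

```shell
# Sketch: check whether Docker reports an "nvidia" runtime, which
# nvidia-container-toolkit registers via /etc/docker/daemon.json.
nvidia_runtime_registered() {
  command -v docker >/dev/null 2>&1 || return 1
  docker info --format '{{range $k, $v := .Runtimes}}{{$k}} {{end}}' 2>/dev/null \
    | grep -qw nvidia
}

if nvidia_runtime_registered; then
  echo "nvidia runtime registered with Docker"
else
  echo "nvidia runtime not found (is nvidia-container-toolkit installed?)"
fi
```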
test.py:

```python
# Copyright 2022,2023 Google LLC and contributors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("torch - tensorflow").getOrCreate()

import torch

print("get CUDA details : == : ")
use_cuda = torch.cuda.is_available()
if use_cuda:
    print('__CUDNN VERSION:', torch.backends.cudnn.version())
    print('__Number CUDA Devices:', torch.cuda.device_count())
    print('__CUDA Device Name:', torch.cuda.get_device_name(0))
    print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(0).total_memory / 1e9)

import tensorflow as tf

print("Get GPU Details : ")
print(tf.test.is_gpu_available())
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")

gpu_available = tf.test.is_gpu_available()
print("gpu_available : " + str(gpu_available))
is_cuda_gpu_available = tf.test.is_gpu_available(cuda_only=True)
print("is_cuda_gpu_available : " + str(is_cuda_gpu_available))
is_cuda_gpu_min_3 = tf.test.is_gpu_available(True, (3, 0))
print("is_cuda_gpu_min_3 : " + str(is_cuda_gpu_min_3))

from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

print("Run GPU Functions Below : ")
print(get_available_gpus())
```
- Patch install_gpu_driver.sh to check whether the DOCKER optional component has been enabled and, if so, trigger the installation and testing of the NVIDIA Container Toolkit.
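The component check could look something like the following sketch. The property file path and key name are assumptions modeled on Dataproc's `/etc/google-dataproc/dataproc.properties`; they would need to be confirmed against an actual cluster before landing:

```shell
# Hypothetical helper for install_gpu_driver.sh: returns success when the
# DOCKER optional component appears in the activated-components property.
# Path and property key are assumptions, not confirmed Dataproc internals.
docker_component_enabled() {
  local properties_file="${1:-/etc/google-dataproc/dataproc.properties}"
  [[ -f "${properties_file}" ]] || return 1
  grep -qiE '^dataproc\.components\.activate=.*docker' "${properties_file}"
}

# In the installer this would gate the toolkit install, e.g.:
# if docker_component_enabled; then install_nvidia_container_toolkit; fi
```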
hmmm... maybe we should patch in a metadata argument to the installer: `driver-only`. If set to true, do not install CUDA or any of the other analytics infrastructure on the worker itself; these will be assumed to be installed in the container in which the workload executes.
Clusters created with this argument will not be able to run hardware-accelerated workloads directly on the workers. Jobs that expect hardware acceleration will need to install the libraries themselves or, better yet, execute in a container.
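If we go that route, the flag could be read from the GCE metadata server at startup. A minimal sketch, where `get_metadata_attribute` is a hypothetical helper and `driver-only` is the attribute name proposed above:

```shell
# Sketch only: reads the proposed driver-only flag from instance metadata,
# defaulting to "false" when the attribute is absent or the metadata
# server is unreachable (e.g. when testing off-GCE).
get_metadata_attribute() {
  local attribute="$1" default="${2:-}"
  curl -fsS -m 2 -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/${attribute}" \
    2>/dev/null || echo -n "${default}"
}

if [[ "$(get_metadata_attribute driver-only false)" == "true" ]]; then
  echo "driver-only=true: installing the driver only; skipping CUDA and analytics libraries"
else
  echo "driver-only unset/false: installing the full GPU stack on the worker"
fi
```

The flag itself would be supplied at cluster creation, e.g. `--metadata driver-only=true`.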
/gcbrun
Inclusion of nvidia-container-toolkit went in with change #1190