[gpu] Install NVIDIA Container Toolkit #1025
Continuation of #1025
TODO:
- Ensure that the DOCKER optional component configures everything so that it can be used with YARN containers launched via nvidia-docker from the container toolkit.
- Develop a working example of creating a cluster with the DOCKER optional component that can launch PySpark jobs which run to successful completion.

Create the cluster:
```bash
time gcloud dataproc clusters create ${CLUSTER_NAME} \
  --optional-components DOCKER \
  --scopes 'https://www.googleapis.com/auth/cloud-platform'
```
Launch the job:

```bash
gsutil cp test.py gs://${BUCKET}/
gcloud dataproc jobs submit pyspark \
  --properties="spark.executor.resource.gpu.amount=1,spark.task.resource.gpu.amount=1,spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
  --cluster=${CLUSTER_NAME} \
  --region=${REGION} gs://${BUCKET}/test.py
```
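When a submitted job fails to see the GPU, one early thing worth verifying on the workers is that nvidia-container-toolkit actually registered its runtime with Docker. A minimal check sketch (it degrades to a warning when Docker itself is absent):

```shell
# Sketch: check whether Docker reports an "nvidia" runtime, which
# nvidia-container-toolkit registers via /etc/docker/daemon.json.
nvidia_runtime_registered() {
  command -v docker >/dev/null 2>&1 || return 1
  docker info --format '{{range $k, $v := .Runtimes}}{{$k}} {{end}}' 2>/dev/null \
    | grep -qw nvidia
}

if nvidia_runtime_registered; then
  echo "nvidia runtime registered with Docker"
else
  echo "nvidia runtime not found (is nvidia-container-toolkit installed?)"
fi
```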
test.py:

```python
# Copyright 2022,2023 Google LLC and contributors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("torch - tensorflow").getOrCreate()

import torch

print("get CUDA details : == : ")
use_cuda = torch.cuda.is_available()
if use_cuda:
    print('__CUDNN VERSION:', torch.backends.cudnn.version())
    print('__Number CUDA Devices:', torch.cuda.device_count())
    print('__CUDA Device Name:', torch.cuda.get_device_name(0))
    print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(0).total_memory / 1e9)

import tensorflow as tf

print("Get GPU Details : ")
print(tf.test.is_gpu_available())
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")

gpu_available = tf.test.is_gpu_available()
print("gpu_available : " + str(gpu_available))
is_cuda_gpu_available = tf.test.is_gpu_available(cuda_only=True)
print("is_cuda_gpu_available : " + str(is_cuda_gpu_available))
is_cuda_gpu_min_3 = tf.test.is_gpu_available(True, (3, 0))
print("is_cuda_gpu_min_3 : " + str(is_cuda_gpu_min_3))

from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

print("Run GPU Functions Below : ")
print(get_available_gpus())
```
- Patch install_gpu_driver.sh to check whether the DOCKER optional component has been enabled and, if so, trigger the installation and testing of the NVIDIA Container Toolkit.
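The component check could look something like the following sketch. The property file path and key name are assumptions modeled on Dataproc's `/etc/google-dataproc/dataproc.properties`; they would need to be confirmed against an actual cluster before landing:

```shell
# Hypothetical helper for install_gpu_driver.sh: returns success when the
# DOCKER optional component appears in the activated-components property.
# Path and property key are assumptions, not confirmed Dataproc internals.
docker_component_enabled() {
  local properties_file="${1:-/etc/google-dataproc/dataproc.properties}"
  [[ -f "${properties_file}" ]] || return 1
  grep -qiE '^dataproc\.components\.activate=.*docker' "${properties_file}"
}

# In the installer this would gate the toolkit install, e.g.:
# if docker_component_enabled; then install_nvidia_container_toolkit; fi
```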
hmmm... maybe we should patch in a metadata argument to the installer: `driver-only`. If set to true, do not install CUDA or any of the other analytics infrastructure on the worker itself; these will be assumed to be installed in the container in which the workload executes.
Clusters created with this argument will not be able to run hardware-accelerated workloads directly on the workers. Jobs that expect hardware acceleration will need to install the libraries themselves or, better yet, execute in a container.
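If we go that route, the flag could be read from the GCE metadata server at startup. A minimal sketch, where `get_metadata_attribute` is a hypothetical helper and `driver-only` is the attribute name proposed above:

```shell
# Sketch only: reads the proposed driver-only flag from instance metadata,
# defaulting to "false" when the attribute is absent or the metadata
# server is unreachable (e.g. when testing off-GCE).
get_metadata_attribute() {
  local attribute="$1" default="${2:-}"
  curl -fsS -m 2 -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/${attribute}" \
    2>/dev/null || echo -n "${default}"
}

if [[ "$(get_metadata_attribute driver-only false)" == "true" ]]; then
  echo "driver-only=true: installing the driver only; skipping CUDA and analytics libraries"
else
  echo "driver-only unset/false: installing the full GPU stack on the worker"
fi
```

The flag itself would be supplied at cluster creation, e.g. `--metadata driver-only=true`.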
/gcbrun
Inclusion of nvidia-container-toolkit went in with change #1190