Unable to recognize GPU address - Spark Distributor Tensorflow
I currently have a local Spark 3.0 cluster consisting of 3 machines. Two machines have 2 NVIDIA GPUs, and one machine is the Spark client/master, which has no NVIDIA GPU. When I start the cluster, the dashboard shows the GPUs recognized as resources. I'm trying to run the example posted on the Spark Distributor Tensorflow page. I create the Spark context like this:
import pyspark

sc = pyspark.SparkContext(master="spark://192.168.1.113:7077",
                          appName="Spark GPU")
After that, I can see that the executors come up with the GPUs listed as resources.
However, when I run the following:
MirroredStrategyRunner(num_slots=8).run(train)
It fails with the following error:
raise ValueError(f'Found GPU addresses {addresses} which '
ValueError: Found GPU addresses [''] which are not all in the correct format for CUDA_VISIBLE_DEVICES, which requires integers with no zero padding.
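To make the complaint in that message concrete, here is a minimal sketch of the kind of format check it describes (non-negative integers with no zero padding). This is only my illustration, not the library's actual code; the helper names `is_valid_cuda_address` and `invalid_addresses` are hypothetical.

```python
import re

# Hypothetical helpers for illustration only; not taken from the library.
# A valid entry for CUDA_VISIBLE_DEVICES is assumed here to be "0" or an
# integer that does not start with a zero (no zero padding).
_CUDA_INDEX = re.compile(r"0|[1-9][0-9]*")


def is_valid_cuda_address(address: str) -> bool:
    """Return True if the whole string is an unpadded non-negative integer."""
    return _CUDA_INDEX.fullmatch(address) is not None


def invalid_addresses(addresses):
    """Return the addresses that would fail the format check."""
    return [a for a in addresses if not is_valid_cuda_address(a)]


print(invalid_addresses(["0", "1"]))   # []
print(invalid_addresses([""]))         # ['']
print(invalid_addresses(["01", "3"]))  # ['01']
```

Under this reading, the empty string in `['']` is what gets rejected, so it looks like the task received an empty GPU address rather than an index such as `0` or `1`.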
I'm not sure why it wasn't able to detect the GPUs on the remote machines.