Unable to recognize GPU address - Spark Distributor Tensorflow

Open ghost opened this issue 4 years ago • 0 comments

I have a local Spark 3.0 cluster consisting of three machines. Two of the machines have 2 NVIDIA GPUs, and the third is the Spark client/master, which has no NVIDIA GPU. When I start the cluster, the dashboard shows the GPUs as resources. I'm trying to run the example posted on the Spark Distributor Tensorflow page. When I create a Spark context:

import pyspark

sc = pyspark.SparkContext(master="spark://192.168.1.113:7077",
                          appName="Spark GPU")

I can see that the GPUs are picked up as executor resources.

However, when I run the following:

MirroredStrategyRunner(num_slots=8).run(train)

It results in the following errors:

raise ValueError(f'Found GPU addresses {addresses} which '
ValueError: Found GPU addresses [''] which are not all in the correct format for CUDA_VISIBLE_DEVICES, which requires integers with no zero padding.
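
To make the message concrete, here is a minimal sketch of the kind of check it describes: every address must be an integer with no zero padding, so the empty string in `['']` fails. The helper name `invalid_gpu_addresses` and the regular expression are illustrative assumptions, not the library's actual code.

```python
import re

# Hypothetical pattern (an assumption, not taken from the library):
# "0" or a positive integer without leading zeros.
_ADDRESS = re.compile(r"0|[1-9][0-9]*")


def invalid_gpu_addresses(addresses):
    """Return the addresses that are not plain, unpadded integers."""
    return [a for a in addresses if not _ADDRESS.fullmatch(a)]


print(invalid_gpu_addresses(["0", "1"]))  # well-formed addresses
print(invalid_gpu_addresses([""]))        # the value reported in the error
```

An empty address list entry like the one above suggests the task was handed a blank GPU address rather than a malformed number.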

I'm not sure why it wasn't able to detect the GPUs on the remote machines.
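
One thing that may be worth checking is whether the application itself requests GPUs, since the context above sets only the master and the app name. The following is a sketch under assumptions, not a confirmed fix: the configuration keys are Spark 3.0's GPU scheduling settings, while the amounts, the discovery script path, and the `build_conf` helper are hypothetical.

```python
# Sketch only: Spark 3.0 GPU scheduling keys. The values are assumptions.
gpu_conf = {
    "spark.executor.resource.gpu.amount": "2",  # assumed: 2 GPUs per executor
    "spark.task.resource.gpu.amount": "1",      # assumed: 1 GPU per task
    # Hypothetical path; point this at the discovery script on your workers.
    "spark.executor.resource.gpu.discoveryScript": "/opt/spark/getGpusResources.sh",
}


def build_conf(master, app_name, extra):
    """Return (key, value) pairs suitable for SparkConf.setAll."""
    pairs = [("spark.master", master), ("spark.app.name", app_name)]
    pairs.extend(sorted(extra.items()))
    return pairs


pairs = build_conf("spark://192.168.1.113:7077", "Spark GPU", gpu_conf)

# With pyspark installed, the pairs could be applied like this:
#   conf = pyspark.SparkConf().setAll(pairs)
#   sc = pyspark.SparkContext(conf=conf)
```

Keeping the settings in a plain dictionary makes it easy to print and compare them against what the Spark UI reports for the running application.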

ghost • Sep 22 '20 11:09